Friday, August 22, 2014

The case of the missing Citibikes

The example of my friend and colleague Ben, with his amazing I Quant NY blog, has motivated me to try my hand at some open data hacking. Ben's written several posts where he analyzes Citibike bike share data. Citibike has made all of their trip data through the end of May 2014 available for free download. I'm a huge fan of New York's bike share program and of their open data policy.

Ben has analyzed trips and stations, but I have a different question: how many of New York's shared bikes have been stolen or lost?

The New York Post reports that bikes are routinely stolen from Manhattan stations and ridden to underserved parts of Brooklyn and Queens. I'm not too concerned about these bikes: they're recovered quickly, and Citibike may wish to treat the bikes' eventual destinations as a kind of desire line. Clearly Crown Heights residents can't wait for the program to expand to their neighborhood.

In July, the 109th precinct proudly reported that their detectives had detected a 68-year-old man riding a Citibike which had been liberated and repainted. The suspected thief was detained and his ride confiscated. That's what I'm looking for!

So I got myself the trip data, put together a quick-and-dirty Python script, and identified the first and last trip for each bike in the system. I presume that if a bike is stolen or destroyed it will disappear from the trip data, so if a bike hasn't been ridden in some time, it has likely gone AWOL.
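Here is a minimal sketch of that first-and-last-trip pass. It is not the original script. It assumes each CSV row has `bikeid`, `starttime` and `stoptime` columns and that timestamps look like `2013-07-01 08:00:00`; the function names are made up for illustration.

```python
import csv
from datetime import datetime

TIME_FORMAT = "%Y-%m-%d %H:%M:%S"  # assumed timestamp layout


def read_trips(paths):
    """Yield one dict per trip from each CSV file in turn."""
    for path in paths:
        with open(path, newline="") as handle:
            yield from csv.DictReader(handle)


def first_and_last_trips(trips):
    """Map each bikeid to (start of its first trip, end of its last trip)."""
    spans = {}
    for trip in trips:
        start = datetime.strptime(trip["starttime"], TIME_FORMAT)
        stop = datetime.strptime(trip["stoptime"], TIME_FORMAT)
        # A bike seen for the first time starts with this trip as both ends.
        first, last = spans.get(trip["bikeid"], (start, stop))
        spans[trip["bikeid"]] = (min(first, start), max(last, stop))
    return spans
```

Calling `first_and_last_trips(read_trips(paths))`, where `paths` lists the downloaded monthly files, streams the data one row at a time, so only one small entry per bike is held in memory.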

Note that the bikeid field in the data doesn't appear to match the number stenciled on the bike's frame. It could correspond to the electronic identifier (probably an RFID tag) which the stations use to identify bikes. If that's the case, missing trip data could simply indicate that the electronics were damaged and replaced.

There are 6,943 unique bikeid values in the trip data. This is roughly consistent with a New York Times story, published when the program launched, reporting 6,000 bikes in the system.

If we sort the bikes by their final trip, we can quickly get an estimate of losses.

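| Month | Final rides | Month | Final rides |
| --- | --- | --- | --- |
| 2013-07 | 18 | 2014-01 | 26 |
| 2013-08 | 36 | 2014-02 | 60 |
| 2013-09 | 17 | 2014-03 | 68 |
| 2013-10 | 31 | 2014-04 | 428 |
| 2013-11 | 41 | 2014-05 | 6186 |
| 2013-12 | 32 | | |
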
The vast majority of the bikes showed activity in May 2014, meaning that they weren't stolen or lost. Before April, each month saw between 17 and 68 final rides, averaging roughly 36.6 per month (329 bikes over nine months).
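The tally and the summary figures can be reproduced with a few lines of Python. This is a sketch under the assumption that `last_seen` maps each bike to the `datetime` of its final trip; the function names are hypothetical, and the `table` dictionary simply restates the counts listed above.

```python
from collections import Counter
from datetime import datetime


def final_rides_by_month(last_seen):
    """Count bikes by the YYYY-MM of their last recorded trip."""
    return Counter(when.strftime("%Y-%m") for when in last_seen.values())


def summarize(counts, cutoff="2014-04"):
    """Return (min, max, mean) of the monthly counts before the cutoff month."""
    early = [n for month, n in counts.items() if month < cutoff]
    return min(early), max(early), sum(early) / len(early)


# Final rides per month, copied from the table above.
table = {
    "2013-07": 18, "2013-08": 36, "2013-09": 17, "2013-10": 31,
    "2013-11": 41, "2013-12": 32, "2014-01": 26, "2014-02": 60,
    "2014-03": 68, "2014-04": 428, "2014-05": 6186,
}

low, high, mean = summarize(table)
print(sum(table.values()), low, high, round(mean, 1))
```

Because the months are written as `YYYY-MM`, plain string comparison puts them in calendar order, which is all the cutoff test relies on.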

At first glance, April appears to have been a disastrous month for Citibike thefts. A more likely explanation is that those bikes were removed for maintenance. If Citibike keeps 300-400 bikes in their warehouse for routine tuneups, and if it takes two months for the bikes to rotate back out into service, that could easily explain most of the 428 bikes which were ridden in April but idle in May. We would expect most of them to return to service in June.

February and March also saw higher than average losses. Perhaps bike thieves are more active in those months, but this may be better explained by the unusually snowy winter. Plows, for example, may have taken a toll on the fleet.

Citibike can, at least in theory, bill a rider $1,200 for failing to return a bike. If they collected this fee for each of the 329 bikes which went missing before April, they'd have raised nearly $400,000. However, I've yet to hear of any rider receiving such a bill.

Assuming that these final trips do represent theft and loss, approximately half of 1% of Citibikes are lost each month, or about 6% every year. That's far better than the reputed 80% of Paris's Vélib' bikes which were stolen in that system's first year!
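For anyone who wants to check the back-of-envelope numbers, here is the arithmetic behind the fee total and the loss rates. It uses only figures already given in the post; the variable names are mine, and the yearly rate is a simple twelve-times-monthly estimate rather than a compounded one.

```python
REPLACEMENT_FEE = 1200  # dollars, the fee quoted above
FLEET_SIZE = 6943       # unique bike ids in the trip data
# Final rides per month, July 2013 through March 2014, from the table.
LOST_BEFORE_APRIL = [18, 36, 17, 31, 41, 32, 26, 60, 68]

missing = sum(LOST_BEFORE_APRIL)
potential_fees = missing * REPLACEMENT_FEE
monthly_rate = missing / len(LOST_BEFORE_APRIL) / FLEET_SIZE
annual_rate = monthly_rate * 12  # simple, non-compounded estimate

print(f"{missing} bikes, ${potential_fees:,} in fees")
print(f"{monthly_rate:.2%} per month, {annual_rate:.1%} per year")
```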

Update: the original version of the table showed 116 trips which ended in June. This is because there were a handful of trips which started on May 31 but finished after midnight, and were thus credited to the next month. To make it less confusing, I've merged these final rides into the data for May.
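One way to do that merge, sketched here as an assumption rather than as the code actually used: label each final ride by the month it ended in, but never later than the last month the data set covers. The function and constant names are hypothetical.

```python
from datetime import datetime

LAST_FULL_MONTH = "2014-05"  # last month covered by the data set


def month_of_final_ride(stop_time, last_month=LAST_FULL_MONTH):
    """Label a final ride by month, folding any later spillover into last_month."""
    # YYYY-MM strings sort chronologically, so min() caps the label.
    return min(stop_time.strftime("%Y-%m"), last_month)
```

A trip that started late on May 31 and ended just after midnight is then counted under 2014-05 instead of creating a stray June row.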

4 comments:

  1. What on earth happened in 2014-05? The table shows 6070... can't be right...
  2. 2014-05 is the last month for which we have full data, so those 6070 bikes were all active.
  3. Then why do you have June data? Or do we need July data before May "stabilizes"?
    1. The May data includes some trips that ended on June 1 (presumably they started in May). Perhaps I should have sanitized these to end at midnight on May 31, instead.