Data shows Citi Bike Success Hangs in the Rebalancing

Posted on Oct 26, 2021


Data Science Background


Defining the Problem

Data shows that Citi Bike is one of the largest bike-sharing programs in the United States. It has over 1,500 stations and 25,000 bikes servicing Manhattan, Brooklyn, Queens, the Bronx, Jersey City and Hoboken. One of the largest problems that Citi Bike has is pooling (“riders return bikes but don’t unlock them”) and draining (“riders unlock bikes but don’t return them”) of bikes in specific high traffic areas.

This renders those bikes useless. If you go to a station that is completely empty, you cannot access a bike. A completely full station is also a problem because it blocks you from returning your bike there. Citi Bike puts a lot of resources into clearing bikes from full stations and restocking empty ones, either by employing people to shuttle bikes from full to empty stations or by incentivizing riders to do it.

Data Challenges

It is important for Citi Bike to know in advance what its bike traffic will look like and which stations are likely to be full or empty, so that it can plan which stations will need to be rebalanced. Citi Bike can also use this information to plan future expansions, choosing station layouts that minimize extreme surpluses or shortages of bikes at any given station.

Data Sets

Citi Bike provides a monthly report of each trip that takes place between any of its stations. That report includes trip duration, start time, end time, the station where the trip started, and the station where the trip ended. For the purposes of this analysis, I only looked at the trip information from March and April of 2021 in New York City, which included over 3.5 million individual trips. Also included in this analysis is geographic information provided by NYC Open Data. This geographic data included zip code level maps of New York City, subway station maps, subway line maps, and the locations of hospitals and universities.

Data Cleaning

The data provided latitude and longitude information for each station, but to make aggregation easier, each station's latitude and longitude were mapped to its zip code. Trips of extreme length were removed because these were generally trips where the bikes were lost or stolen. Trips that started in New York City and ended in New Jersey, or vice versa, were also excluded. These trips tended to take place near ferries.
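The cleaning steps above could be sketched roughly as follows. This is a minimal illustration and not the code used for this analysis: the column names `started_at` and `ended_at`, the hypothetical `start_state` and `end_state` columns (assumed to have been derived from the zip code mapping), and the 24-hour cutoff for an "extreme" trip are all assumptions, since the post does not state the actual threshold.

```python
import pandas as pd


def clean_trips(trips: pd.DataFrame, max_hours: float = 24.0) -> pd.DataFrame:
    """Drop extreme-length trips and trips that cross between NY and NJ.

    max_hours is an assumed cutoff; the original analysis does not state one.
    """
    trips = trips.copy()
    trips["started_at"] = pd.to_datetime(trips["started_at"])
    trips["ended_at"] = pd.to_datetime(trips["ended_at"])

    # Trip duration in hours, computed from the two timestamps
    duration_hours = (
        trips["ended_at"] - trips["started_at"]
    ).dt.total_seconds() / 3600

    # Keep trips with a positive duration no longer than the cutoff
    reasonable = (duration_hours > 0) & (duration_hours <= max_hours)

    # Keep trips that start and end in the same state
    # (start_state / end_state are hypothetical, pre-derived columns)
    same_state = trips["start_state"] == trips["end_state"]

    return trips[reasonable & same_state].reset_index(drop=True)


# Tiny made-up example: one normal trip, one 3-day trip, one NY-to-NJ trip
example = pd.DataFrame({
    "started_at": ["2021-03-01 08:00", "2021-03-01 09:00", "2021-03-01 10:00"],
    "ended_at": ["2021-03-01 08:20", "2021-03-04 09:00", "2021-03-01 10:45"],
    "start_state": ["NY", "NY", "NY"],
    "end_state": ["NY", "NY", "NJ"],
})
cleaned = clean_trips(example)
```

In this made-up example only the first trip survives: the second is dropped for its length and the third for crossing state lines.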

Exploratory Data Analysis

The largest amount of Citi Bike traffic is in lower Manhattan. The top five zip codes (10009, 10003, 10011, 10019, and 10002) each had around 300,000 trips that either started or ended in that zip code. Four of the top five most trafficked zip codes are in the Lower East Side or Chelsea area. The one exception is near Midtown West, which has a lot of tourist traffic from Times Square and Central Park. None of the top 10 zip codes by traffic were in Brooklyn, Harlem, or Queens. This makes sense, as Citi Bike started in Manhattan and then expanded into the other boroughs.

Figure 1. Lower Manhattan, specifically the Lower East Side, has the highest total Citi Bike trips.

I wanted to look at the times when individual stations have the most extreme surpluses or shortages of bikes. To do this, I looked at when individual stations gained or lost three or more bikes within an hour. This did not happen the vast majority of the time. Over ninety-nine percent of the time a station was in good balance, gaining or losing fewer than three bikes in an hour. It is important to look at the less common periods of extreme draining or pooling because that is when there is the greatest need to fix the problems of empty or full stations.
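One way to compute this kind of hourly net change is to count arrivals and departures per station per clock hour and subtract. The sketch below is an illustration under assumptions, not the original code: the column names (`start_station_id`, `end_station_id`, `started_at`, `ended_at`) are assumed, and the threshold of three bikes follows the description above.

```python
import pandas as pd


def hourly_net_change(trips: pd.DataFrame) -> pd.DataFrame:
    """Net bikes gained (+) or lost (-) per station per clock hour."""
    # Bikes leaving each station, bucketed by the hour the trip started
    departures = (
        trips.assign(hour=trips["started_at"].dt.floor("h"))
        .groupby(["start_station_id", "hour"]).size()
        .rename_axis(["station_id", "hour"])
    )
    # Bikes arriving at each station, bucketed by the hour the trip ended
    arrivals = (
        trips.assign(hour=trips["ended_at"].dt.floor("h"))
        .groupby(["end_station_id", "hour"]).size()
        .rename_axis(["station_id", "hour"])
    )
    # Align on (station, hour); a missing count on either side is treated as 0
    net = arrivals.sub(departures, fill_value=0).rename("net_change")
    return net.reset_index()


# Made-up trips: four bikes go from A to B around noon, one leaves B for C
example_trips = pd.DataFrame({
    "start_station_id": ["A", "A", "A", "A", "B"],
    "end_station_id": ["B", "B", "B", "B", "C"],
    "started_at": pd.to_datetime([
        "2021-03-01 12:00", "2021-03-01 12:03",
        "2021-03-01 12:05", "2021-03-01 12:10", "2021-03-01 12:40",
    ]),
    "ended_at": pd.to_datetime([
        "2021-03-01 12:20", "2021-03-01 12:22",
        "2021-03-01 12:25", "2021-03-01 12:30", "2021-03-01 13:05",
    ]),
})

net = hourly_net_change(example_trips)

# Station-hours where three or more bikes were gained or lost
extreme = net[net["net_change"].abs() >= 3]
```

Note that station-hours with no trips at all never appear in this table, so any "percent of the time in balance" figure would need those quiet hours added back in as zeros.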

Figure 2 shows that surpluses and shortages of bikes happen simultaneously. This makes sense because increased movement of bikes from one location to another creates surpluses in one area and shortages in another. The biggest spike in bike surpluses happens around noon, when a lot of bikes are arriving at individual stations. That can be explained by people congregating at work or school earlier in the day. The major spike in bike shortages happens around 10 p.m. each night. That again can be explained by people leaving a shared location, whether that is school, work, or some sort of nightlife, to return home.

Both peaks came later than I was expecting. I expected people's commutes to start around 9 a.m. and their return trips home to be around 5 p.m. One explanation is that the people who predominantly use Citi Bike as a means of transportation might not have a conventional nine-to-five work schedule. They might be students, healthcare workers, or people working nights. Another explanation is that this is an artifact of the sample. The trip data only comes from March and April of 2021, when some Covid protocols were still in effect, so that might have affected when people were using transportation.


Figure 2. Extreme surpluses and shortages of bikes spike at noon and 10 p.m. Extreme surpluses are skewed towards noon; extreme shortages are skewed towards 10 p.m.


It is important to look at both the times and the locations of how people move together. In Figure 3, we can see the same top 10 zip codes originally shown in Figure 1. Figure 3 allows us to see when people are leaving each zip code and when people are coming back to those zip codes. The line graph on the left shows the net change in total bikes for all stations in a zip code. The overall shape is similar to what we saw in Figure 2, where the majority of traffic happens around noon and 10 p.m. The majority of the zip codes that have a surplus or net gain of bikes at noon also had a shortage or net loss of bikes at 10 p.m., and vice versa. This could be because some of these neighborhoods are places where people predominantly live while others are places where people predominantly work.
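A zip code level profile like the one described above can be built by summing station-level net changes within each zip code and then summarizing by hour of day. The following is a sketch under assumptions: the input is a hypothetical table with `station_id`, `hour`, and `net_change` columns, the station-to-zip lookup is made up, and averaging across days is my assumption about how the hourly curve is summarized.

```python
import pandas as pd


def zip_hourly_profile(station_net: pd.DataFrame, station_zip: dict) -> pd.DataFrame:
    """Average net bike change per zip code for each hour of the day."""
    df = station_net.assign(
        zip_code=station_net["station_id"].map(station_zip),
        date=station_net["hour"].dt.date,
        hour_of_day=station_net["hour"].dt.hour,
    )
    # Total net change across all stations in a zip code, per date and hour
    daily = df.groupby(["zip_code", "date", "hour_of_day"])["net_change"].sum()
    # Average those daily totals across days (an assumed summary choice)
    profile = daily.groupby(level=["zip_code", "hour_of_day"]).mean()
    # One row per zip code, one column per hour of day
    return profile.unstack("hour_of_day")


# Made-up station-level net changes and a made-up station-to-zip lookup
example_net = pd.DataFrame({
    "station_id": ["S1", "S2", "S1", "S3"],
    "hour": pd.to_datetime([
        "2021-03-01 12:00", "2021-03-01 12:00",
        "2021-03-02 12:00", "2021-03-01 12:00",
    ]),
    "net_change": [-3, -1, -2, 4],
})
example_zip = {"S1": "10009", "S2": "10009", "S3": "10016"}

profile = zip_hourly_profile(example_net, example_zip)
```

In this toy data, zip code 10009 loses bikes at noon on both days while 10016 gains them, which is the kind of mirrored pattern between residential and work areas discussed here.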


Figure 3. The majority of bike traffic happens around noon and 10 p.m. Zip codes where people predominantly work gain bikes at noon and lose bikes at 10 p.m. Zip codes where people predominantly live lose bikes at noon and gain bikes at 10 p.m. This follows the commuting traffic pattern of leaving for work and coming back to home.


I want to draw your attention to the zip code that had the most extreme bike loss at noon and the most extreme bike gain at 10 p.m. Zip code 10009 in the East Village will serve as an exemplar of a neighborhood where people predominantly leave at noon to go to work and return starting around 9 p.m., peaking at 10 p.m. and causing a surplus of bikes.


Figure 4. The East Village has the largest loss of bikes at noon and the largest gain of bikes at 10 p.m. when compared to all other zip codes.


The East Village's net bike changes raise the question: How are individual bike stations in the East Village handling the huge outflow of bikes around noon and the huge influx around 10 p.m.? My expectation was that because the East Village had the largest net change in bikes, it would also have the stations with the largest pooling and draining of bikes. But we can see from Figure 5 that the East Village is handling the traffic at noon and 10 p.m. very well. The East Village has so many stations spread across the area of demand that no single station experiences a huge net change in bikes.

The most worrisome station is in Kips Bay, just north of the East Village. We can see a single station that at noon gained on average more than 12.5 bikes and at 10 p.m. lost between 7.5 and 12.5 bikes. This bike station is located along 2nd Avenue next to a street with a bike lane. It is also one of the few bike stations located next to the hospital complex.


Figure 5. Bike station demand is well dispersed across the East Village at noon and 10 p.m.


To look at a station with some of the most extreme surpluses and shortages of bikes, we now turn to the Upper East Side. This neighborhood, while not in the top 10 zip codes in terms of total trips, has some of the most extreme behavior at a single station. Figure 6 shows a bike station near a hospital complex that on average gains more than 12.5 bikes at noon and loses more than 12.5 bikes at 10 p.m. This is a station that would be quickly overwhelmed, rendering it useless unless Citi Bike uses resources to immediately shuttle bikes to and from nearby stations.

Hospitals make sense as places that would see large volumes of traffic at centralized locations. A large number of people work at hospitals on very similar schedules.


Figure 6. Bike station demand in the Upper East Side is concentrated around a hospital complex at noon and 10 p.m.


Discussion and Conclusion

This analysis suggests a few things that Citi Bike can focus on to rebalance efficiently. First, Citi Bike should focus its rebalancing on the peak hours around noon and 10 p.m. Second, it should look at the existing high traffic areas and find strategies to disperse that traffic across many stations in the area. The East Village is a great example of how a huge amount of traffic coming in and out at peak hours can be dispersed with no individual station taking a huge hit. Third, as Citi Bike looks toward future expansion, it should try to create good infrastructure near hospitals. Hospitals make sense as locations with a workforce that is socioeconomically more diverse and more health-conscious, and whose members may prefer to use Citi Bike as a form of transportation.

For future analysis, I would like to have access to data showing the number of bikes an individual Citi Bike station can hold. That would allow for more complex analysis to determine if and when a station has accumulated or lost enough bikes that it would need to be rebalanced by Citi Bike. I would also want to bring in more external data to see what other geographic features might influence Citi Bike traffic. Finally, I would formalize my analysis so that it is less reliant on anecdotal observations from individual maps.



About Author

Hayden Warren

Hayden is an NYC Data Science Academy Fellow with a B.S. in Mathematics from the University of Utah. He then went on to work as a math teacher and debate coach, coaching multiple state champions. During this time...