Machine Learning: The CitiBike Station Rebalancing Issue

Posted on Mar 27, 2020
The skills I demoed here can be learned through taking Data Science with Machine Learning bootcamp with NYC Data Science Academy.

Links: GitHub | Presentation | CitiBike Data


Have you ever ridden a CitiBike in New York City, only to reach your destination and find no docks available to return your bike? Or perhaps you’ve searched the CitiBike app in frustration for the second -- or worse, the third -- closest station because the one around the corner from your apartment is empty.

For those unfamiliar with the CitiBike experience, these are issues that many New Yorkers face when relying upon the largest bike share program in the United States for their daily commute. In a nutshell, poor bike and dock availability remains a key obstacle to CitiBike customer retention, despite CitiBike’s attempts to move bikes from one hotspot to another throughout the day.

Our Solution

Our group, a team of data scientists, decided to re-evaluate CitiBike’s station rebalancing strategy. We developed an application in Python that uses machine learning to build a rebalancing plan: it moves bikes out of full or near-full stations and into empty or near-empty ones.

Given a date and time, we can predict (1) the outgoing and incoming bike demand and (2) the depletion status for each CitiBike station in New York City. Our application uses these predictions to dynamically generate a paired list of stations to/from which to move bikes to ensure maximum rider fulfillment. Below, we lay out our framework for how we analyzed and cleaned the data and how we developed our application.

Our Project Workflow

  1. Exploratory Data Analysis
  2. Data Processing
  3. Modeling & Results
  4. Final Conclusion

Exploratory Data Analysis

Understanding the Data

We utilized two datasets: CitiBike’s system data and The Open Bus bike share data. The CitiBike data consisted of a collection of monthly datasets spanning June 2013 to January 2020 with over 80 million total observations of individual rides across 15 variables; the Open Bus data was made up of monthly datasets from March 2015 to April 2019 with over 35 million total snapshot observations of station status across 14 variables. Below is a quick overview of the insights derived from exploratory analysis into the two datasets.

CitiBike Trips Data Insights

In 2019, there were 856 total active stations in New York City (excluding Jersey City) and roughly 12,800 bikes on the platform. According to the data, September had the highest average number of trips taken, closely followed by the summer months (June-August).

Ridership peaks on weekdays during the morning and afternoon rush hour periods (8am-9am, 5pm-6pm), with the largest proportion of CitiBike trips taking over 15 minutes. Presumably, this reflects many New Yorkers’ daily commute.

Additionally, we saw that the start stations with the most trips were in heavily trafficked areas of Manhattan, including Midtown and the Lower East Side. In fact, the station with the highest number of trips is at Grand Central Terminal (Pershing Square North station). Moreover, the station with the highest average trip duration is in the Lower East Side (Rivington & Ridge station), a fact we believe is driven by CitiBike users crossing from Manhattan into Brooklyn and taking advantage of the station’s direct access to the borough via the Williamsburg Bridge.

Open Bus Station Data Insights

Recall that the Open Bus dataset was used to glean useful insights about the CitiBike dock stations. These insights corroborated the information from the CitiBike trips dataset: we observed that station utilization (bikes used/total docks) also peaked around the morning and afternoon rush hour periods.

Interestingly, the data showed that station utilization was actually higher during the spring (March-May) than during the summer, contrary to what might be expected. We believe that this is due to CitiBike’s policy of changing the size of its “active” fleet across seasons.

Our research showed that CitiBike removes bikes from the active fleet -- the bikes it has stationed in docks for use at any given time -- not only for repair and maintenance, but also to match the supply of bikes to overall demand. The higher station utilization in the spring can be explained by the larger active fleet CitiBike maintains in the summer: with more bikes in the system, the utilization percentage for any given station is smaller than it is with fewer bikes.

Data Processing

Individual Rides Dataset

For this project we restricted our data to the period where both datasets overlapped -- 2015 to 2019. The various monthly datasets from the CitiBike systems database were combined into a single database and processed.

Since the original data file had over 80 million rows, we utilized Dask’s multi-core processing to optimize Pandas’s performance. Overall, we added the following columns, derived from the original data, to the final dataset with all trips:

  • Starttime_interval
  • Stoptime_interval
  • Season
  • Dayofweek
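The derived columns listed above could be produced along these lines. This is only a sketch under stated assumptions: the raw column names `starttime` and `stoptime`, the 30-minute interval width, and the month-to-season mapping are illustrative choices rather than details confirmed by the original pipeline, and plain Pandas is used here instead of Dask for brevity.

```python
import pandas as pd

# Assumption: meteorological seasons (the post treats March-May as spring
# and June-August as summer).
SEASON_BY_MONTH = {
    12: "Winter", 1: "Winter", 2: "Winter",
    3: "Spring", 4: "Spring", 5: "Spring",
    6: "Summer", 7: "Summer", 8: "Summer",
    9: "Fall", 10: "Fall", 11: "Fall",
}


def add_time_features(trips: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of `trips` with interval, season, and day-of-week columns."""
    out = trips.copy()
    start = pd.to_datetime(out["starttime"])  # assumed column name
    stop = pd.to_datetime(out["stoptime"])    # assumed column name

    # Assumption: rides are bucketed into 30-minute intervals.
    out["starttime_interval"] = start.dt.floor("30min").dt.time
    out["stoptime_interval"] = stop.dt.floor("30min").dt.time
    out["season"] = start.dt.month.map(SEASON_BY_MONTH)
    out["dayofweek"] = start.dt.day_name()
    return out
```

Calling `add_time_features` on a frame with one ride starting at 08:47 would place that ride in the 08:30 start interval.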

We then broke the original dataset into two datasets: outgoing rides and incoming rides by grouping the data by station, date, and time interval to aggregate outgoing bike counts and incoming bike counts for each station at each time.
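As a rough illustration of this aggregation step, the sketch below counts rides per station, date, and time interval. The function name `count_rides`, the raw column names, and the 30-minute bucket width are assumptions made for the example.

```python
import pandas as pd


def count_rides(trips: pd.DataFrame, station_col: str, time_col: str,
                count_name: str) -> pd.DataFrame:
    """Count rides per (station, date, 30-minute interval)."""
    stamps = pd.to_datetime(trips[time_col])
    keyed = pd.DataFrame({
        "station_id": trips[station_col].to_numpy(),
        "date": stamps.dt.date.to_numpy(),
        # Assumption: 30-minute intervals.
        "interval": stamps.dt.floor("30min").dt.time.to_numpy(),
    })
    return (
        keyed.groupby(["station_id", "date", "interval"])
        .size()
        .reset_index(name=count_name)
    )
```

Outgoing counts would key on the start station and start time, for example `count_rides(trips, "start station id", "starttime", "outgoing_count")`, while incoming counts would key on the end station and stop time.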

Next, we generated the target variables with which to train our machine learning models: Bike Demand for the outgoing dataset and Dock Demand for the incoming dataset. We classified these variables by looking at the distributions of the outgoing and incoming bike counts and marking the top 25% of values as ‘High,’ the bottom 25% as ‘Low,’ and the rest as ‘Medium.’
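A quartile-based labeling like this one might be sketched as follows. The helper name `label_demand` and the handling of values that fall exactly on a cutoff are assumptions, since the post does not say how ties were treated.

```python
import pandas as pd


def label_demand(counts: pd.Series) -> pd.Series:
    """Label each count 'Low', 'Medium', or 'High' using quartile cutoffs."""
    low_cut = counts.quantile(0.25)
    high_cut = counts.quantile(0.75)

    labels = pd.Series("Medium", index=counts.index)
    # Assumption: values exactly on a cutoff join the outer class,
    # and 'High' wins if the two cutoffs coincide.
    labels[counts <= low_cut] = "Low"
    labels[counts >= high_cut] = "High"
    return labels
```

The same function would be applied once to the outgoing counts (Bike Demand) and once to the incoming counts (Dock Demand).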

Machine Learning Dataset Analysis

After generating these features, we analyzed the new dataset and found that there was, on average, a higher dock demand during the morning rush hour than during the afternoon rush hour, while afternoon rush hour bike demand outpaced morning bike demand.

Intuitively, this discovery makes sense: commuters rushing into the office in the morning cause dock demand to spike, and after work the same commuters look for bikes to get home or to the subway. This imbalance also confirms our belief that a rebalancing strategy is needed to mitigate the issue of ‘no docks in the morning’ or ‘no bikes in the afternoon.’

Another takeaway is that the stations with high dock demands in the morning are also very similar to the stations with high bike demands in the afternoon. Intuitively this makes sense as most subscribers are commuters and they need to dock the bike near work in the morning rush hour and conversely need to take bikes from those same locations during the afternoon rush hour. The below maps demonstrate the dock demand and bike demand for all stations to help visualize where most commuters take bikes to and from on a daily basis.

Stations Dataset

The Open Bus dataset, containing snapshots about the individual stations at any given time, was much messier than the CitiBike individual rides data. Much of the data processing work went into parsing individual columns and handling incorrectly entered data. However, once the dataset had been cleaned, we followed a similar data processing procedure for this dataset to the one used for the individual rides dataset.

Because the snapshots of station health happened at inconsistent times and roughly at a rate of two to three snapshots per hour, we chose to group observations into half hour increments. We also created variables to indicate the day of the week and the season of each observation, to align the methodology with the individual trips dataset.
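One way to group irregular snapshots into half-hour increments is sketched below. The column names (`station_id`, `snapshot_time`, `available_bikes`, `total_docks`) are hypothetical, and averaging the snapshots that land in the same bucket is an assumption; the post does not state how multiple snapshots per increment were combined.

```python
import pandas as pd


def bucket_snapshots(snapshots: pd.DataFrame) -> pd.DataFrame:
    """Collapse irregular station snapshots into one row per half hour."""
    out = snapshots.copy()
    # Floor each timestamp to the start of its 30-minute window.
    out["interval_start"] = pd.to_datetime(out["snapshot_time"]).dt.floor("30min")
    # Assumption: snapshots in the same window are averaged.
    return out.groupby(["station_id", "interval_start"], as_index=False)[
        ["available_bikes", "total_docks"]
    ].mean()
```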

Using this, we created a variable indicating the depletion status of each station at the snapshot, which was calculated by dividing the available bikes at that time by the total docks at the given station. This percentage was used to classify stations into three risks: ‘Full Risk’ if the station’s depletion status percentage was 66% or higher, ‘Empty Risk’ if it was 33% or lower, or ‘Healthy’ otherwise.
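The thresholds above translate into a small classification function. The sketch below is illustrative only; the function name and the guard against stations reporting zero docks are assumptions added for the example.

```python
def depletion_status(available_bikes: float, total_docks: float) -> str:
    """Classify a station snapshot by its share of docks holding bikes."""
    if total_docks <= 0:
        # Assumption: a snapshot with no docks cannot be classified.
        raise ValueError("total_docks must be positive")

    share = available_bikes / total_docks
    if share >= 0.66:
        return "Full Risk"   # many bikes, few open docks
    if share <= 0.33:
        return "Empty Risk"  # few bikes, many open docks
    return "Healthy"
```

For example, a station with 30 bikes in 40 docks has a share of 0.75 and is labeled ‘Full Risk,’ while one with 10 bikes in 40 docks (0.25) is labeled ‘Empty Risk.’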

An exploratory visualization of the processed station data, below, indicates that the vast majority of the stations are, on average, classified as ‘Empty Risk’ and therefore likely in need of rebalancing.


Modeling & Rebalancing Results

Once the data processing was completed, we were able to turn our attention to our machine learning model and rebalancing algorithm.

Machine Learning

We trained three Random Forest Classifier models on our processed datasets, one each to predict bike demand, dock demand, and depletion status for each station in New York City when provided a future date and time. We used train-test splitting to check for overfitting to our training data, and tuned the models as best we could given the available processing power.
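For readers who want a concrete picture, training one such classifier with scikit-learn might look roughly like the sketch below. The helper name `train_demand_model`, the one-hot encoding of features, the 25% test split, and the hyperparameters are all assumptions for illustration, not the settings used in the project.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split


def train_demand_model(features: pd.DataFrame, labels: pd.Series, seed: int = 0):
    """Fit a random forest and report accuracy on a held-out test set."""
    # Assumption: categorical features (e.g. season, day of week) are one-hot encoded.
    X = pd.get_dummies(features)

    # Hold out a quarter of the rows so overfitting shows up as a gap
    # between training and test accuracy.
    X_train, X_test, y_train, y_test = train_test_split(
        X, labels, test_size=0.25, random_state=seed, stratify=labels
    )

    model = RandomForestClassifier(n_estimators=100, random_state=seed)
    model.fit(X_train, y_train)
    return model, model.score(X_test, y_test)
```

The same routine would be run three times, once per target: bike demand, dock demand, and depletion status.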

Rebalancing Strategy

Once we had models trained to supply predictions for bike demand, dock demand, and depletion status for each station, we developed an iterative algorithm to identify which stations were in need of having bikes rebalanced out, which were in need of having bikes rebalanced in, and which were not in need of rebalancing at all.


Any station that our models identified as ‘Full Risk’ (again, a station that had many bikes but few open docks available) as well as having a medium-to-high dock demand (comparatively many incoming bikes) and a low-to-medium bike demand (comparatively few bikes going out) was classified as in need of having bikes rebalanced out.


Conversely, any station that our models identified as ‘Empty Risk’ (a station with few bikes but many open docks) as well as having a low-to-medium dock demand (comparatively few incoming bikes) and a medium-to-high bike demand (comparatively many bikes going out) was classified as in need of having bikes rebalanced in.
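The two rules above can be written as a single decision function. This is a sketch with hypothetical names; the label strings match those used earlier in the post, and the return values are chosen for the example.

```python
def rebalancing_need(depletion: str, bike_demand: str, dock_demand: str) -> str:
    """Return 'rebalance_out', 'rebalance_in', or 'none' for one station."""
    if (
        depletion == "Full Risk"
        and dock_demand in {"Medium", "High"}   # many incoming bikes
        and bike_demand in {"Low", "Medium"}    # few outgoing bikes
    ):
        return "rebalance_out"

    if (
        depletion == "Empty Risk"
        and dock_demand in {"Low", "Medium"}    # few incoming bikes
        and bike_demand in {"Medium", "High"}   # many outgoing bikes
    ):
        return "rebalance_in"

    return "none"
```

Note that a ‘Full Risk’ station with high bike demand returns `'none'` under these rules, since riders are expected to draw it down without intervention.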

Pairing Stations

Stations that were identified as in need of having bikes brought in were paired with stations identified as in need of having bikes taken out based on distance: the closest stations (calculated using Manhattan distance to account for New York City’s block system and the fact that the bikes are rebalanced via car) were paired together. Once stations were paired, their depletion statuses were updated in the master table and they were removed from the list of stations to be rebalanced, in order to avoid double counting.
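A greedy version of this pairing step could be sketched as follows. Several details are assumptions: station positions are taken to be planar `(x, y)` coordinates already aligned with the street grid, stations are processed in input order, and the function names are hypothetical. The update of depletion statuses in the master table is not shown.

```python
def manhattan(a: tuple, b: tuple) -> float:
    """Manhattan (city-block) distance between two (x, y) points."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def pair_stations(needs_out: dict, needs_in: dict) -> list:
    """Pair each station needing bikes out with its nearest unpaired station needing bikes in.

    Both arguments map station ID -> (x, y) coordinates.
    """
    remaining_in = dict(needs_in)  # copy so the caller's dict is untouched
    pairs = []
    for out_id, out_xy in needs_out.items():
        if not remaining_in:
            break  # leftover stations are handled separately
        nearest = min(
            remaining_in,
            key=lambda sid: manhattan(out_xy, remaining_in[sid]),
        )
        pairs.append((out_id, nearest))
        # Remove the paired station so it is not counted twice.
        del remaining_in[nearest]
    return pairs
```

Because the sketch is greedy, the result depends on the order in which stations are visited; it does not guarantee the minimum total travel distance.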

Finally, in the event where more stations needed rebalancing out than did rebalancing in (or vice versa), stations with the appropriate level of depletion status and medium demand for both bike demand and dock demand were used to rebalance the leftover stations.

The algorithm then outputs the results of the paired list as an easy-to-reference table using individual station IDs as identifiers. The results can also be mapped using the unique station. An example of both the output table and a preliminary mapping are shown below:


Thank you so much for taking the time to read about our project! 

In the future, we’re interested in fleshing out the datasets with traffic and precipitation data, accounting for one-way streets in our rebalancing algorithm, and developing a user interface for the app that makes it easier to generate the paired list of stations to and from which to rebalance bikes.

Please feel free to take a look at our GitHub accounts below if you’re interested in the code that went into making this project possible or our presentation deck. Alternatively, we can be found on LinkedIn as well.

Christian Opperman: LinkedIn | GitHub
Melanie Zheng: LinkedIn | GitHub
Paul Choi: LinkedIn | GitHub


About Authors

Christian Opperman

Christian Opperman is a data scientist and former project manager at Mitsui & Co., Ltd. with experience in technical writing, a bachelors in Economics, and a Masters in Writing.

Melanie Zheng

Melanie is currently enrolled in Georgia Institute of Technology for Master's Degree in Computer Science with Machine Learning specialization. She previously worked as a product manager at Viacom and project manager at Citigroup.
