Machine Learning: The CitiBike Station Rebalancing Issue
The skills I demoed here can be learned by taking the Data Science with Machine Learning bootcamp at NYC Data Science Academy.
Links: GitHub | Presentation | CitiBike Data
Introduction
Have you ever ridden a CitiBike in New York City, only to reach your destination and find no docks available to return your bike? Or perhaps you’ve searched the CitiBike app in frustration for the second-closest station (or worse, the third) because the one around the corner from your apartment is empty.
For those unfamiliar with the CitiBike experience, these are issues that many New Yorkers face when relying on one of the world’s largest bike share programs for their daily commute. In a nutshell, poor bike and return dock availability remains a key obstacle to CitiBike customer retention, despite CitiBike’s attempts to move bikes from one hotspot to another throughout the day.
Our Solution
Our group, a team of data scientists, decided to re-evaluate CitiBike’s station rebalancing strategy. We developed an application in Python that uses machine learning to create a strategy for moving bikes from full or near-full stations to empty or near-empty stations.
Given a date and time, we can predict (1) the outgoing and incoming bike demand and (2) the depletion status for each CitiBike station in New York City. Our application uses these predictions to dynamically generate a paired list of stations to/from which to move bikes to ensure maximum rider fulfillment. Below, we lay out our framework for how we analyzed and cleaned the data and how we developed our application.
Our Project Workflow
- Exploratory Data Analysis
- Data Processing
- Modeling & Results
- Final Conclusion
Exploratory Data Analysis
Understanding the Data
We used two datasets: CitiBike’s system data and The Open Bus bike share data. The CitiBike data consisted of a collection of monthly datasets spanning June 2013 to January 2020 with over 80 million total observations of individual rides across 15 variables; the Open Bus data was made up of monthly datasets from March 2015 to April 2019 with over 35 million total snapshot observations of station status across 14 variables. Below is a quick overview of the insights derived from exploratory analysis of the two datasets.
CitiBike Trips Data Insights
In 2019, there were 856 active stations in New York City (excluding Jersey City) and roughly 12,800 bikes on the platform. According to the data, September had the highest average number of trips, closely followed by the summer months (June-August).
Ridership peaks on weekdays during the morning and afternoon rush hours (8am-9am and 5pm-6pm), with the largest share of CitiBike trips lasting over 15 minutes. Presumably, this reflects many New Yorkers’ daily commute.
Additionally, we saw that the start stations with the most trips were in heavily trafficked areas of Manhattan, including Midtown and the Lower East Side. In fact, the station with the highest number of trips is at Grand Central Terminal (Pershing Square North station). Moreover, the station with the highest average trip duration is on the Lower East Side (Rivington & Ridge station), a fact we believe is driven by CitiBike users entering Brooklyn from Manhattan, who take advantage of the station’s direct access to the borough via the Williamsburg Bridge.
Open Bus Station Data Insights
Recall that the Open Bus dataset was used to glean useful insights about the CitiBike dock stations. These insights corroborated the information from the CitiBike trips dataset: we observed that station utilization (bikes used/total docks) also peaked around the morning and afternoon rush hour periods.
Interestingly, the data showed that station utilization was actually higher during the spring (March-May) than during the summer, contrary to what might have been expected. We believe that this is due to CitiBike’s policy of changing its “active” fleet across seasons.
Our research showed that CitiBike removes bikes from the active fleet (the bikes it has stationed in docks for use at any given time) for repair and maintenance, but also to match the supply of bikes to overall demand. The higher station utilization in the spring can be explained by the larger active fleet CitiBike maintains in the summer: more bikes lead to a lower utilization percentage at any given station than fewer bikes do.
Data Processing
Individual Rides Dataset
For this project we restricted our data to the period where both datasets overlapped -- 2015 to 2019. The various monthly datasets from the CitiBike systems database were combined into a single database and processed.
Since the original data had over 80 million rows, we used Dask’s multi-core processing to speed up our pandas workflow. We added the following columns, derived from the original data, to the final dataset of all trips:
- Starttime_interval
- Stoptime_interval
- Season
- Dayofweek
We then split the original dataset into two datasets, outgoing rides and incoming rides, by grouping the data by station, date, and time interval and aggregating the outgoing and incoming bike counts for each station in each interval.
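As a rough illustration of the derived columns and the grouping step, here is a minimal sketch that uses only the Python standard library rather than the Dask and pandas pipeline described above. The 30-minute interval width, the fall and winter month ranges, the field names, and the sample trips are all assumptions made for the example.

```python
from collections import Counter
from datetime import datetime

# Assumed month-to-season mapping; the post only states that spring is
# March-May and summer is June-August.
SEASONS = {12: "Winter", 1: "Winter", 2: "Winter",
           3: "Spring", 4: "Spring", 5: "Spring",
           6: "Summer", 7: "Summer", 8: "Summer",
           9: "Fall", 10: "Fall", 11: "Fall"}
DAYS = ("Monday", "Tuesday", "Wednesday", "Thursday",
        "Friday", "Saturday", "Sunday")


def time_interval(ts: datetime, minutes: int = 30) -> str:
    """Floor a timestamp to the start of its interval, e.g. 08:47 -> '08:30'."""
    floored = (ts.minute // minutes) * minutes
    return f"{ts.hour:02d}:{floored:02d}"


def derive_features(ts: datetime) -> dict:
    """Return the interval, season, and day-of-week features for a timestamp."""
    return {
        "interval": time_interval(ts),
        "season": SEASONS[ts.month],
        "dayofweek": DAYS[ts.weekday()],
    }


def count_rides(trips: list[dict]) -> tuple[Counter, Counter]:
    """Aggregate trips into outgoing and incoming counts.

    Keys are (station_id, date, interval) tuples.
    """
    outgoing, incoming = Counter(), Counter()
    for trip in trips:
        start, stop = trip["starttime"], trip["stoptime"]
        outgoing[(trip["start_station_id"], start.date(), time_interval(start))] += 1
        incoming[(trip["end_station_id"], stop.date(), time_interval(stop))] += 1
    return outgoing, incoming


# Hypothetical sample trips with made-up station IDs (not real records).
trips = [
    {"starttime": datetime(2019, 9, 3, 8, 5), "stoptime": datetime(2019, 9, 3, 8, 25),
     "start_station_id": 101, "end_station_id": 202},
    {"starttime": datetime(2019, 9, 3, 8, 20), "stoptime": datetime(2019, 9, 3, 8, 41),
     "start_station_id": 101, "end_station_id": 202},
]
outgoing, incoming = count_rides(trips)
```

Each key in `outgoing` and `incoming` identifies one station in one interval on one date, which is the granularity at which the demand targets are defined.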
Next, we generated the target variables with which to train our machine learning models: Bike Demand for the outgoing dataset and Dock Demand for the incoming dataset. We classified these variables by looking at the distributions of the outgoing and incoming bike counts and marking the top 25% as “High,” the bottom 25% as “Low,” and the rest as “Medium.”
Machine Learning Dataset Analysis
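The quartile-based labeling can be sketched as follows. This is an illustration under assumptions, not our exact code: the function name is hypothetical, and treating counts that fall exactly on a quartile boundary as “High” or “Low” is a choice made for the example.

```python
from statistics import quantiles


def demand_labels(counts: list[int]) -> list[str]:
    """Label each count High/Medium/Low using the 25th and 75th percentiles."""
    # quantiles(..., n=4) returns the three quartile cut points.
    q1, _, q3 = quantiles(counts, n=4, method="inclusive")
    labels = []
    for count in counts:
        if count >= q3:
            labels.append("High")      # top quarter of the distribution
        elif count <= q1:
            labels.append("Low")       # bottom quarter of the distribution
        else:
            labels.append("Medium")
    return labels


# Hypothetical outgoing bike counts for nine station-intervals.
counts = [0, 1, 2, 3, 4, 5, 6, 7, 8]
labels = demand_labels(counts)
```

The same function would apply to the incoming counts to produce the Dock Demand labels.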
After generating these features, we analyzed the new dataset and found that there was, on average, a higher dock demand during the morning rush hour than during the afternoon rush hour, while afternoon rush hour bike demand outpaced morning bike demand.
Intuitively, this discovery makes sense: commuters rushing into the office in the morning cause a spike in dock demand, and after work the same commuters look for bikes to get home or to the subway. This imbalance also confirms our belief that a rebalancing strategy is needed to mitigate the issue of “no docks in the morning” or “no bikes in the afternoon.”
Another takeaway is that the stations with high dock demand in the morning are largely the same stations that have high bike demand in the afternoon. This also makes sense: most subscribers are commuters who need to dock a bike near work during the morning rush hour and take a bike from the same location during the afternoon rush hour. The maps below show the dock demand and bike demand for all stations to help visualize where most commuters take bikes to and from on a daily basis.
Stations Dataset
The Open Bus dataset, containing snapshots about the individual stations at any given time, was much messier than the CitiBike individual rides data. Much of the data processing work went into parsing individual columns and handling incorrectly entered data. However, once the dataset had been cleaned, we followed a similar data processing procedure for this dataset to the one used for the individual rides dataset.
Because the snapshots of station health happened at inconsistent times and roughly at a rate of two to three snapshots per hour, we chose to group observations into half hour increments. We also created variables to indicate the day of the week and the season of each observation, to align the methodology with the individual trips dataset.
Using the cleaned data, we created a variable indicating the depletion status of each station at each snapshot, calculated by dividing the available bikes at that time by the total docks at the station. This percentage was used to classify stations into three categories: “Full Risk” if the percentage was 66% or higher, “Empty Risk” if it was 33% or lower, and “Healthy” otherwise.
An exploratory visualization of the processed station data, below, indicates that the vast majority of stations are, on average, classified as “Empty Risk” and therefore likely in need of rebalancing.
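The thresholds above translate directly into a small function. The sketch below is illustrative; the function and parameter names are assumptions, not the names used in our code.

```python
def depletion_status(available_bikes: int, total_docks: int) -> str:
    """Classify a station snapshot by its share of docks holding a bike."""
    if total_docks <= 0:
        raise ValueError("total_docks must be positive")
    fill = available_bikes / total_docks
    if fill >= 0.66:
        return "Full Risk"    # many bikes, few open docks
    if fill <= 0.33:
        return "Empty Risk"   # few bikes, many open docks
    return "Healthy"
```

For example, a 40-dock station holding 30 bikes is 75% full and would be labeled “Full Risk,” while the same station holding 10 bikes would be labeled “Empty Risk.”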
Modeling & Rebalancing Results
Once the data processing was completed, we were able to turn our attention to our machine learning model and rebalancing algorithm.
Machine Learning
We trained three Random Forest Classifier models on our processed datasets, one each to predict bike demand, dock demand, and depletion status for each station in New York City given a future date and time. We held out a test set to check the models for overfitting, and tuned the models as well as we could given the available processing power.
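To give a sense of what training one of these models looks like, here is a minimal scikit-learn sketch. The data is synthetic, and the feature encoding, the rule that generates the labels, and the hyperparameters are all assumptions for illustration; they are not our tuned settings or our real features.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in features (assumed encoding): station index,
# half-hour slot of the day (0-47), day of week (0-6), season (0-3).
n = 600
X = np.column_stack([
    rng.integers(0, 20, n),
    rng.integers(0, 48, n),
    rng.integers(0, 7, n),
    rng.integers(0, 4, n),
])

# Synthetic target: "High" around the morning rush, "Low" overnight,
# "Medium" otherwise. Real labels come from the quartile rule.
slot = X[:, 1]
y = np.where((slot >= 16) & (slot <= 18), "High",
             np.where(slot < 12, "Low", "Medium"))

# Hold out 25% of the rows so overfitting shows up as a train/test gap.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

train_acc = model.score(X_train, y_train)
test_acc = model.score(X_test, y_test)
predictions = model.predict(X_test)
```

Comparing `train_acc` with `test_acc` is the basic check: a training score far above the test score suggests the forest is memorizing rather than generalizing.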
Rebalancing Strategy
Once we had models trained to supply predictions for bike demand, dock demand, and depletion status for each station, we developed an iterative algorithm to identify which stations were in need of having bikes rebalanced out, which were in need of having bikes rebalanced in, and which were not in need of rebalancing at all.
Any station that our models identified as “Full Risk” (again, a station that had many bikes but few open docks available) as well as having a medium-to-high dock demand (comparatively many incoming bikes) and a low-to-medium bike demand (comparatively few bikes going out) was classified as in need of having bikes rebalanced out.
Conversely, any station that our models identified as “Empty Risk” (a station with few bikes but many open docks) as well as having a low-to-medium dock demand (comparatively few incoming bikes) and a medium-to-high bike demand (comparatively many bikes going out) was classified as in need of having bikes rebalanced in.
Station Pairing
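The two rules above can be expressed as a single function over the three model predictions. This is a sketch; the function name and the returned labels are hypothetical.

```python
def rebalance_need(status: str, dock_demand: str, bike_demand: str) -> str:
    """Map predicted status and demand levels to a rebalancing action."""
    if (status == "Full Risk"
            and dock_demand in {"Medium", "High"}
            and bike_demand in {"Low", "Medium"}):
        return "rebalance_out"   # bikes should be taken away
    if (status == "Empty Risk"
            and dock_demand in {"Low", "Medium"}
            and bike_demand in {"Medium", "High"}):
        return "rebalance_in"    # bikes should be brought in
    return "none"
```

A full station with high bike demand returns `"none"` here, since riders are expected to empty it without intervention.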
Stations identified as needing bikes brought in were paired with stations identified as needing bikes taken out based on distance: the closest stations (calculated using Manhattan distance to account for New York City’s block system and the fact that the bikes are rebalanced via car) were paired together. Once stations were paired, their depletion statuses were updated in the master table and they were removed from the list of stations to be rebalanced, in order to avoid double counting.
Finally, in the event that more stations needed rebalancing out than rebalancing in (or vice versa), stations with the appropriate depletion status and medium demand for both bikes and docks were used to rebalance the leftover stations.
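A simplified version of the pairing step might look like the sketch below. It makes several assumptions: stations are processed greedily in input order, Manhattan distance is approximated on latitude and longitude axes (the real street grid is rotated relative to true north), the conversion of 111 km per degree is a rough constant, and the station IDs and coordinates are made up.

```python
import math


def manhattan_km(a: tuple, b: tuple) -> float:
    """Approximate Manhattan (L1) distance in km between (lat, lon) points."""
    lat1, lon1 = a
    lat2, lon2 = b
    km_per_deg_lat = 111.0  # rough constant (assumption)
    km_per_deg_lon = 111.0 * math.cos(math.radians((lat1 + lat2) / 2))
    return abs(lat1 - lat2) * km_per_deg_lat + abs(lon1 - lon2) * km_per_deg_lon


def pair_stations(need_out: dict, need_in: dict) -> list[tuple]:
    """Pair each rebalance-out station with its closest unpaired
    rebalance-in station, removing paired stations as we go."""
    remaining_in = dict(need_in)
    pairs = []
    for out_id, out_pos in need_out.items():
        if not remaining_in:
            break  # leftover stations need the fallback rule
        in_id = min(remaining_in,
                    key=lambda s: manhattan_km(out_pos, remaining_in[s]))
        distance = round(manhattan_km(out_pos, remaining_in[in_id]), 2)
        pairs.append((out_id, in_id, distance))
        del remaining_in[in_id]  # avoid double counting
    return pairs


# Hypothetical stations: ID -> (latitude, longitude).
need_out = {"A": (40.75, -73.98), "B": (40.70, -74.00)}
need_in = {"X": (40.71, -74.00), "Y": (40.76, -73.98), "Z": (40.80, -73.95)}
pairs = pair_stations(need_out, need_in)
```

Because the loop is greedy, the result depends on the order in which stations are visited; a globally optimal assignment would need a different approach.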
The algorithm then outputs the results of the paired list as an easy-to-reference table using individual station IDs as identifiers. The results can also be mapped using the unique station. An example of both the output table and a preliminary mapping are shown below:
Conclusion
Thank you so much for taking the time to read about our project!
In the future, we’re interested in fleshing out the datasets with traffic and precipitation data, accounting for one-way streets in our rebalancing algorithm, and developing a user interface for the app that makes it easier to generate the paired list of stations between which to rebalance bikes.
Please feel free to take a look at our GitHub accounts below if you’re interested in the code that went into making this project possible or in our presentation deck. We can also be found on LinkedIn.
Christian Opperman: LinkedIn | GitHub
Melanie Zheng: LinkedIn | GitHub
Paul Choi: LinkedIn | GitHub