NYC Data Science Academy| Blog

Machine Learning: The CitiBike Station Rebalancing Issue

Paul Choi, Christian Opperman and Melanie Zheng
Posted on Mar 27, 2020
The skills demonstrated here can be learned through the Data Science with Machine Learning bootcamp at NYC Data Science Academy.

Links: GitHub | Presentation | CitiBike Data

Introduction

Have you ever ridden a CitiBike in New York City, only to reach your destination and find no docks available to return your bike? Or perhaps you've searched the CitiBike app in frustration for the second -- or worse, the third -- closest station because the one around the corner from your apartment is empty.

For those unfamiliar with the CitiBike experience, these are issues that many New Yorkers face when relying on one of the world's largest bike share programs for their daily commute. In a nutshell, poor bike and return-dock availability continues to be a key issue for CitiBike customer retention, despite CitiBike's attempts to move bikes from one hotspot to another throughout the day.

Our Solution

Our group, a team of data scientists, decided to re-evaluate CitiBike's station rebalancing strategy. We developed an application in Python that uses machine learning to create a strategy for moving bikes from full or near-full stations to empty or near-empty ones.

Given a date and time, we can predict (1) the outgoing and incoming bike demand and (2) the depletion status for each CitiBike station in New York City. Our application uses these predictions to dynamically generate a paired list of stations to/from which to move bikes to ensure maximum rider fulfillment. Below, we lay out our framework for how we analyzed and cleaned the data and how we developed our application.

Our Project Workflow

  1. Exploratory Data Analysis
  2. Data Processing
  3. Modeling & Results
  4. Final Conclusion

Exploratory Data Analysis

Understanding the Data

We used two datasets: CitiBike's system data and the Open Bus bike share data. The CitiBike data consisted of a collection of monthly datasets spanning June 2013 to January 2020 with over 80 million total observations of individual rides across 15 variables; the Open Bus data was made up of monthly datasets from March 2015 to April 2019 with over 35 million total snapshot observations of station status across 14 variables. Below is a quick overview of the insights derived from exploratory analysis of the two datasets.

CitiBike Trips Data Insights

In 2019, there were 856 total active stations in New York City (excluding Jersey City) and roughly 12,800 bikes on the platform. According to the data, September had the highest average number of trips, closely followed by the summer months (June-August).

Ridership peaks on weekdays during the morning and afternoon rush hour periods (8am-9am, 5pm-6pm), and the largest proportion of CitiBike trips take over 15 minutes. Presumably, this reflects many New Yorkers' daily commutes.

Additionally, we saw that the start stations with the most trips were in heavily trafficked areas of Manhattan, including Midtown and the Lower East Side. In fact, the station with the highest number of trips is at Grand Central Terminal (Pershing Square North station). Moreover, the station with the highest average trip duration is on the Lower East Side (Rivington & Ridge station), a fact we believe is driven by CitiBike users heading from Manhattan to Brooklyn and taking advantage of the station's direct access to the borough via the Williamsburg Bridge.

Open Bus Station Data Insights

Recall that the Open Bus dataset was used to glean useful insights about the CitiBike dock stations. These insights corroborated the information from the CitiBike trips dataset: we observed that station utilization (bikes used/total docks) also peaked around the morning and afternoon rush hour periods.

Interestingly, the data showed that station utilization was actually higher during the spring (March-May) than during the summer, contrary to what might have been expected. We believe that this is due to CitiBike's policy of changing its "active" fleet across seasons.

Our research showed that CitiBike removes bikes from the active fleet -- the number of bikes it has stationed in docks for use at any given time -- for repair and maintenance, but also to control the supply of bikes needed to meet overall demand. The higher station utilization in the spring can be explained by the larger active fleet CitiBike maintains in the summer -- more bikes lead to a smaller utilization percentage for any given station than fewer bikes do.

Data Processing

Individual Rides Dataset

For this project we restricted our data to the period where both datasets overlapped -- 2015 to 2019. The various monthly datasets from the CitiBike systems database were combined into a single database and processed.

Since the original data had over 80 million rows, we used Dask's multi-core processing to parallelize the Pandas operations. Overall, we added the following columns, derived from the original data, to the final dataset with all trips:

  • Starttime_interval
  • Stoptime_interval
  • Season
  • Dayofweek
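
A minimal sketch of how columns like these could be derived with Pandas. The input column names (`starttime`, `stoptime`), the 30-minute interval width, and the month-to-season mapping are assumptions made for illustration; the post does not state the exact definitions used.

```python
import pandas as pd

# Hypothetical two-row sample; column names are assumed for illustration.
trips = pd.DataFrame({
    "starttime": pd.to_datetime(["2019-01-07 08:12:45", "2019-07-13 17:48:03"]),
    "stoptime": pd.to_datetime(["2019-01-07 08:31:10", "2019-07-13 18:05:59"]),
})


def month_to_season(month: int) -> str:
    """Map a calendar month to a season (assumed definition: Mar-May is spring)."""
    if month in (12, 1, 2):
        return "Winter"
    if month in (3, 4, 5):
        return "Spring"
    if month in (6, 7, 8):
        return "Summer"
    return "Fall"


def add_time_features(df: pd.DataFrame, freq: str = "30min") -> pd.DataFrame:
    """Return a copy of df with interval, season, and day-of-week columns."""
    out = df.copy()
    # Round each timestamp down to the start of its interval.
    out["starttime_interval"] = out["starttime"].dt.floor(freq)
    out["stoptime_interval"] = out["stoptime"].dt.floor(freq)
    out["season"] = out["starttime"].dt.month.map(month_to_season)
    out["dayofweek"] = out["starttime"].dt.day_name()
    return out


features = add_time_features(trips)
print(features)
```

With a Dask DataFrame the same function could, in principle, be applied per partition, though that is not shown here.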

We then broke the original dataset into two datasets: outgoing rides and incoming rides by grouping the data by station, date, and time interval to aggregate outgoing bike counts and incoming bike counts for each station at each time.
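
The split into outgoing and incoming counts can be sketched as a pair of group-by aggregations. The column names and the tiny sample below are hypothetical; here the interval timestamp carries both the date and the time of day, so grouping by station and interval covers station, date, and time interval.

```python
import pandas as pd

# Hypothetical sample: four rides between two stations.
trips = pd.DataFrame({
    "start_station_id": [72, 72, 79, 72],
    "end_station_id": [79, 79, 72, 79],
    "starttime_interval": pd.to_datetime([
        "2019-03-04 08:00", "2019-03-04 08:00",
        "2019-03-04 08:30", "2019-03-04 17:00",
    ]),
    "stoptime_interval": pd.to_datetime([
        "2019-03-04 08:00", "2019-03-04 08:30",
        "2019-03-04 08:30", "2019-03-04 17:00",
    ]),
})


def count_rides(df, station_col, interval_col, count_name):
    """Count rides per (station, interval) and normalize the column names."""
    counts = (
        df.groupby([station_col, interval_col])
        .size()
        .reset_index(name=count_name)
    )
    return counts.rename(
        columns={station_col: "station_id", interval_col: "interval"}
    )


# Outgoing rides are keyed on where and when a trip started,
# incoming rides on where and when it ended.
outgoing = count_rides(trips, "start_station_id", "starttime_interval",
                       "outgoing_bike_count")
incoming = count_rides(trips, "end_station_id", "stoptime_interval",
                       "incoming_bike_count")
print(outgoing)
print(incoming)
```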

Next, we generated the target variables for the Outgoing and Incoming datasets with which to train our machine learning models: Bike Demand and Dock Demand, respectively. We classified these variables by looking at the distribution of the outgoing and incoming bike counts in the two datasets, marking the top 25 percent of counts as 'High', the bottom 25 percent as 'Low', and the rest as 'Medium'.

Analyzing the Machine Learning Dataset
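
One way to express this quartile-based labeling is sketched below. The function name and the choice to treat values exactly on a cut-off as 'Low' or 'High' are assumptions; the post does not say how ties were handled.

```python
import pandas as pd


def label_demand(counts: pd.Series) -> pd.Series:
    """Label each count 'Low', 'Medium', or 'High' using the quartile cut-offs."""
    low_cut = counts.quantile(0.25)   # bottom 25 percent at or below this value
    high_cut = counts.quantile(0.75)  # top 25 percent at or above this value
    labels = pd.Series("Medium", index=counts.index)
    labels[counts <= low_cut] = "Low"
    labels[counts >= high_cut] = "High"
    return labels


# Hypothetical outgoing bike counts for nine station/interval rows.
counts = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9])
demand = label_demand(counts)
print(demand.tolist())
```

The same function would apply to the incoming counts to produce the Dock Demand label.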

After generating these features, we analyzed the new dataset and found that there was, on average, a higher dock demand during the morning rush hour than during the afternoon rush hour, while afternoon rush hour bike demand outpaced morning bike demand.

Intuitively, this discovery makes sense: commuters rushing into the office in the morning drive up dock demand, and after work the same commuters look for bikes to get home or to the subway. This imbalance also confirms our belief that a rebalancing strategy is needed to mitigate the issue of 'no docks in the morning' or 'no bikes in the afternoon.'

Another takeaway is that the stations with high dock demands in the morning are also very similar to the stations with high bike demands in the afternoon. Intuitively this makes sense as most subscribers are commuters and they need to dock the bike near work in the morning rush hour and conversely need to take bikes from those same locations during the afternoon rush hour. The below maps demonstrate the dock demand and bike demand for all stations to help visualize where most commuters take bikes to and from on a daily basis.

Stations Dataset

The Open Bus dataset, containing snapshots about the individual stations at any given time, was much messier than the CitiBike individual rides data. Much of the data processing work went into parsing individual columns and handling incorrectly entered data. However, once the dataset had been cleaned, we followed a similar data processing procedure for this dataset to the one used for the individual rides dataset.

Because the snapshots of station health happened at inconsistent times and roughly at a rate of two to three snapshots per hour, we chose to group observations into half hour increments. We also created variables to indicate the day of the week and the season of each observation, to align the methodology with the individual trips dataset.
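
A rough sketch of the half-hour grouping, assuming hypothetical column names and assuming that snapshots falling in the same half hour are averaged (the post does not say how they were combined):

```python
import pandas as pd

# Hypothetical station snapshots taken at irregular times.
snapshots = pd.DataFrame({
    "station_id": [72, 72, 72, 72],
    "timestamp": pd.to_datetime([
        "2019-03-04 08:03", "2019-03-04 08:21",
        "2019-03-04 08:41", "2019-03-04 09:12",
    ]),
    "available_bikes": [20, 16, 10, 4],
    "total_docks": [40, 40, 40, 40],
})

# Assign every snapshot to the half-hour increment it falls in.
snapshots["interval"] = snapshots["timestamp"].dt.floor("30min")

# Collapse multiple snapshots per station and interval into one row.
binned = snapshots.groupby(["station_id", "interval"], as_index=False).agg(
    available_bikes=("available_bikes", "mean"),
    total_docks=("total_docks", "max"),
)
binned["dayofweek"] = binned["interval"].dt.day_name()
print(binned)
```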

Using this, we created a variable indicating the depletion status of each station at the snapshot, which was calculated by dividing the available bikes at that time by the total docks at the given station. This percentage was used to classify stations into three risk levels: 'Full Risk' if the station's depletion status percentage was 66% or higher, 'Empty Risk' if it was 33% or lower, or 'Healthy' otherwise.

An exploratory visualization of the processed station data, below, indicates that the vast majority of stations are, on average, classified as 'Empty Risk' and therefore likely in need of rebalancing.
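
The thresholds above translate into a small classification function. The function name and the error handling for stations with no docks are illustrative assumptions.

```python
def depletion_status(available_bikes: float, total_docks: int) -> str:
    """Classify a station snapshot by the share of docks holding a bike."""
    if total_docks <= 0:
        raise ValueError("total_docks must be positive")
    ratio = available_bikes / total_docks
    if ratio >= 0.66:
        return "Full Risk"   # many bikes, few open docks
    if ratio <= 0.33:
        return "Empty Risk"  # few bikes, many open docks
    return "Healthy"


# A 40-dock station with 30, 20, and 10 bikes available.
for bikes in (30, 20, 10):
    print(bikes, depletion_status(bikes, 40))
```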

Modeling & Rebalancing Results

Once the data processing was completed, we were able to turn our attention to our machine learning model and rebalancing algorithm.

Machine Learning

We trained three Random Forest Classifier models on our processed datasets, one each to predict bike demand, dock demand, and depletion status for each station in New York City when provided a date and time in the future. We used a train-test split to check for overfitting to our training data, and tuned the models as well as we could given the available processing power.
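
The general pattern of fitting one such classifier with a held-out test set might look like the sketch below. Everything in it is an assumption for illustration: the features (hour, day of week, season), the synthetic labels, and the hyperparameters are not the ones used in the project.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 600 station/interval rows with three features.
rng = np.random.default_rng(0)
n_rows = 600
hour = rng.integers(0, 24, size=n_rows)
dayofweek = rng.integers(0, 7, size=n_rows)  # 0 = Monday
season = rng.integers(0, 4, size=n_rows)
X = np.column_stack([hour, dayofweek, season])

# Made-up rule for the target: weekday rush hours are 'High', nights are 'Low'.
rush_hour = ((hour == 8) | (hour == 17)) & (dayofweek < 5)
y = np.where(rush_hour, "High", np.where(hour < 6, "Low", "Medium"))

# Hold out a quarter of the rows to compare train and test accuracy.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

train_acc = model.score(X_train, y_train)
test_acc = model.score(X_test, y_test)
print(f"train accuracy: {train_acc:.3f}, test accuracy: {test_acc:.3f}")
```

A large gap between the two accuracies would suggest overfitting. The same pattern would be repeated for each of the three targets.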

Rebalancing Strategy

Once we had models trained to supply predictions for bike demand, dock demand, and depletion status for each station, we developed an iterative algorithm to identify which stations were in need of having bikes rebalanced out, which were in need of having bikes rebalanced in, and which were not in need of rebalancing at all.

Any station that our models identified as 'Full Risk' (again, a station that had many bikes but few open docks available) and predicted to have a medium-to-high dock demand (comparatively many incoming bikes) and a low-to-medium bike demand (comparatively few bikes going out) was classified as in need of having bikes rebalanced out.

Conversely, any station that our models identified as 'Empty Risk' (a station with few bikes but many open docks) and predicted to have a low-to-medium dock demand (comparatively few incoming bikes) and a medium-to-high bike demand (comparatively many bikes going out) was classified as in need of having bikes rebalanced in.
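
These two rules can be written as a single decision function over the three predictions. The function name and the returned labels are hypothetical; the conditions follow the description above.

```python
def rebalancing_need(status: str, bike_demand: str, dock_demand: str) -> str:
    """Combine the three predictions for a station into a rebalancing decision."""
    if (
        status == "Full Risk"
        and dock_demand in ("Medium", "High")
        and bike_demand in ("Low", "Medium")
    ):
        return "rebalance out"  # bikes should be taken away
    if (
        status == "Empty Risk"
        and dock_demand in ("Low", "Medium")
        and bike_demand in ("Medium", "High")
    ):
        return "rebalance in"   # bikes should be brought in
    return "no rebalancing"


print(rebalancing_need("Full Risk", bike_demand="Low", dock_demand="High"))
print(rebalancing_need("Empty Risk", bike_demand="High", dock_demand="Low"))
print(rebalancing_need("Healthy", bike_demand="High", dock_demand="High"))
```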

Pairing Stations

Stations identified as needing bikes brought in were paired with stations identified as needing bikes taken out based on distance: the closest stations (calculated using Manhattan distance to account for New York City's block system and the fact that the bikes are rebalanced via car) were paired together. Once stations were paired, their depletion statuses were updated in the master table and they were removed from the list of stations to be rebalanced, in order to avoid double counting.

Finally, in the event that more stations needed rebalancing out than rebalancing in (or vice versa), stations with the appropriate depletion status and medium demand for both bikes and docks were used to rebalance the leftover stations.

The algorithm then outputs the results of the paired list as an easy-to-reference table using individual station IDs as identifiers. The results can also be mapped by station. Examples of both the output table and a preliminary mapping are shown below:
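
A simplified sketch of distance-based pairing follows. It uses a greedy closest-pair-first strategy on hypothetical planar coordinates; the station IDs, the coordinates, and the greedy ordering are assumptions, and the depletion-status update in the master table is left out.

```python
from itertools import product


def manhattan(a, b):
    """Manhattan (city-block) distance between two (x, y) points."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def pair_stations(need_in, need_out):
    """Greedily pair stations, closest pair first, using each station once.

    need_in and need_out map station IDs to (x, y) coordinates.
    Returns a list of (from_station, to_station, distance) tuples.
    """
    candidates = sorted(
        (manhattan(pos_in, pos_out), sid_in, sid_out)
        for (sid_in, pos_in), (sid_out, pos_out)
        in product(need_in.items(), need_out.items())
    )
    used_in, used_out, pairs = set(), set(), []
    for dist, sid_in, sid_out in candidates:
        if sid_in in used_in or sid_out in used_out:
            continue  # already paired, skip to avoid double counting
        pairs.append((sid_out, sid_in, dist))
        used_in.add(sid_in)
        used_out.add(sid_out)
    return pairs


# Hypothetical stations: A and B need bikes in; X, Y, and Z need bikes out.
need_in = {"A": (0, 0), "B": (5, 5)}
need_out = {"X": (1, 0), "Y": (4, 6), "Z": (10, 10)}
pairs = pair_stations(need_in, need_out)
print(pairs)
```

In this example station Z is left unpaired, which corresponds to the leftover case the algorithm has to handle separately.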

Conclusion

Thank you so much for taking the time to read about our project! 

In the future, we're interested in fleshing out the datasets with traffic and precipitation data, accounting for one-way streets in our rebalancing algorithm, and developing a user interface for the app that makes it easier to generate the paired list of stations to and from which to rebalance bikes.

Please feel free to take a look at our GitHub accounts below if you're interested in the code that went into making this project possible or in our presentation deck. We can also be found on LinkedIn.

Christian Opperman: LinkedIn | GitHub
Melanie Zheng: LinkedIn | GitHub
Paul Choi: LinkedIn | GitHub


About Authors

Paul Choi


Christian Opperman

Christian Opperman is a data scientist and former project manager at Mitsui & Co., Ltd. with experience in technical writing, a bachelor's in Economics, and a master's in Writing.

Melanie Zheng

Melanie is currently enrolled at the Georgia Institute of Technology, pursuing a master's degree in Computer Science with a Machine Learning specialization. She previously worked as a product manager at Viacom and a project manager at Citigroup.
