Citibike Ain't Got Flow

William Bartlett
Posted on Jul 25, 2016

Motivation: Many of my friends and family members regularly use citibike as an eco-friendly, active way of getting around New York City.  Though they enjoy the service, I often hear them complain that there are no bikes where they want them and too many bikes where they don't want them.  I sought to understand user patterns and bike distribution across the many citibike docks scattered throughout the city.

Data: Citibike has released downloadable trip histories for every month of their service since July of 2013.  I pulled the dataset for just last month (June, 2016), which contains 1,460,318 trips with the following fields:

  • Trip Duration (seconds)
  • Start Time and Date
  • Stop Time and Date
  • Start Station Name
  • End Station Name
  • Station ID
  • Station Lat/Long
  • Bike ID
  • User Type (Customer = 24-hour pass or 7-day pass user; Subscriber = Annual Member)
  • Gender (0 = unknown; 1 = male; 2 = female)
  • Year of Birth

Analysis and Visualization

First I wanted to understand general citibike usage by time of day.  To do this, I decided to count the total number of citibike trips started for each hour of the day and plot my results on a bar chart.  (initial data manipulation code at the end of the post)
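
As a rough illustration of the hourly count described above, here is a minimal Python sketch using only the standard library.  It is not the code from the gist: the sample timestamps, the timestamp format, and the function name trips_per_hour are assumptions made for the example and should be checked against the actual CSV.

```python
from collections import Counter
from datetime import datetime

# Hypothetical sample of trip start times; the real file has about 1.46 million rows.
# The timestamp format below is an assumption, not confirmed by the post.
START_TIMES = [
    "6/1/2016 08:05:12",
    "6/1/2016 08:47:30",
    "6/1/2016 17:15:02",
    "6/4/2016 13:20:45",
]

def trips_per_hour(start_times, fmt="%m/%d/%Y %H:%M:%S"):
    """Return a list of 24 counts, one for each hour of the day (0-23)."""
    counts = Counter(datetime.strptime(t, fmt).hour for t in start_times)
    return [counts.get(hour, 0) for hour in range(24)]

hourly = trips_per_hour(START_TIMES)

# Crude text bar chart: one '#' per trip started in that hour.
for hour, n in enumerate(hourly):
    if n:
        print(f"{hour:02d}:00  {'#' * n}  ({n})")
```

The list of 24 counts can be handed directly to any plotting library to produce the bar chart.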

[Figure: bar chart of citibike trips started by hour of day]

https://gist.github.com/willhbartlett/cf4c5954a49b31f270f93736b6e8d653

One can see that citibike is very clearly a commuter tool, as the bar chart peaks during morning and evening rush hours.

Because citibike usage is highest during rush hours, it seemed important to distinguish weekday and weekend usage.  Separating all trips by weekday starts and weekend starts and plotting hourly usage on a bar chart gave the following two graphs.
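
The weekday/weekend split can be sketched in the same way.  This is an illustrative version under stated assumptions rather than the gist's code: the timestamp format and the name split_hourly_counts are hypothetical, and "weekend" is taken to mean Saturday and Sunday.

```python
from collections import Counter
from datetime import datetime

FMT = "%m/%d/%Y %H:%M:%S"  # assumed timestamp format

def split_hourly_counts(start_times, fmt=FMT):
    """Count trips started per hour, separately for weekdays and weekends."""
    weekday, weekend = Counter(), Counter()
    for t in start_times:
        ts = datetime.strptime(t, fmt)
        # datetime.weekday(): Monday is 0, so 5 and 6 are Saturday and Sunday.
        target = weekend if ts.weekday() >= 5 else weekday
        target[ts.hour] += 1
    return weekday, weekend

# Hypothetical sample: two Wednesday trips and two Saturday trips.
sample = [
    "6/1/2016 08:05:12",
    "6/1/2016 17:15:02",
    "6/4/2016 13:20:45",
    "6/4/2016 13:55:00",
]
weekday_counts, weekend_counts = split_hourly_counts(sample)
print(dict(weekday_counts))
print(dict(weekend_counts))
```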

[Figures: bar charts of trips started by hour of day, one for weekdays and one for weekends]

https://gist.github.com/willhbartlett/fff8089ef349edb7a8227d85a0444298

Not surprisingly, user patterns are very different on weekdays and weekends.  Interested in potential directional flow problems caused by commuters using citibike, I decided to proceed with just weekday data.

City Usage Visualization Functions

In order to visualize activity at individual stations as well as across the city as a whole, I wrote two functions to generate maps of station usage across specified time periods.

maps_percent(x, y): creates a map that shows, between hours x and y, the percentage of all citibike trips that started at each station.  Each station is represented by a dot, and its percentage is represented by dot color and size.  maps_percent also prints the top 5 stations with the highest percentage, as well as the median percentage. (code at the end of the post)
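
The calculation behind a function like maps_percent can be sketched without the mapping step.  The version below is a simplified stand-in, not the gist's implementation: the function name start_share, the (hour, station) input shape, and the choice to treat hours x through y as inclusive are all assumptions.

```python
from collections import Counter
from statistics import median

def start_share(trips, x, y):
    """Percent of trips started between hours x and y (inclusive) at each station.

    trips is an iterable of (start_hour, start_station) pairs.
    """
    starts = Counter(station for hour, station in trips if x <= hour <= y)
    total = sum(starts.values())
    if total == 0:
        return {}
    return {station: 100.0 * n / total for station, n in starts.items()}

# Hypothetical trips with made-up station names.
trips = [(8, "A"), (8, "A"), (9, "B"), (9, "C"), (17, "A")]

shares = start_share(trips, 6, 10)
top5 = sorted(shares.items(), key=lambda kv: kv[1], reverse=True)[:5]
median_share = median(shares.values())
print(top5)
print(median_share)
```

Joining these percentages to station latitude and longitude would then give the dot color and size for each station on the map.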

maps_ratio(x, y): creates a map that shows, between hours x and y, the ratio of trips started to trips ended at each station.  Each station is represented by a dot, and the size of its trips started:ended ratio is represented by dot color and size.  maps_ratio also prints the top 5 stations with the highest ratio, as well as the median ratio. (code at the end of the post)
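
The ratio itself can be sketched as follows.  Again this is an illustration under assumptions, not the gist's code: the name start_end_ratio and the tuple layout are hypothetical, starts are assigned to the window by start hour and ends by stop hour, and stations with no ended trips in the window are left out because their ratio is undefined.

```python
from collections import Counter

def start_end_ratio(trips, x, y):
    """Trips started divided by trips ended at each station, for hours x..y inclusive.

    trips is an iterable of (start_hour, start_station, end_hour, end_station).
    """
    starts = Counter(s for sh, s, eh, e in trips if x <= sh <= y)
    ends = Counter(e for sh, s, eh, e in trips if x <= eh <= y)
    # A Counter returns 0 for missing keys, so a station with ends but no
    # starts gets a ratio of 0.0.
    return {station: starts[station] / n_end for station, n_end in ends.items()}

# Hypothetical morning traffic with made-up station names.
trips = [
    (8, "Home", 8, "Office"),
    (8, "Home", 9, "Office"),
    (9, "Home", 9, "Hub"),
    (8, "Hub", 8, "Home"),
    (17, "Office", 17, "Home"),
]

ratios = start_end_ratio(trips, 6, 10)
print(ratios)
```

A ratio above 1 means more bikes left the station than arrived during the window, so the station tends to empty out; a ratio below 1 means it tends to fill up.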

 

Mapping Station Activity

I first wanted to get a sense of station activity in general (not broken down by hour) to understand which stations experienced heavy (start) usage and which had extreme trips started:ended ratios.

https://gist.github.com/willhbartlett/37a7929ace143a3503b2aa4a349a650a

According to the percentage map, more citibike trips start in commercial, downtown Manhattan than in residential neighborhoods such as the Upper East and West Sides, Alphabet City, and Brooklyn.

According to the ratio map, most stations have similar flow ratios of around 1 trip started for every trip ended. However, there are several major outliers in both Manhattan and Brooklyn that see far more trips leave than arrive, and would thus naturally become depleted.

Mapping Station Activity by Time of Day

Then I wanted to see how this usage changed by time of day.  I chose hour groupings that represented morning rush hour, the work day, and evening rush hour.  Particularly interested in the flow and natural redistribution of bikes across individual stations, I chose the maps_ratio() function.

https://gist.github.com/willhbartlett/9bff0735f32ed0a481a35c7465dc0d31

These three graphs show several interesting patterns.  In the morning (hrs 6-10), stations in midtown and the financial district (really most of the central island, through to the tip) have low start:end ratios, whereas peripheral, residential neighborhoods show high ratios.  Commuters take bikes into the center of Manhattan on their way to work, and there is little balancing traffic in the other direction.  This effect seems to disappear during the day (hrs 11-15), when the flow ratios of different stations are close to parity; the middle of a work day seems to create no clear usage directionality.  In the evening (hrs 16-20), the commuter effect reappears, this time, not surprisingly, in reverse.  Stations in midtown and the financial district have higher flow ratios than those in peripheral residential neighborhoods as people commute home from work.  Additionally, large outliers (big black dots) appear during morning and evening rush hours and disappear during the day, which can be explained by the fact that many of them are located around transportation hubs.
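
The morning/evening reversal described above can be reproduced on toy data.  The sketch below is illustrative only: the window names, the function names, and the two made-up stations ("Home" for a residential dock, "Office" for a midtown dock) are assumptions, and the hour ranges are treated as inclusive.

```python
from collections import defaultdict

# Hour groupings from the post; the labels are hypothetical.
WINDOWS = {"morning": (6, 10), "midday": (11, 15), "evening": (16, 20)}

def window_of(hour):
    """Return the name of the window containing this hour, or None."""
    for name, (lo, hi) in WINDOWS.items():
        if lo <= hour <= hi:
            return name
    return None

def ratios_by_window(trips):
    """Map (window, station) to trips started / trips ended in that window."""
    starts, ends = defaultdict(int), defaultdict(int)
    for start_hour, start_station, end_hour, end_station in trips:
        w = window_of(start_hour)
        if w is not None:
            starts[(w, start_station)] += 1
        w = window_of(end_hour)
        if w is not None:
            ends[(w, end_station)] += 1
    # Keys with no ended trips are skipped, since the ratio is undefined there.
    return {key: starts.get(key, 0) / n for key, n in ends.items()}

# Made-up commuter traffic: mostly Home to Office in the morning, reversed at night.
trips = (
    [(7, "Home", 7, "Office")] * 3 + [(8, "Office", 8, "Home")]
    + [(12, "Home", 12, "Office"), (13, "Office", 13, "Home")]
    + [(17, "Home", 17, "Office")] + [(18, "Office", 18, "Home")] * 3
)

flow = ratios_by_window(trips)
for window in WINDOWS:
    print(window, flow[(window, "Home")], flow[(window, "Office")])
```

On this toy data the residential station has a ratio of 3 in the morning and one third in the evening, while the midtown station shows the opposite, which is the same pattern the maps show at city scale.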

Conclusions:

  • There is a very clear difference between weekday and weekend citibike user patterns.
  • Differences across neighborhood:
    • Midtown and central Manhattan have high shares of all trips started, whereas Alphabet City, Brooklyn, and the Upper East/West Sides have low shares of all trips started.
    • Ratio disequilibrium arises during commuter hours (large black dots appear on ratio plots during morning and evening rush hours), which means certain docks are naturally depleted during those times.  In the morning, there is flow out of residential neighborhoods into midtown and the downtown financial district.  In the evening, the opposite is true.
  • During the middle of the workday (hours 11-15), flow is naturally more equilibrated (dots are more similar in size and color to each other and outliers disappear).
  • Usage and ratio visualizations are dominated by very strong individual points--frequently transportation hubs.
    • Pershing Square North -- Grand Central Area (% of all trips graph)
    • Penn Station Valet (ratio plots 0-23 & 6-10)
    • W 42 St & Dyer Ave - Port Authority/Lincoln Tunnel Area (ratio plots 0-23 & 6-10)
    • W 52 St & 5 Ave - Rockefeller Center Area (ratio plot 16-20)

Further Questions:

  • Where is the unmet demand?  Though there is clearly usage disequilibrium at certain stations, that might be due partly to manual redistribution (trucks shuttling bikes to and from certain stations) to seemingly high start-traffic stations.  Redistribution of this sort makes those stations' trips_started:trips_ended ratios even more lopsided, as truck refills allow a station additional starts without corresponding additional ends.  What other stations could achieve such lopsided ratios if manually refilled?
  • What would a similar analysis look like for weekend data?
  • How do user patterns change by time of year?
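
One possible way to approach the first question above, about manual redistribution, is to use the Bike ID field: if a bike's next trip starts at a different station from the one where its previous trip ended, the bike was presumably moved outside of a recorded trip.  The sketch below is only a heuristic under assumptions, not something done in this analysis: the function name inferred_refills and the tuple layout are hypothetical, and such moves may reflect repairs or other handling as well as truck rebalancing.

```python
from collections import Counter
from itertools import groupby
from operator import itemgetter

def inferred_refills(trips):
    """Count, per station, bikes that appear there without having ridden in.

    trips is an iterable of (bike_id, start_order, start_station, end_station),
    where start_order is any sortable value such as a timestamp.
    """
    refills = Counter()
    ordered = sorted(trips, key=itemgetter(0, 1))  # by bike, then by time
    for _, bike_trips in groupby(ordered, key=itemgetter(0)):
        prev_end = None
        for _, _, start, end in bike_trips:
            # A gap between the last drop-off and the next pick-up implies
            # the bike was moved off the record.
            if prev_end is not None and start != prev_end:
                refills[start] += 1
            prev_end = end
    return refills

# Made-up trips: bike 1 ends at station C but next starts at station A.
trips = [
    (1, 1, "A", "B"),
    (1, 2, "B", "C"),
    (1, 3, "A", "B"),
    (2, 1, "C", "A"),
    (2, 2, "A", "C"),
]

refills = inferred_refills(trips)
print(refills)
```

Subtracting these inferred refills from a station's starts would give a rough picture of how lopsided its ratio is without outside help.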

With More Time:

  • Citibike earns marginal revenue only from non-subscriber usage (subscribers pay an annual fee).  Many non-subscribers are tourists, and thus likely do not conform to commuter patterns.  An analysis of non-subscriber usage patterns might provide insights geared more directly towards revenue maximization than usage maximization.

Initial Data Manipulation code:

https://gist.github.com/willhbartlett/1c448958b894feac3f98a4ceadd79b47

maps_percent Function Code:

https://gist.github.com/willhbartlett/d723cd7090cdc48cc7e9922f5145bdfb

maps_ratio Function Code:

https://gist.github.com/willhbartlett/858f60c3f6ea398876cd6aac2a3f2317

Map templates from URL : http://maps.googleapis.com/maps/api/staticmap?center=New+York+City&zoom=12&size=640x640&scale=2&maptype=roadmap&language=en-EN&sensor=false
Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=New%20York%20City&sensor=false

 

 

About Author

William Bartlett

Will Bartlett is a History of Science and Medicine Major from Yale University who recently took a leave of absence from medical school to explore data science. As an undergraduate, he studied the role of data in medicine...
