Citibike Ain't Got Flow
Motivation: Many of my friends and family members regularly use citibike as an eco-friendly, active way of getting around New York City. Though they enjoy the service, I often hear them complain that there are no bikes where they want them and too many bikes where they don't. I sought to understand user patterns and bike distribution across the many citibike docks scattered throughout the city.
Data: Citibike has released downloadable trip histories for every month of its service since July 2013. I pulled the dataset for just last month (June 2016), which contains the following fields for each trip (n = 1,460,318 trips):
- Trip Duration (seconds)
- Start Time and Date
- Stop Time and Date
- Start Station Name
- End Station Name
- Station ID
- Station Lat/Long
- Bike ID
- User Type (Customer = 24-hour pass or 7-day pass user; Subscriber = Annual Member)
- Gender (0=unknown; 1=male; 2=female)
- Year of Birth
Analysis and Visualization
First I wanted to understand general citibike usage by time of day. To do this, I counted the total number of citibike trips started in each hour of the day and plotted the results on a bar chart. (Initial data manipulation code is at the end of the post.)
One can see that citibike is very clearly a commuter tool, as the bar chart peaks during morning and evening rush hours.
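The hourly count described above can be sketched in a few lines. This is only an illustrative Python version using the standard library, not the code behind the chart: the sample rows, the function name trips_per_hour, and the timestamp format are all assumptions.

```python
from collections import Counter
from datetime import datetime

# Hypothetical sample of the "Start Time and Date" column.
# The timestamp format is an assumption about the CSV export.
SAMPLE_START_TIMES = [
    "6/1/2016 08:05:12",
    "6/1/2016 08:47:30",
    "6/1/2016 17:15:00",
    "6/4/2016 13:20:45",
]

def trips_per_hour(start_times, fmt="%m/%d/%Y %H:%M:%S"):
    """Count trips started in each hour of the day (0-23)."""
    counts = Counter(datetime.strptime(s, fmt).hour for s in start_times)
    # Return all 24 hours so that empty hours appear as zero-height bars.
    return [counts.get(hour, 0) for hour in range(24)]

hourly = trips_per_hour(SAMPLE_START_TIMES)
print(hourly)
```

The resulting list of 24 counts can be handed directly to any bar-chart function.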
Because citibike usage is highest during rush hours, it seemed important to distinguish weekday and weekend usage. Separating all trips into weekday starts and weekend starts and plotting hourly usage on a bar chart gave the following two graphs.
Not surprisingly, user patterns are very different on weekdays and weekends. Interested in potential directional flow problems caused by commuters using citibike, I decided to proceed with just weekday data.
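As a rough illustration of the weekday/weekend split, the sketch below buckets trips by the day of the week of their start time. It assumes the same hypothetical timestamp format as before; split_hourly_counts is an illustrative name, not a function from the original analysis.

```python
from collections import Counter
from datetime import datetime

# Hypothetical start times; June 1, 2016 was a Wednesday, June 4 a Saturday.
SAMPLE_START_TIMES = [
    "6/1/2016 08:05:12",
    "6/1/2016 17:15:00",
    "6/4/2016 13:20:45",
    "6/5/2016 13:55:10",
]

def split_hourly_counts(start_times, fmt="%m/%d/%Y %H:%M:%S"):
    """Return (weekday_counts, weekend_counts), each a list of 24 hourly totals."""
    weekday, weekend = Counter(), Counter()
    for raw in start_times:
        started = datetime.strptime(raw, fmt)
        # datetime.weekday() returns 0 for Monday through 6 for Sunday.
        bucket = weekend if started.weekday() >= 5 else weekday
        bucket[started.hour] += 1
    as_list = lambda counts: [counts.get(hour, 0) for hour in range(24)]
    return as_list(weekday), as_list(weekend)

weekday_counts, weekend_counts = split_hourly_counts(SAMPLE_START_TIMES)
print(weekday_counts)
print(weekend_counts)
```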
City Usage Visualization Functions
In order to visualize activity at individual stations as well as across the city as a whole, I wrote two functions to generate maps of station usage across specified time periods.
maps_percent(x, y): creates a map that shows, between hours x and y, the percentage of all citibike trips that started at each station. Each station is represented by a dot, and its percentage is represented by dot color and size. maps_percent also prints the 5 stations with the highest percentage, as well as the median percentage. (code at the end of the post)
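The numbers behind such a map can be sketched without any plotting. The Python below is a hedged illustration, not the maps_percent code itself: start_share_by_station is a made-up name, the trips are invented sample data, and treating the hour range as inclusive at both ends is an assumption.

```python
from collections import Counter
from statistics import median

# Hypothetical (start_hour, start_station) pairs.
TRIPS = [
    (8, "Pershing Square North"),
    (8, "Pershing Square North"),
    (9, "W 52 St & 5 Ave"),
    (9, "W 42 St & Dyer Ave"),
    (14, "W 52 St & 5 Ave"),
]

def start_share_by_station(trips, x, y):
    """Percentage of all trips started between hours x and y (inclusive), per station."""
    starts = Counter(station for hour, station in trips if x <= hour <= y)
    total = sum(starts.values())
    if total == 0:
        return {}
    return {station: 100.0 * n / total for station, n in starts.items()}

shares = start_share_by_station(TRIPS, 6, 10)
# Highest shares first, mirroring the "top 5" printout described above.
top5 = sorted(shares.items(), key=lambda item: item[1], reverse=True)[:5]
median_share = median(shares.values())
print(top5)
print(median_share)
```

Dot color and size on the map would then be driven by each station's value in shares.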
maps_ratio(x, y): creates a map that shows, between hours x and y, the ratio of trips started to trips ended at each dock. Each station is represented by a dot, and the magnitude of its trips started:ended ratio is represented by dot color and size. maps_ratio also prints the 5 stations with the highest ratio, as well as the median ratio. (code at the end of the post)
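A comparable sketch for the ratio calculation follows. It is again an illustration rather than the maps_ratio code: the function name and sample trips are hypothetical, starts are filtered by start hour and ends by stop hour, and stations with no ended trips in the window are skipped because their ratio is undefined. Those last two choices are assumptions.

```python
from collections import Counter
from statistics import median

# Hypothetical (start_hour, start_station, end_hour, end_station) tuples.
TRIPS = [
    (8, "Penn Station Valet", 8, "W 52 St & 5 Ave"),
    (8, "Penn Station Valet", 9, "W 52 St & 5 Ave"),
    (9, "W 42 St & Dyer Ave", 9, "W 52 St & 5 Ave"),
    (9, "W 52 St & 5 Ave", 9, "Penn Station Valet"),
    (17, "W 52 St & 5 Ave", 17, "W 42 St & Dyer Ave"),
]

def start_end_ratio_by_station(trips, x, y):
    """Trips started divided by trips ended per station, for hours x..y inclusive."""
    starts = Counter(s for sh, s, eh, e in trips if x <= sh <= y)
    ends = Counter(e for sh, s, eh, e in trips if x <= eh <= y)
    # Assumption: a station with zero ends has no defined ratio and is left out.
    return {station: starts.get(station, 0) / n for station, n in ends.items()}

ratios = start_end_ratio_by_station(TRIPS, 6, 10)
top5 = sorted(ratios.items(), key=lambda item: item[1], reverse=True)[:5]
median_ratio = median(ratios.values())
print(top5)
print(median_ratio)
```

A ratio above 1 means a station sends out more bikes than it receives in that window, and a ratio below 1 means the reverse.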
Mapping Station Activity
I first wanted to get a sense of station activity in general (not broken down by hour) to understand which stations experienced heavy (start) usage and which had extreme trips started:ended ratios.
Mapping Station Activity by Time of Day
Then I wanted to see how this usage changed by time of day. I chose hour groupings that represented morning rush hour, the work day, and evening rush hour. Because I was particularly interested in the flow and natural redistribution of bikes across individual stations, I used the maps_ratio() function.
These three graphs show several interesting patterns. In the morning (hrs 6-10), stations in Midtown and the financial district (really most of the central island, through to the tip) have low start:end ratios, whereas peripheral, residential neighborhoods show high ratios. Commuters take bikes into the center of Manhattan on their way to work, and there is little balancing traffic in the other direction. This effect seems to disappear during the day (hrs 11-15), when the flow ratios of different stations are close to parity: the middle of a work day seems to create no clear usage directionality. In the evening (hrs 16-20), the commuter effect reappears, this time, not surprisingly, in reverse. Stations in Midtown and the financial district have higher flow ratios than those in peripheral residential neighborhoods as people commute home from work. Additionally, large outliers (big black dots) appear during morning and evening rush hours and disappear during the day, which can be explained by the fact that many of them are located around transportation hubs.
Conclusions:
- There is a very clear difference between weekday and weekend citibike user patterns.
- Differences across neighborhoods: Midtown and central Manhattan have high shares of all trips started, whereas Alphabet City, Brooklyn, and the Upper East/West Sides have low shares of all trips started.
- Ratio disequilibrium arises during commuter hours (large black dots appear on ratio plots during morning and evening rush hours), which means certain docks are naturally depleted during those times. In the morning, there is flow out of residential neighborhoods into Midtown and the downtown financial district. In the evening, the opposite is true.
- During the middle of the workday (hours 11-15), flow is naturally more equilibrated (dots are more similar in size and color to each other and outliers disappear).
- Usage and ratio visualizations are dominated by very strong individual points, frequently transportation hubs:
  - Pershing Square North: Grand Central area (% of all trips graph)
  - Penn Station Valet (ratio plots 0-23 & 6-10)
  - W 42 St & Dyer Ave: Port Authority/Lincoln Tunnel area (ratio plots 0-23 & 6-10)
  - W 52 St & 5 Ave: Rockefeller Center area (ratio plot 16-20)
Further Questions:
- Where is the unmet demand? Though there is clearly usage disequilibrium at certain stations, that might be due partly to manual redistribution (trucks shuttling bikes to and from certain stations) toward stations with seemingly high start traffic. Redistribution of this sort makes those stations' trips_started:trips_ended ratios even more lopsided, as truck refills allow a station additional starts without corresponding additional ends. What other stations could achieve such lopsided ratios if manually refilled?
- What would a similar analysis look like for weekend data?
- How do user patterns change by time of year?
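One possible starting point for the manual-redistribution question is the Bike ID field: if a bike's next trip starts at a different station from where its previous trip ended, the bike was moved by something other than a recorded ride. The Python below is only a sketch of that idea under stated assumptions. The function name and sample trips are hypothetical, start times are reduced to sortable integers, and an off-trip move could just as well be maintenance as a truck refill.

```python
from collections import Counter, defaultdict

# Hypothetical trips: (bike_id, start_time, start_station, end_station).
# start_time only needs to be sortable; small integers are used here.
TRIPS = [
    (101, 1, "Penn Station Valet", "W 52 St & 5 Ave"),
    (101, 2, "W 52 St & 5 Ave", "W 42 St & Dyer Ave"),
    (101, 3, "Penn Station Valet", "W 52 St & 5 Ave"),  # bike moved off-trip
    (202, 1, "W 42 St & Dyer Ave", "Penn Station Valet"),
    (202, 2, "Penn Station Valet", "W 52 St & 5 Ave"),
]

def inferred_relocations(trips):
    """Count, per station, bikes that arrived there without a recorded trip."""
    rides_by_bike = defaultdict(list)
    for bike_id, start_time, start_station, end_station in trips:
        rides_by_bike[bike_id].append((start_time, start_station, end_station))
    arrivals = Counter()
    for rides in rides_by_bike.values():
        rides.sort()  # chronological order for this bike
        for (_, _, prev_end), (_, next_start, _) in zip(rides, rides[1:]):
            if prev_end != next_start:
                arrivals[next_start] += 1
    return arrivals

arrivals = inferred_relocations(TRIPS)
print(arrivals)
```

Subtracting these inferred arrivals from a station's starts would give a rough picture of what its ratio might look like without refills.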
With More Time:
- Citibike only gets marginal revenue from non-subscriber usage (subscribers pay an annual fee). Many non-subscribers are tourists, and thus likely do not conform to commuter patterns. An analysis of non-subscriber usage patterns might provide insights more directly geared towards revenue maximization than usage maximization.
Initial Data Manipulation code:
maps_percent Function Code:
maps_ratio Function Code:
Map templates from URL : http://maps.googleapis.com/maps/api/staticmap?center=New+York+City&zoom=12&size=640x640&scale=2&maptype=roadmap&language=en-EN&sensor=false