Data Inferential Analysis On Citi Bike Ridership
The skills I demoed here can be learned through taking Data Science with Machine Learning bootcamp with NYC Data Science Academy.
Citi Bike, a bike-sharing program operated by Motivate, offers New Yorkers an option for transportation that is faster than walking, cheaper than a taxi, greener than a bus, and more fun than the subway. The program was launched in May 2013 with 6,000 bikes serving Manhattan and Brooklyn.
Through years of expansion, Citi Bike now has dock stations across Manhattan, Brooklyn, Queens, and Jersey City, and the bike fleet has increased to over 13,000. It has become popular for both weekday commutes and weekend leisure activities, and so makes up a crucial component of New York City’s transportation network.
For this project, we set out to help Citi Bike learn the riding patterns in NYC in order to identify the most crucial factors that could impact their business so that they could devise an effective expansion strategy. To do that, we first wanted to understand how bike ridership has changed in recent years. The graph below shows the average Citi Bike trips by month from 2017 (the lightest blue) to 2019 (the darkest blue).
It appears that there is an upward trend in average trips taken by New Yorkers, which indicates the bikes have succeeded in attracting new customers and gaining popularity. Taking a closer look, we observe that growth was smaller in 2018 (8%) than in 2019 (20%). This could be due to slower station expansion in 2018, as we will discuss later.
Two questions we set out to answer were: What could have caused this upward trend besides marketing and station expansions? How did other public transportation evolve during the same period of time?
In order to understand the whole picture, we found some transportation data from the 2019 mobility report published by the NYC Department of Transportation. The following graph shows the average yearly ridership in millions for bus, subway, taxi (including Uber and Lyft) and cycling from 2011 to 2017. This graph indicates a downward trend for bus and subway, and an upward trend for both taxi and cycling.
Our further research offers an explanation for the drop in bus ridership. The graph below shows the bus speed trend for both citywide (blue) and Manhattan (pink) routes. The green dotted line represents the average bike speed published by Citi Bike, which is around 7.5 mph. Since 2012, the bus speed in Manhattan (below 60th Street) has dropped significantly, from 9.1 mph to 7.1 mph, due to congested traffic in the city.
A slower drop in citywide bus speed is also visible in the trend. This suggests that riding a Citi Bike could potentially cut down on commuting time compared to riding a bus, especially in Manhattan. This could be the reason why buses are slowly losing popularity in NYC.
After comparing Citi Bike ridership to bus ridership, we investigated how Citi Bike compares to taxis. We learned from our later analyses that the most frequent biking distance for Citi Bike users is between 1 and 1.5 miles. For this range of distance, the average time spent riding a Citi Bike is around 11 minutes, both in the congested midtown area and in other city neighborhoods like Queens and Brooklyn where there are fewer traffic jams.
The average duration of a taxi ride in the midtown area is 15 minutes, and this indicates that riding a bicycle saves commuters a lot of time. In less congested areas, the duration of taking a taxi is about the same as riding the Citi Bike. The cost, however, is considerably lower for cycling.
From the above comparisons, we could clearly observe the advantages of Citi Bike over buses and taxis in both time and money saving. We would then dive deeper into the Citi Bike operation to identify important factors that have a strong impact on its business and study the similarity and discrepancies between all dock stations. Finally, we wanted to help Citi Bike increase its ridership and make recommendations for future expansion.
The monthly trip data obtained from Citi Bike cover information, such as trip duration, start and end time, start and end locations, Bike ID, and type of users (subscriber or not). Due to the size of the monthly trip data, we took a 5% sample from each month as our data for exploratory analyses and model building.
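The per-month sampling described above can be sketched with pandas. This is only an illustration on synthetic data: the column names `month` and `tripduration` and the random seed are assumptions, not details from the project.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the monthly trip files (column names are hypothetical).
rng = np.random.default_rng(0)
trips = pd.DataFrame({
    "month": np.repeat(["2017-01", "2017-02", "2017-03"], 200),
    "tripduration": rng.integers(60, 3600, size=600),
})

# Draw 5% of the rows from every month, so each month keeps its relative weight.
sample = trips.groupby("month").sample(frac=0.05, random_state=42)

print(sample["month"].value_counts().sort_index())
```

Sampling within each month, rather than from the pooled data, keeps busy and quiet months represented in proportion to their actual trip volume.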
The weather reports from NOAA cover daily average temperature information, precipitation, and whether or not a day had bad weather conditions for biking, such as thunder or haze, in the NY-Central Park Area.
The timeframe for this project is from January 2017 to December 2019 from 5 am - 10 pm when most of the activities occurred. After data merging and cleaning, our data have over 2.6 million trip observations and 57 features including some custom features.
We first wanted to see how and where Citi Bike expanded its stations in the past 3 years. The plot below shows the total number of stations by quarter. We observed that the number of stations increased significantly in the second half of 2017 as well as in the last quarter of 2019; however, Citi Bike did not invest heavily in station expansion during 2018. This aligns well with the earlier observation that ridership grew more slowly in 2018 than in 2019.
We then took a closer look at where those new stations were located. Green circles in the animation below represent new stations added during the quarter, and red circles indicate stations that were removed. We observed that a lot of new stations were built in Harlem, Long Island City and Brooklyn in 2017 Q3 and Q4. There were not a lot of activities in 2018 compared to the other two years. In 2019, the expansion focus was given to the Brooklyn borough, where Citi Bike built over 50 new stations.
Next, we wanted to understand how activity has changed over time and how weather impacts total ridership. The graph below is a seasonal trend of Citi Bike trips by day from 2017 to 2019. The top graph shows the total count of trips, the middle shows the average trip duration in seconds, and the last graph shows the average distance traveled per trip. A clear seasonality shows up in all three measures.
Trips increased not just in number but also in duration and distance in warmer seasons (for example, August) compared to colder seasons (for example, February). This is not surprising, as people spend more time outdoors in warmer weather.
After seeing this seasonality in activity, we further investigated how temperature impacts ridership to confirm what we learned from the seasonal trend. The graph below shows the count of trips per day against the average temperature of that day. We observed a linear relationship between these two features. On average, a one-degree Fahrenheit increase in the daily average temperature corresponds to roughly 760 more trips per day.
How about other weather conditions? To answer this question, we compared the number of trips on days with a bad weather condition against days without one. The conditions shown here include precipitation, snow, fog, heavy fog, thunder, and haze. Blue bars represent days without the condition and orange bars represent days with it. We saw a clear drop in the number of rides on days with a weather condition, in all categories.
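As a rough sketch of how such a slope can be estimated, the snippet below fits a straight line to synthetic daily data with NumPy. The data are generated with a slope of 760 taken from the text; the intercept, noise level, and variable names are made up for illustration.

```python
import numpy as np

# Synthetic daily data: the 760 trips/°F slope comes from the text,
# the intercept (5000) and the noise level are arbitrary assumptions.
rng = np.random.default_rng(0)
avg_temp_f = rng.uniform(20, 90, size=365)
daily_trips = 5000 + 760 * avg_temp_f + rng.normal(0, 3000, size=365)

# Ordinary least-squares fit of a degree-1 polynomial: trips = slope * temp + intercept.
slope, intercept = np.polyfit(avg_temp_f, daily_trips, deg=1)

print(f"Estimated extra trips per +1 °F: {slope:.0f}")
```

The fitted slope is read as the average change in daily trips for each additional degree, which is how the figure quoted above should be interpreted.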
From these analyses, we concluded that weather has a big impact on the bike ridership, and Citi Bike should pay attention to the weather forecast in preparation for a demand shift in response to sudden weather change or extreme weather situations.
Weekly and Hourly Trends
After seeing the impact of weather on the ridership, we were curious to learn about the changes in ridership during a week and a day. Here is a graph comparing weekends (left) and weekdays (right) activities on the same three measures we discussed earlier from the seasonal trend.
There are more trips during weekdays, as most Citi Bike users ride to commute to work. Users spend an average of 3 minutes less on a bike per trip on weekdays than on weekends. A likely explanation is that weekday riders are under time constraints and try to reach their destinations as quickly as possible. In both scenarios, the average trip distance is 1.5 miles.
The map below shows the hotspots for the weekends (orange circles) and weekdays (blue circles). Most of the weekend activities occur in areas such as Central Park and along the Hudson River. During weekdays, most of the activities occur in mid-Manhattan where most of the business centers are located. Knowing this information could help Citi Bike to come up with a better logistic plan during a week.
The next question we had in mind was how demand changes over the course of a day between the hours of 5 am and 10 pm. The animation below helped us visualize the hourly change in activity during a day. The color represents the amount of activity at each station; the darker the color, the more activity. For better visualization, we filtered out stations with very low activity in this animation. We do see a major variation in the number of trips during a day across all stations.
To help us identify the busy times during the day, we plotted the hourly trend of bike ridership. During 8 - 9 am and 4 - 7 pm, we observed a higher trip count than early morning, late evening and time in between. We classified these hours as rush hours for later analysis.
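A minimal sketch of the rush-hour flag is shown below. It assumes that "8 - 9 am" and "4 - 7 pm" refer to trips starting in the hours 8 and 16 to 18; the exact boundaries used in the project are not stated, and the list of start hours is invented.

```python
from collections import Counter

# Assumption: "8 - 9 am" and "4 - 7 pm" are read as start hours 8 and 16-18.
RUSH_HOURS = {8, 16, 17, 18}

def is_rush_hour(hour: int) -> bool:
    """Return True when a trip's start hour (0-23) falls in a rush hour."""
    return hour in RUSH_HOURS

# Made-up start hours for ten trips.
start_hours = [6, 8, 8, 8, 12, 13, 17, 17, 18, 21]

# Hourly trend: number of trips starting in each hour.
trips_per_hour = Counter(start_hours)
rush_trips = sum(n for hour, n in trips_per_hour.items() if is_rush_hour(hour))

print(trips_per_hour.most_common(3))
print(f"{rush_trips} of {len(start_hours)} trips started in a rush hour")
```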
Once we had identified the rush hours, we were interested to see whether supply and demand were balanced during those hours. This map shows stations that could have a shortage or surplus of bikes during rush hours. Red circles are locations where there were many more outbound bikes than inbound bikes during a one-hour period.
This could potentially lead to a shortage of bike supply. Blue circles, on the other hand, are stations that could potentially have a surplus in bike supply. Green circles mean they are in a healthy state.
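The red/blue/green labeling can be sketched as a comparison of outbound and inbound counts per station and hour. The threshold of 10 bikes and the station names below are hypothetical; the write-up does not state the cutoff that was actually used.

```python
def classify_station(outbound: int, inbound: int, threshold: int = 10) -> str:
    """Label one station-hour by its net bike flow (threshold is an assumed value)."""
    net_outflow = outbound - inbound
    if net_outflow > threshold:
        return "shortage"   # many more bikes leaving than arriving (red)
    if net_outflow < -threshold:
        return "surplus"    # many more bikes arriving than leaving (blue)
    return "balanced"       # roughly even flow (green)

# Made-up (outbound, inbound) counts for one rush hour.
hourly_counts = {
    "Station A": (45, 12),
    "Station B": (8, 40),
    "Station C": (20, 18),
}

status = {name: classify_station(out, inb) for name, (out, inb) in hourly_counts.items()}
print(status)
```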
Notice that there are a lot of red circles at 8 am near the Lower East Side. These are stations located in Alphabet City, where there is no subway coverage nearby, so a lot of people there ride bikes to commute to work. It is also worth mentioning that we observed a lot of blue circles (inbound much greater than outbound) in Midtown East, the Diamond District, and the Financial District.
These are the popular destinations that have a high density of businesses and offices. The opposite trend was observed during the afternoon rush hour (4 - 7 pm). Those previously blue locations now become red, indicating that a lot of people check out bikes from those locations after a day of work. We identified an opposite flow of traffic compared to the morning rush hours.
We have now discussed the seasonal, weekly and hourly trends of bike activity. Do all stations behave in similar ways, or are they substantially different from each other? We used K-Means to group similar stations. Here we selected 4 clusters (groups) based on the result from an elbow plot. The activities are summarized in the table below. Cluster 3 contains the stations with the highest activity in terms of hourly ridership and percent usage (total activity / station capacity).
Cluster 1 also has high activity compared to cluster 2 and 0, which have similar hourly ridership but very different average travel distance. Cluster 0 are stations that have the lowest activity but the farthest distance.
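The elbow-plot procedure can be sketched with scikit-learn as follows. The three station features and all the numbers are synthetic assumptions chosen to form four groups; the project's actual feature set is not listed in the text.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic station features (assumed for illustration):
# [hourly ridership, percent usage, average trip distance in miles]
rng = np.random.default_rng(0)
centers = np.array([[2, 20, 2.5], [10, 60, 1.2], [3, 25, 0.8], [25, 120, 1.3]])
stations = np.vstack(
    [c + rng.normal(0, [0.5, 3, 0.1], size=(50, 3)) for c in centers]
)

# Scale features so no single unit dominates the Euclidean distance.
X = StandardScaler().fit_transform(stations)

# Elbow method: record the within-cluster sum of squares for each k.
inertias = {}
for k in range(1, 9):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias[k] = model.inertia_

# Fit the final model with the k where the curve flattens (4 in the text).
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)
print({k: round(v, 1) for k, v in inertias.items()})
```

Plotting `inertias` against `k` gives the elbow plot; the point where further clusters stop reducing the inertia much is the usual choice of `k`.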
We are going to see where each cluster is located in the below map. Clusters are color-coded in this map. We were able to identify cluster 3 (in teal, the highest activity stations) containing stations near major transit hubs, such as Grand Central, Port Authority and Penn Station, as well as tourist hubs like Central Park. Cluster 1 (in orange, the second-highest activity stations) are stations that are mostly located in central Manhattan where there is a denser population and more congested traffic.
Low Activity Groups
Cluster 0 (in blue) and 2 (in red) are low activity groups with long travel distance and short travel distance, respectively. We are going to take a closer look at these two groups. The map below shows the popular destinations in orange from cluster 0 stations which are in blue. The majority of the cluster 0 stations are located in the Upper West Side and Bushwick in Brooklyn. When we look at popular destinations, they span most of Manhattan and some areas in Brooklyn.
However, if we look at stations from cluster 2, the origins and destinations overlap, suggesting a shorter travel distance. We found this quite interesting, and in future work we would like to explore why trips from cluster 0 stations travel farther than trips from other stations. One possibility is that access to other public transportation in these areas differs from other groups in terms of the number of choices and service frequency.
Perhaps the demography is different, which could have a big impact on the activity. We would need more public transportation and demography data to confirm our ideas.
Now that our exploratory data analyses have given us some idea of what plays an important role in rider activity, we will discuss the machine learning tasks we performed. Our goals with machine learning were:
- Build regression models to predict hourly checkout counts
- Classify whether a station would be high or low turnover
Our focus was on inferential analysis to help Citi Bike understand what could impact their business the most, instead of predictive analysis that involves some sophisticated models for high predicting accuracy. Now we are going to discuss the regression models first.
Linear regression (Lasso)
The first model we built was a linear model using Lasso regression. This graph shows the slopes of each feature at various levels of lambda (a parameter we tuned in Lasso Model). We used this graph for feature selection to avoid overfitting issues.
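To illustrate how Lasso coefficients shrink as the penalty grows, here is a small sketch with scikit-learn, which names the penalty strength `alpha` rather than lambda. The three synthetic features and the grid of penalty values are assumptions for demonstration only.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Synthetic data: feature 0 matters a lot, feature 1 a little, feature 2 not at all.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = 3.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(0, 0.5, size=500)

# Fit one Lasso model per penalty value and keep the fitted slopes.
alphas = [0.01, 0.1, 1.0, 10.0]
path = {}
for alpha in alphas:
    path[alpha] = Lasso(alpha=alpha).fit(X, y).coef_
    print(alpha, np.round(path[alpha], 2))
```

Features whose slopes reach zero at small penalty values contribute little and are candidates for removal, which is the sense in which the coefficient path supports feature selection.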
As we expected, the Lasso model performed poorly, with an R² of 0.33 on the test dataset. This is due to the non-linear relationship between the target and the features in our model. For example, the latitude of a station does not impact the bike checkout counts in a linear way (a one-unit change in latitude does not cause a fixed change in bike checkout counts). The graph below is a good example.
It appears that stations located at 40.75 N have the highest activity. This is the Mid Manhattan area, and the further away from Mid Manhattan regardless of North or South, the less activity the stations have. This behavior violates the assumption of a linear model, so the model did not have a good performance.
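A small synthetic example makes the point concrete: when activity peaks at one latitude and falls off on both sides, a straight line explains almost none of the variation. The bell-shaped activity curve and its width below are assumptions, not values estimated from the trip data.

```python
import numpy as np

# Synthetic stations spread evenly around 40.75 N, with activity peaking there.
latitude = np.linspace(40.65, 40.85, 81)
activity = 100 * np.exp(-((latitude - 40.75) / 0.04) ** 2)

# Best straight-line fit of activity on latitude.
slope, intercept = np.polyfit(latitude, activity, deg=1)
predicted = slope * latitude + intercept

# R^2 = 1 - (residual sum of squares / total sum of squares).
ss_res = np.sum((activity - predicted) ** 2)
ss_tot = np.sum((activity - activity.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print(f"R^2 of the linear fit: {r2:.4f}")
```

Because the increase south of the peak cancels the decrease north of it, the fitted line is essentially flat, even though latitude clearly determines activity in this toy setup.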
Tree-based regression (XGBoost)
After seeing the limitations of a linear model, we decided to train XGBoost, a tree-based model that can capture non-linear relationships between the target variable and the features. A nice property of tree-based models is that they output feature importances, which makes the result easier to interpret, as shown in the graph below.
From this graph, we learned that the most important features were cluster (which group the station belongs to) and commute (whether the trip takes place during a rush hour). Other important features include dock capacity, average temperature, hour of the day, location of the station, and day of the week. The XGBoost model did not deem weather conditions important, which seems to contradict what we observed earlier in our EDA.
The reason for the discrepancy is that our trip data were very unbalanced: observations with no weather condition far outnumbered observations with any weather condition, so the model could not capture the real impact of weather. This could be mitigated in future work by using hourly weather information instead of daily information. Overall we saw a big improvement in model performance, with R² rising from 0.33 (Lasso) to 0.80 (XGBoost).
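The sketch below shows the general idea of comparing a linear model with a boosted-tree model and reading off feature importances. It deliberately uses scikit-learn's `GradientBoostingRegressor` as a stand-in for XGBoost, and the three features and the target are synthetic assumptions, so the numbers it prints say nothing about the project's results.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data: checkouts peak at 40.75 N and rise during rush hours.
rng = np.random.default_rng(0)
n = 2000
latitude = rng.uniform(40.65, 40.85, n)
rush_hour = rng.integers(0, 2, n)
noise_feature = rng.normal(size=n)  # unrelated to the target
X = np.column_stack([latitude, rush_hour, noise_feature])
y = (20 * np.exp(-((latitude - 40.75) / 0.04) ** 2)
     + 5 * rush_hour + rng.normal(0, 1, n))

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Stand-in for XGBoost: scikit-learn's gradient-boosted trees.
linear = LinearRegression().fit(X_train, y_train)
boosted = GradientBoostingRegressor(random_state=0).fit(X_train, y_train)

linear_r2 = linear.score(X_test, y_test)
boosted_r2 = boosted.score(X_test, y_test)
importances = boosted.feature_importances_

print(f"linear R^2: {linear_r2:.2f}, boosted R^2: {boosted_r2:.2f}")
for name, value in zip(["latitude", "rush_hour", "noise"], importances):
    print(f"{name}: {value:.3f}")
```

Impurity-based importances like these indicate how much each feature was used to reduce error in the trees; they do not show the direction of an effect.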
Tree-based Classifier (Random Forest)
Having successfully identified the important factors that impact the ridership through machine learning regression models, we moved on to our next goal of predicting whether a station would be high or low turnover at certain conditions. We used the median percent usage (49%) as a cutoff for the binary classifier. The percent usage was defined as the total activity (inbound + outbound counts) as a percentage of the station’s capacity.
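The labeling rule can be sketched in a few lines. The station names and counts below are invented; only the definition of percent usage and the use of the median as the cutoff come from the text.

```python
from statistics import median

# Made-up records: (station, inbound count, outbound count, dock capacity).
stations = [
    ("A", 10, 14, 30),
    ("B", 3, 2, 25),
    ("C", 20, 25, 40),
    ("D", 6, 6, 24),
]

# Percent usage = total activity (inbound + outbound) as a percentage of capacity.
percent_usage = {
    name: 100 * (inbound + outbound) / capacity
    for name, inbound, outbound, capacity in stations
}

# Binary target: "high" turnover above the median usage, "low" otherwise.
cutoff = median(percent_usage.values())
labels = {
    name: ("high" if usage > cutoff else "low")
    for name, usage in percent_usage.items()
}

print(cutoff, labels)
```

Splitting at the median yields two classes of roughly equal size, which keeps the classification problem balanced.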
The graph below shows the feature importance score ranked by the Random Forest classifier model. We obtained similar feature importance as the previous XGBoost model for regression. Overall predicting accuracy on our test dataset was 79% using this model.
To summarize our study, we successfully identified the most important factors that impact bike ridership. They are:
- Which group does the station belong to (station cluster)?
- Travel hour (rush or non-rush)
A station that is closer to a transit or tourist hub will have significantly higher demand than a station in Brooklyn or Queens. Demand also fluctuates during the day. We identified the rush hours to be 8 - 9 am and 4 - 7 pm, and some areas could have supply and demand imbalances during these hours. Finally, temperature plays an important role in bike ridership: a one-degree Fahrenheit increase in temperature corresponds to roughly 760 more trips per day. Cycling in winter is significantly less popular than in summer, so winter maintenance is critical to Citi Bike's operation.
Recommendations and Future Works
Based on what we have learned from the resources we had, we would like to make a few recommendations to the Citi Bike team and list out our future study that could benefit Citi Bike.
Demand > Supply
We have seen that during rush hours, stations in certain areas (e.g. Alphabet City) are likely to have a shortage, i.e. the number of outbound bikes is much higher than the number of inbound bikes. We see an opportunity to improve this situation by developing a more efficient rebalancing strategy. That requires further study of the best rebalancing frequency and of where bikes should be moved from in order to satisfy demand while reducing hauling costs. In future work, we would also like to analyze the effectiveness of the current incentive-based rebalancing program and make recommendations.
Citi Bike could also build new stations in areas with low subway coverage, as we saw high demand in such areas, especially during rush hours. This requires some demographic study.
Demand < Supply
Stations in certain areas (e.g. Brooklyn) are less popular than stations in central Manhattan. For these areas, Citi Bike should focus on attracting new customers but at the same time retaining loyal customers.
Enhancing advertising might be useful. For example, Citi Bike can advertise in nearby subway stations to attract people who used to take subways. It could also cooperate with nearby companies, universities or apartment buildings by positioning Citi Bike as a time-saving, money-saving, environmentally friendly and healthy transportation.
Besides advertising, Citi Bike could also consider a more effective pricing strategy. Offering free trials or discounts might be effective in attracting new customers and retaining loyal ones. It could also use dynamic pricing similar to what Uber and Lyft are doing, for example by lowering the price in winter when demand is low. At the same time, it should consider the additional maintenance cost caused by bike usage in winter. This requires future study of consumer behavior in response to price changes.
Suggestions for Future Expansion
Finally, Citi Bike plans on doubling its current service area by adding 35 square miles and tripling the number of bikes to 40,000 by 2023. We see a need to expand the service in Queens, where there is a high population density but low subway coverage (only one subway line in the circled area). However, a thorough demographic study is required. Understanding who the target customers would be, where the needs are, and how to attract them will be the key to future success.
Thank you so much for reading about our capstone project! Our project work and presentation materials can be found at the GitHub repository. You can also find our Tableau dashboard for dynamic visualizations. Please feel free to reach out to us if you have any questions or would like to discuss our project further.