Data Insights on Citi Bike NYC Business
Over the past decade, the number and popularity of bike share programs have grown drastically, with over 207 million trips taken in the U.S. since 2010. Aside from being a cost-efficient and environmentally friendly mode of transportation for short work commutes or excursions for fun, bike sharing also offers riders the benefit of exercise.
However, from a bike share program operator’s perspective, there are many factors to consider to ensure the program operates smoothly and keeps growing. These include the rider experience, rider safety, bike availability, bike maintenance, bike theft and vandalism, trip pricing, and acquisition of program sponsors.
The objective of our project is to extract insights from Citi Bike NYC (hereafter referred to as Citi Bike) business operations (and other cities taken as reference) to inform the creation of a successful bike share program in another city. By doing so, a prospective company can optimize their business model upon deployment in a new city.
Data Processing and Validation
The main dataset used in this analysis was Citi Bike Trip Histories. Each row/observation represented a single trip. Variables included trip duration, start station, end station, user type, etc. In total there were 142 CSV files organized by year, month, and state (NYC or NJ). Analyzing all 99 million observations within a Jupyter notebook was deemed computationally infeasible due to the memory limitations of most personal computers. To overcome this limitation, we created two resources.
The first resource was a 5% random subsample of the entire dataset. This subsample allowed us to discover high-level trends throughout program history. For instance, with this resource we were able to answer the following: “Which hour of the day, day of the week, and month of the year are most popular for Citi Bike rides?”
The second resource was a SQL database. By connecting SQL with Jupyter Notebook, we were able to make targeted queries that extracted information from all 99 million observations. For instance, with a query we were able to answer the question “In the history of Citi Bike, how many people have ridden from station A to station B?” In contrast, if we were only using the 5% random subsample we would have only been able to answer “In the history of Citi Bike, what proportion of rides have gone from station A to station B?”
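As a rough illustration of the kind of targeted query described above, here is a minimal sketch using Python's built-in sqlite3 module with an in-memory database. The table name `trips`, the column names, and the sample rows are assumptions made for this example; the database engine and schema we actually used are not described in this post.

```python
import sqlite3

# Hypothetical schema and rows; the real trip table has many more columns.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trips (start_station TEXT, end_station TEXT)")
conn.executemany(
    "INSERT INTO trips VALUES (?, ?)",
    [("A", "B"), ("A", "B"), ("A", "C"), ("B", "A"), ("C", "B")],
)

def count_trips(conn, start, end):
    """Return the exact number of trips from one station to another."""
    row = conn.execute(
        "SELECT COUNT(*) FROM trips WHERE start_station = ? AND end_station = ?",
        (start, end),
    ).fetchone()
    return row[0]

a_to_b = count_trips(conn, "A", "B")
print(a_to_b)
```

Because the count runs inside the database, only the single aggregated number is loaded into the notebook, which is what makes querying the full dataset practical.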
The 5% random subsample was validated by comparing the sampled values of imbalanced features against their true values in the full dataset. For example, let's assume that from a SQL query we determine 12% of all rides start from the Times Square docking station.
In one random subsample we find that 11.8% of all rides start from Times Square. In the next random subsample we find that 12.5% of all rides start from Times Square. This process was repeated 40 times, and the results for each station were plotted on a histogram to see whether the sampled values (11.8%, 12.5%, …) accurately depicted the true value of the imbalanced feature (12%). Upon experimentation with multiple subsample percentages (1%, 5%, 10%), we found the 5% subsample to be representative of the full dataset.
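The validation loop described above can be sketched as follows. This is a simplified illustration on synthetic data, not our original notebook code: the function names, the fixed seed, and the 100,000-row synthetic dataset are all assumptions for the example.

```python
import random

def station_share(trips, station):
    """Proportion of trips that start at the given station."""
    return sum(1 for s in trips if s == station) / len(trips)

def subsample_shares(trips, station, fraction=0.05, repeats=40, seed=0):
    """Draw `repeats` random subsamples and return the station share in each."""
    rng = random.Random(seed)
    k = int(len(trips) * fraction)
    return [station_share(rng.sample(trips, k), station) for _ in range(repeats)]

# Synthetic start stations: 12% of trips begin at "Times Square".
trips = ["Times Square"] * 12_000 + ["Other"] * 88_000
true_share = station_share(trips, "Times Square")
shares = subsample_shares(trips, "Times Square")
mean_share = sum(shares) / len(shares)
print(f"true={true_share:.3f} mean of subsamples={mean_share:.3f}")
```

Plotting `shares` as a histogram and checking how tightly it clusters around `true_share` is the comparison we made for each station.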
Exploratory Data Analysis
Below are some key points about the Citi Bike NYC program and our takeaways. In certain cases, we were also interested to see how Citi Bike NYC compared with other large city bike share programs. We compared it with the bike share programs in San Francisco and Washington DC (hereafter referred to as SF and DC respectively). By determining consistent trends among multiple cities, bike share program creators can have added confidence in their marketing and operational strategies.
Citi Bike divides its customers into two segments. Subscribers are users who buy a Citi Bike annual membership, whereas Customers are users who pay for a Single Ride, Day Pass or 3-Day Pass.
Footnote: More information on their pricing can be found here.
As we visualize below, Citi Bike and the bike share programs in SF and DC, SFMTA Bikeshare and Capital Bikeshare respectively, are all mainly composed of rides from subscribers, accounting for around 80% of total rides taken. Variations in subscriber-customer ratios could be due to differing population densities in NYC, SF, and DC (27K, 17K, and 10K people/square-mile respectively). A city with higher population density may lend itself more towards bike and foot traffic than a sparsely populated city which necessitates automotive transportation. Variations aside, the majority of bike rides come from subscribers.
The majority of NYC, SF, and DC bike share users are city residents who anticipate/believe that they will bike on a consistent basis.
Hourly demand Data
Based on the hour of day, we see that ridership is highest between 7 and 9am, while the evening peak is from 5 to 6pm. Upon further analysis, we found that AM ridership is concentrated in the Financial District / Midtown area, and a large increase in demand can be observed in Central Park between 4 and 8pm.
Additionally, there is a difference between weekdays and weekends: ridership peaks in the afternoon on weekends, whereas on weekdays it peaks during commuter rush hours.
We can deduce that the increase in ridership during weekdays is due to the bikes’ use by commuters, while for the weekends bikes are used quite evenly throughout the afternoon.
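For readers who want to reproduce this kind of breakdown, the weekday versus weekend hourly counts can be sketched as below. The timestamp format and the sample start times are assumptions for illustration only.

```python
from collections import Counter
from datetime import datetime

def hourly_counts(start_times):
    """Count rides per hour of day, split into weekday and weekend."""
    weekday, weekend = Counter(), Counter()
    for ts in start_times:
        dt = datetime.strptime(ts, "%Y-%m-%d %H:%M:%S")
        # datetime.weekday(): Monday is 0, so 5 and 6 are Saturday and Sunday.
        bucket = weekend if dt.weekday() >= 5 else weekday
        bucket[dt.hour] += 1
    return weekday, weekend

# Made-up start times; 2019-06-03 is a Monday and 2019-06-08 is a Saturday.
sample = [
    "2019-06-03 08:15:00",
    "2019-06-03 08:40:00",
    "2019-06-03 17:30:00",
    "2019-06-08 14:05:00",
    "2019-06-08 14:50:00",
]
weekday, weekend = hourly_counts(sample)
print(weekday.most_common(1), weekend.most_common(1))
```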
Monthly demand Data
Throughout the years, there is a consistent increase in Citi Bike usage between April and October, which corresponds to the seasonality of weather in New York. Ridership is positively correlated with temperature, as people are less apt to bike during the cold months of winter from December through March than in spring and summer.
Looking at the breakdown of users by month, we also see a similar trend in the users of the Citi Bike program. In particular, the number of Customers (defined above) increases as New York starts seeing the warmer weather between April and October.
There is a seasonality inherent to the bike share operations, in that during the warmer months, riding bikes is safer and more enjoyable, and thus more users (both Subscribers and Customers) utilize the bike share program.
The increase of Customers during the warmer months is likely due to tourists who visit New York and utilize Citi Bike during their stay.
Temperature vs. Fleet Size
Supplementing the seasonality of demand for usage of Citi Bike, we see that there is a positive correlation between temperature and Citi Bike fleet size (i.e. both variables moving in tandem). Citi Bike fleet size was defined as the total number of unique bike IDs in a given month divided by the maximum number of unique bike IDs in the prior 3 years. As temperature drops, bike operators have historically moved bikes from the streets to storage, in order to reduce damage (such as rust, or vandalism) and idling of unutilized bikes.
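The fleet size definition above can be written out as a short sketch. Here "the prior 3 years" is interpreted as the preceding 36 monthly observations, which is an assumption of this example, and the monthly counts are made up.

```python
def fleet_size_ratio(monthly_unique_bikes, window=36):
    """Unique bikes per month divided by the max monthly count in the prior window.

    `monthly_unique_bikes` is a list of (month, unique_bike_count) pairs in
    chronological order. The first month has no prior history and is skipped.
    """
    ratios = {}
    for i, (month, count) in enumerate(monthly_unique_bikes):
        prior = [c for _, c in monthly_unique_bikes[max(0, i - window):i]]
        if prior:
            ratios[month] = count / max(prior)
    return ratios

# Made-up counts of distinct bike IDs seen in each month.
counts = [("2018-09", 10_000), ("2018-10", 9_800), ("2018-12", 8_500), ("2019-01", 8_400)]
ratios = fleet_size_ratio(counts)
print(ratios)
```

A ratio near 1 means nearly the whole fleet is on the street, while a winter value of 0.85 would correspond to roughly 15% of bikes being out of circulation.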
In NYC, Citi Bike has historically moved ~15% of their bikes into storage each winter.
Comparing the density of bike trip durations in NYC, SF and DC, we see that the distribution is strongly right-skewed, with the majority of trips under 20 minutes. SF has a higher concentration of short trips than DC. This could be due to the difference in terrain. SF is rated as the 6th most hilly city in the nation.
In contrast, DC and NYC are the 49th and 58th most hilly cities respectively. More likely, though, the differences in SF are due to the small-scale nature of the governmental bike share program. Initially there were only 350 bikes and 35 stations, and a visual inspection shows many of these stations are in close proximity to each other. The trip durations in SF may be less indicative of local users/terrain and more indicative of the scale of the business.
Given this rider behavior, the takeaway for operators is to keep the maximum distance between stations to around a 10-15 minute ride.
Bike rebalancing is a process by which bikes are moved from docking stations with surplus to docking stations with shortages to meet anticipated bike demand. Looking at the total bike rides over the years compared to the rebalanced number of bikes, rebalancing is on average around 7% of total bike rides. In other words, for every 100 bike rides that occur, 7 bikes are rebalanced to an alternate docking station.
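This post does not spell out how rebalanced bikes were counted. One common way to infer rebalancing from trip histories, assumed here purely for illustration, is to flag a bike whenever its next trip starts at a different station from where its previous trip ended. The tuple layout and sample trips below are hypothetical.

```python
from collections import defaultdict

def count_rebalanced(trips):
    """Count trips whose bike starts somewhere other than where it last ended.

    `trips` is a list of (bike_id, start_time, start_station, end_station)
    tuples. A mismatch implies the bike was moved between rides.
    """
    by_bike = defaultdict(list)
    for bike_id, start_time, start_station, end_station in trips:
        by_bike[bike_id].append((start_time, start_station, end_station))

    rebalanced = 0
    for rides in by_bike.values():
        rides.sort()  # chronological order per bike
        for (_, _, prev_end), (_, next_start, _) in zip(rides, rides[1:]):
            if next_start != prev_end:
                rebalanced += 1
    return rebalanced

# Made-up trips: bike 1 ends at C but next starts at D, so it was moved once.
trips = [
    (1, 1, "A", "B"),
    (1, 2, "B", "C"),
    (1, 3, "D", "A"),
    (2, 1, "C", "C"),
    (2, 2, "C", "A"),
]
moved = count_rebalanced(trips)
print(f"{moved} rebalanced out of {len(trips)} rides ({moved / len(trips):.0%})")
```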
Through our data exploration, we found that 97.7% of rides do not end at the station where they started. Additionally, the top 10% most popular docking stations accounted for 45% of all bike trips.
Rebalancing is one of the major operational considerations in setting up a bike share program.
Click into our Shiny App here to visualize the bike availability throughout different times of the day.
Citi Bike has five different sources of revenue, with annual membership, sponsorship, and casual membership being the three most important. Together these three categories made up over 85% of Citi Bike’s total revenue in 2019.
The largest source of revenue in recent years has been the annual membership: of the $46.7 million generated by these three sources (in 2019), annual memberships accounted for $24.7 million, or 53%. While this source of revenue has climbed upwards from 2015 to 2017, it has plateaued for the past three years.
The second most important source is sponsorship money. Citi Bike has a total of 9 sponsors, and, as indicated by the name, Citibank is the largest. Over the years, sponsors have, on average, contributed $12.1 million per year to Citi Bike’s annual revenue. In 2019, the revenue decreased due to a dip in sponsorship income.
Finally, there is casual membership, which grew by 29% from 2018 to 2019 and is approaching the average annual value of sponsorship. Even though casual membership makes up a small portion of total Citi Bike rides, this is a very profitable part of their business considering the high revenue per ride. With greater demand for casual membership, especially in the summer months, this is a part of Citi Bike’s business model that should not be overlooked.
Annual membership makes up the majority of Citi Bike’s revenue, although in recent years its growth has begun to plateau.
Data Modeling and Results
Time Series Forecasting
Next we forecasted Citi Bike’s revenue and ridership, using an ARIMA model, in order to benchmark Citi Bike’s performance during the COVID-19 lockdown. As you can see in the graphs below, both the ridership (top left), and revenue (top right) follow a seasonal trend. Every year, demand peaks in the summer and early fall, drops steeply in the winter, and picks back up in the spring. Each year this cycle repeats itself.
What is interesting about the daily ridership is that aside from the larger more visible annual seasonality, there is also a smaller seasonal trend per week (observed through ACF and PACF plots). Citi Bike demand is higher on weekdays than weekends. However, it was difficult to fit a good model to take into account the larger seasonality along with the smaller seasonality, so we decided to aggregate daily ridership by month to build a monthly-ridership seasonal ARIMA model for our analytical purposes.
Because of the COVID-19 disruption to business, we excluded ridership data from March 2020 onward, as the lockdown began in mid-March and bike usage dropped immediately. For revenue, we excluded only April 2020, as there is an apparent lag in revenue reporting.
Next, we split the data for training and testing: the last 12 months of each set were used as testing data, and all prior data was used for training. We were able to fit two time series models that gave MAPEs of 13.9% and 10% for revenue and monthly rides, respectively. The two plots on the second row above illustrate the projected revenue and monthly ridership for the following 12 months, with 80% and 95% confidence intervals.
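The evaluation step described above (hold out the last 12 months, then score with MAPE) can be sketched as follows. Note that this sketch does not reproduce our seasonal ARIMA model: a seasonal-naive forecast, which simply repeats the value from 12 months earlier, stands in for it so the example stays self-contained. The function names and the ridership numbers are made up.

```python
def train_test_split_last(series, test_size=12):
    """Hold out the final `test_size` observations for testing."""
    return series[:-test_size], series[-test_size:]

def seasonal_naive_forecast(train, horizon, season=12):
    """Forecast each month as the value observed one season earlier."""
    return [train[len(train) - season + (h % season)] for h in range(horizon)]

def mape(actual, predicted):
    """Mean absolute percentage error, in percent."""
    return 100 * sum(abs(a - p) / abs(a) for a, p in zip(actual, predicted)) / len(actual)

# Made-up monthly ridership: the second year repeats the first, 10% higher.
year1 = [50, 55, 80, 120, 160, 190, 200, 195, 180, 150, 100, 60]
series = year1 + [v * 1.1 for v in year1]
train, test = train_test_split_last(series)
forecast = seasonal_naive_forecast(train, horizon=len(test))
error = mape(test, forecast)
print(f"MAPE: {error:.1f}%")
```

A seasonal-naive forecast like this is also a useful baseline: a fitted model is only worth keeping if its MAPE on the held-out months beats it.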
Projected Revenue Data
Next we look at 2020 projected revenue, in comparison to 2019 revenue and the actual 2020 revenue observed for the month of April. Based on our ARIMA model, we were expecting Citi Bike to receive revenue comparable to last year, with growth in revenue starting in June. In reality, we see that actual revenue is about half a million dollars less than our model forecast.
It is important to keep in mind that April 2020’s $2.8 million is still comparable to Citi Bike’s April 2018 revenue, just two years ago, when there was not a global pandemic. Considering the entire city of New York was under lockdown and that many were no longer commuting to work, this is actually very impressive. It is fair to conclude that Citi Bike is still able to remain a fully operating business, despite the citywide lockdown.
Changes from April to May 2020
While we don’t have May revenue data to confirm that Citi Bike is on its way back to pre-pandemic business levels, we can observe May ridership data. Again, looking at the projections, actual ridership for March, April, and May falls short of what we projected.
When we look at the change in demand from April to May 2020, however, we see that demand has more than doubled (number of rides increased by 117%), in contrast to an 8.5% increase in ridership from April to May in 2019. Even though April and May ridership in 2020 still fall short compared to last year, demand is rapidly growing to meet pre-pandemic levels. Because of this, we believe that Citi Bike is on its way to returning to business as expected. Based on the evidence, we can conclude that COVID-19 has not had a significant impact on Citi Bike and its current business model thus far.
Sponsorship vs. Advertising
Citibank and other organizations have on average contributed $12.1 million per year to Citi Bike. As mentioned above, these sponsorships are the second largest revenue source for the business. Despite the tremendous value of these sponsorships, the year-to-year contributions of sponsors can be quite erratic (standard deviation of $2.3 million).
From 2018 to 2019 all revenue streams grew except for sponsorship. Sponsorship decreased by $5.2 million and drove the overall business growth rate from 9.8% in 2018 to -2.8% in 2019. In short, sponsorship has provided a large amount of revenue for Citi Bike but has been irregular from year to year.
Our group wanted to investigate whether sacrificing sponsors for general advertising on the bikes (e.g. Broadway musical ads, restaurant ads, etc.) would provide more revenue and greater consistency in year to year financial reporting. To do so, we divided sponsorship by the total number of rides in a given year ($/ride). This Annual Sponsorship per Ride is a proxy for the cost of putting the Citibank logo on each bike. Throughout the years, Citibank and other sponsors have on average paid $1.03 per ride (see figure below).
Two findings emerge from the graph above. First, advertising has become cheaper for Citibank over the years. This is predominantly due to an increase in riders and the popularity of the program. Secondly, if a bike share program was going to start in another city it seems that sponsorship ranging from $0.5 - $2 per ride is the norm. Program creators could request funding from prospective sponsors accordingly.
Advertising Relative Data Comparison
How does the average Annual Sponsorship per Ride of $1.03 compare to traditional advertisements? In other words, is the advertising cost on Citi Bikes currently higher or lower than the going rate for billboards or taxi cab advertisements? One estimate of taxi cab advertising claimed that on any given day, taxi-top advertisements cost $1.75 per 1000 impressions.
We asked ourselves, “On a single Citi Bike ride how many people need to see the bike advertisement for the value of advertising to equate that of the taxi?” The answer was “On each Citi Bike ride, 590 onlookers are necessary for sponsorship to be just as cheap as taxi cab advertising.” To interpret this result let us consider two fictional scenarios.
Cost per Ride
Current average sponsorship: $1.03 / ride
Fictional scenario 1: $1,750 / ride
Fictional scenario 2: $0.0175 / ride
Scenarios and Recommendations
The first fictional scenario involves Citibank sponsoring an inordinate amount of money per ride ($1,750). In order for sponsorship to be just as cheap as taxi cab advertising, 1 million people would need to see the advertisement on a single ride. This is an unrealistic expectation.
If Citi Bike were receiving sponsorships of this magnitude, our recommendation would be to stay with the sponsors because that deal cannot be met on the open market. In contrast, if sponsors were giving $0.0175 per ride, only 10 people would need to see the advertisement on a single ride for the two advertising types to be of equal value. If Citi Bike were receiving sponsorships of this magnitude, our recommendation would be to switch to the open market because you will likely receive more money there. In reality, the median trip duration is 20 minutes, and it is unlikely that 590 people will view the bike during that time.
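The break-even figures quoted in these scenarios follow from dividing the sponsorship cost per ride by the taxi-top cost per impression. The short sketch below reproduces that arithmetic; the function and constant names are ours, chosen only for this example.

```python
TAXI_CPM = 1.75  # dollars per 1,000 impressions, the taxi-top estimate quoted above

def break_even_impressions(cost_per_ride, cpm=TAXI_CPM):
    """Impressions per ride at which a per-ride cost matches the given CPM."""
    return cost_per_ride * 1000 / cpm

for label, cost in [("current average", 1.03), ("scenario 1", 1750), ("scenario 2", 0.0175)]:
    print(f"{label}: {break_even_impressions(cost):,.0f} impressions per ride")
```

For the current average of $1.03 per ride this gives roughly 589 impressions, which we round to 590 in the text; the two fictional scenarios give 1 million and 10 impressions respectively.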
Currently, we recommend that Citi Bike stay with their sponsors because these contributions likely outcompete the market value of advertising. We also recommend that Citi Bike study the number of impressions made on an average bike ride. With this value in hand, they can either switch to open market advertising or encourage current sponsors to give at a flat rate per ride so that advertising does not continue to become cheaper for the sponsors.
Conclusions and Business Recommendations
Our exploratory data analysis and modelling uncovered transferable business insights which can be organized into three categories: sponsorship, operations, and revenue.
With respect to sponsorships, we found that Citi Bike annually generated $12 million from their sponsors. The sponsorship cost per ride has decreased from $2 to $0.5. If sponsorship costs per ride continue to decrease, we would advise Citi Bike to study the number of Citi Bike impressions made on a given ride. With this value, Citi Bike can determine when it would become more profitable to sacrifice Citi Bike sponsors for open market advertising.
As temperature drops, Citi Bike operators have historically moved bikes from the streets to storage, in order to reduce damage (such as rust, or vandalism) and idling of unutilized bikes. From our analysis, it appears that Citi Bike moves approximately 15% of their fleet into storage each winter. Associated transportation and storage costs should be considered when implementing a bike share program in a similar climate.
Additionally, bike rebalancing is a necessary process to counteract the unsteady flow of bikes to and from various docking stations. We found that for every 100 bike rides that occur, Citi Bike operators rebalance 7. Associated labor and transportation costs should be considered when implementing a docking station bike share model in any city.
Over the course of Citi Bike’s program history, annual subscriptions have accounted for 42% of total revenue. From 2017 - 2019 annual subscriptions have flatlined with a growth rate of approximately zero (-0.2%). Assuming that this plateau of subscriptions is nationally applicable, we recommend that bike share program managers decrease the cost of casual ridership to attract new customers into the subscription base.
Additionally, through comparative analysis of NYC, SF, and DC, we found that the majority of rides are under 20 minutes. Knowing this, program managers could offer generous subscription rules (unlimited rides less than 3 hours) while knowing that very few people will actually take advantage of these benefits.
Additionally, Citi Bike has not been entirely immune to COVID-19 financial impacts. Despite low April 2020 revenue in comparison to projected values, the $2.8 million observed is still comparable to Citi Bike’s April 2018 revenue. There also appears to be a nation-wide increase in demand for bikes since the start of the pandemic, as people look for socially distanced alternatives to traditional modes of public transportation.
In light of this, it is fair to conclude that Citi Bike is still able to remain a fully operating business, despite the state mandated lockdown. Assuming similar levels of population density, other bike share program managers can expect the continuation of a viable pandemic business model as people move from crowded public transit to comparatively safe bike transit.
Thank you for taking the time to read our blog post!
We are the Data Science team for Schwinn. Our Git repository can be found here. Please click on our individual blog links below to reach out or to read other work we’ve done.
Jessie Wang - www.linkedin.com/in/jwangxin
Alex Tin - www.linkedin.com/in/alexandertin
Michael Link - www.linkedin.com/in/data-science-link
Catherine Tang Kim Sin - www.linkedin.com/in/catherine-nicole-tang-kim-sin-a25a64192
Featured Image - Photo by Anthony Fomin on Unsplash