Where Did All the Bikes Go? An Analysis of NYC Citi Bike Station Capacity
I have been fascinated by the number of people riding NYC Citi Bikes to and from work. I’ve asked a few Citi Bike subscribers about their experience, and yes, some of their stories include dealing with a pothole or two. What interested me most, though, was the data logistics. How are the stations replenished, and how do users typically deal with bike capacity constraints when they really need a bike? Do they go to a nearby station, or hail a cab?
My second project at the NYC Data Science Academy boot camp (build a web-based product addressing a business problem using Shiny and RStudio) gave me an incentive to research these questions further and work towards an interactive web-based solution. My cohort peer James Lee told me stories of riding a Citi Bike during his undergrad days, when he had to get to a station at certain times (before classes finished) to ensure that he would get a bike. Otherwise, he would be left to wander on to the next station to find one.
My experience with the NYC Citi Bike Station Map website led me to think about how users determine whether there will be enough bikes or docks at a given station or destination. And would a ride from point A to point B result in a surcharge if the ride duration went above the time included in the purchased pricing model (30 mins for day riders and 45 mins for subscribers)? The website's locate feature allows you to look up one address at a time, which results in a popup on the map displaying the number of available bikes and docks. You can filter on bikes or docks to reduce the number of markers on the map.
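The surcharge question comes down to simple arithmetic on the trip duration. Here is a minimal sketch, written in Python rather than the R used for the app, that uses only the 30 and 45 minute limits quoted above. The function names and the rider type labels are hypothetical, and actual surcharge amounts are deliberately left out because they are not covered here.

```python
# Included ride time in minutes per rider type, using the limits quoted above.
# The keys "day" and "subscriber" are labels chosen for this sketch.
INCLUDED_MINUTES = {"day": 30, "subscriber": 45}


def overage_minutes(trip_minutes: float, rider_type: str) -> float:
    """Return how many minutes a trip runs past the included time (0 if none)."""
    included = INCLUDED_MINUTES[rider_type]
    return max(0.0, trip_minutes - included)


def incurs_surcharge(trip_minutes: float, rider_type: str) -> bool:
    """A trip incurs a surcharge only if it goes above the included time."""
    return overage_minutes(trip_minutes, rider_type) > 0


if __name__ == "__main__":
    # The same 38 minute trip affects a day rider and a subscriber differently.
    for rider in ("day", "subscriber"):
        print(rider, overage_minutes(38, rider), incurs_surcharge(38, rider))
```

A trip that lasts exactly the included time is treated as having no overage, matching the "go above" wording.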
However, I felt these features were limited. Being able to check bike / dock capacity at both the starting station and the destination, see historical capacity trends during peak times, and factor in the trip duration would improve a user's ability to plan ahead. I spoke with a few members of my cohort, my instructor Chris Makris and a small circle of friends to confirm that the pursuit of a solution for this case would be worthwhile. So I went forward to build a Shiny app with these features, which can be found here.
My original goal for the project was to analyze historical ridership data, provided by NYC Citi Bike through their open data portal. However, my cohort peer, Joshua Litven, referred me to an article providing a detailed analysis of NYC Citi Bike ridership. The article had been mentioned by Andy Eschbacher of Carto, who was a guest speaker at the Academy the week prior to the project assignment. The analysis by Todd W. Schneider covered ridership trends including peak hours, months, age and sex. Todd’s analysis gave me the opportunity to move straight on to the product development stage, using his findings to shape a prototype dashboard that would allow users to track stations meeting their capacity requirements, alongside trip duration and cycling path.
The Shiny project technology scope was limited to RStudio and Shiny Dashboard for the development of a web-based solution within a two-week timeframe. Given the time constraint, I limited my exploration of options to popular libraries that also provided examples related to my project scope. My final working prototype met the project requirements by integrating multiple R libraries and supporting services, such as:
- Shiny Dashboard for the user interface
- Leaflet for interactive maps
- Google’s gMap for mapping user-entered addresses to geospatial locations and googleVis for charting gauges representing bike station capacity
- RJSONIO for parsing GeoJSON data fetched via cURL from the NYC Citi Bike and Mapbox REST APIs
- GitHub repository for source code version control
- shinyapps.io by RStudio as a PaaS (Platform as a Service) solution, for a smooth change release process from my RStudio IDE running on a MacBook Pro to Docker containers that include all the required libraries
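As a rough illustration of the JSON parsing, capacity gauge and capacity filter pieces in the list above, here is a minimal sketch written in Python rather than the R used for the app. The sample payload and its field names (`name`, `available_bikes`, `available_docks`, `total_docks`) are assumptions made up for this example, not the schema of the Citi Bike feed, and no network call is made.

```python
import json

# Hypothetical sample payload. The structure and field names are assumptions
# for illustration only, not the real feed schema.
SAMPLE_FEED = """
{"stations": [
  {"name": "Station A", "available_bikes": 3,  "available_docks": 24, "total_docks": 27},
  {"name": "Station B", "available_bikes": 15, "available_docks": 15, "total_docks": 30},
  {"name": "Station C", "available_bikes": 0,  "available_docks": 0,  "total_docks": 0}
]}
"""


def load_stations(raw: str) -> list:
    """Parse the raw JSON text and return the list of station records."""
    return json.loads(raw)["stations"]


def bike_fill_percent(station: dict) -> float:
    """Share of docks holding an available bike, as a 0-100 gauge value."""
    total = station["total_docks"]
    if total == 0:
        # Avoid dividing by zero for a station reporting no docks.
        return 0.0
    return round(100.0 * station["available_bikes"] / total, 1)


def filter_by_bikes(stations: list, min_bikes: int) -> list:
    """Keep only the stations with at least min_bikes available bikes."""
    return [s for s in stations if s["available_bikes"] >= min_bikes]


if __name__ == "__main__":
    stations = load_stations(SAMPLE_FEED)
    for s in stations:
        print(s["name"], bike_fill_percent(s))
    print([s["name"] for s in filter_by_bikes(stations, 5)])
```

The same gauge calculation works for docks by swapping `available_bikes` for `available_docks`.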
The shinyapps.io free subscription service model limits you to five Shiny applications with 25 active hours per month. This option made it possible for change releases (as part of the Software Delivery Life Cycle) to be visible to users on the public internet. A small test group outside of the cohort was able to validate the product and offer immediate feedback. In its present state, Shiny is robust for rapid prototyping of ideas, with options to scale through a simplified model. However, I found that the programming syntax needs further simplification to reduce the time spent debugging syntax errors. I believe this would further reduce the learning curve, making Shiny a great alternative for rapid prototyping of web-based solutions to data-related problems.
I combined methodologies from Lean Startup and Project Management, Agile Development principles and DevOps practices to work through rapid change iterations, with a focus on hypothesis testing and user feedback and on a fast transfer of technical knowledge from the collective expertise of cohort members and external advisors. As a one-person team, I wore multiple hats to bring the prototype from concept to fruition: Market Research, Business Analyst, Product Manager, Project Manager, Developer, Production / Customer Support, Sales.
As a result, I devoted the first week to researching geographic information systems and dynamic map rendering APIs (Application Programming Interfaces), investing time in iterations of trial and error until I found a solution that worked with my recently acquired knowledge of R programming. I devoted the second week to rapid iterations of development and releases to allow for end user testing and feedback. This method was crucial to keeping me focused on developing a product that met the initial requirements while staying flexible enough to align with end users' expectations.
My approach to dealing with setbacks and limitations, after exhausting the collective knowledge pool, was to park bugs / requests in a 'follow-up' / 'would be nice feature' checklist so that I wouldn't get stuck resolving an issue or perfecting an implementation. To paraphrase my instructor Chris Makris: 'Park the pie in the sky feature requests for a future iteration and remain focused on the core needs of the application to meet the timeline.' This was sound advice that I kept close to heart.
The overall process has been an amazing experience for me. My aim was to deliver a primitive working prototype. Given the two-week timeline, I had to park the historical peak trend analysis feature, but I am amazed by how much geospatial and programming knowledge I gained, and by the number of features I was able to integrate into the final prototype based on user feedback.
I will not hide that there were many moments of failure and frustration along the way (e.g. endless troubleshooting of routing coordinates rendering as a polygon shape instead of a straight path on the map). However, my family, friends, instructors, TAs and cohort peers continued to provide me with the support, guidance and encouragement to see the project through, with the result being a working prototype on the internet to aid users in their research / planning. This type of environment was the deciding factor that led me to partake in the immersive program.
Who would have thought that by the end of my first month of training at the NYC Data Science Academy boot camp I would have produced a functional web-based prototype that addresses a business problem? The training curriculum, cohort collaboration and existing industry research in the public domain, together with my commitment to and persistence in following best practices, allowed me to step outside of my comfort zone and achieve a working prototype that includes the following features:
- Locate and view capacity for multiple stations
- View travel time between two stations with cycling directions
- Filter on Bike / Dock capacity
- Pin map to your location
- Gauge Bike / Dock capacity by station
- Query NYC Citi Bike station data
Next steps and possible enhancements:
- Engage NYC Citi Bike team to assess the working prototype as a proof of concept for integrating the features into their product development pipeline
- Update the bike instructions time reference to reflect hh:mm when duration is over 60 mins (would be nice)
- Schedule automatic data refresh, independent of user updates
- Notify users when capacity for tracked stations changes
- Provide users with alternative map tile options
- Allow the end user to enter multiple routing points when mapping their destination
- Provide historical trend analysis on station capacity during peak times (if there is user demand for this feature)
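The hh:mm item in the list above could be handled with a small formatting helper. Here is a minimal sketch, written in Python rather than the R used for the app. The function name and the "N min" wording used for trips of 60 minutes or less are assumptions for illustration, since the current display format of the app is not described here.

```python
def format_duration(total_minutes: int) -> str:
    """Format a trip duration, switching to hh:mm once it is over 60 minutes."""
    if total_minutes < 0:
        raise ValueError("duration cannot be negative")
    if total_minutes <= 60:
        # Short trips keep a plain minute count (assumed wording).
        return f"{total_minutes} min"
    # Split into whole hours and remaining minutes, zero-padded to two digits.
    hours, minutes = divmod(total_minutes, 60)
    return f"{hours:02d}:{minutes:02d}"


if __name__ == "__main__":
    for minutes in (25, 60, 75, 135):
        print(minutes, "->", format_duration(minutes))
```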
The following people deserve recognition for the support provided to me during the rapid product development process:
- Cohort Peers: Joshua Litven, Frederic Cheung, Chris Valle, Conred Wang, James Lee
- Cohort Instructors / TAs: Christopher Peter Makris, Shu Yan, Zheyu (Sammy) Zhang
- Family / Friends: Yasmin Regalado, Jeffrey Regalado, Carlos Peguero, Cris Macario, Andy Eschbacher, Alexander Ryzhkov, Nelo Fabrizi
- Anne S of Citi Bike Customer Service for granting me permission to use their Open Data for this purpose