Using NYC Citi Bike Data to Help Bikers Find their Mates

Posted on Apr 26, 2017

There is no shortage of analyses of the NYC bike share system. Most of them aim to predict the demand for bikes and to balance bike stock, i.e., forecasting when to remove bikes from fully occupied stations and when to refill stations before the supply runs dry.

 

This is why I decided to take a different approach and use the Citi Bike data to help riders find each other; a kind of Tinder for bike riders...

If you want to skip the analysis, you can check out the app here.

 

The Challenge

As a bike enthusiast, I wish I had a platform where I could spot like-minded people who actually ride a bike (and don't just pretend to).

The goal of this project was to turn the Citi Bike data into an app where a rider could identify the best spots and times to meet other Citi Bike users and cyclists in general.

 

 

 

 

The Data

As of March 31, 2016, the total number of annual subscribers to Citi Bike was 163,865, and its riders took an average of 38,491 rides per day in 2016 (source: Wikipedia).

That adds up to more than 14 million rides in 2016!

I used the Citi Bike data for the month of May 2016 (approximately 1 million observations). Citi Bike provides the following variables:

  • Trip duration (in seconds).
  • Timestamps for when the trip started and ended.
  • Station locations for where the trip started and ended (both the names and coordinates).
  • Rider’s gender and birth year - this is the only demographic data we have.
  • Rider’s plan (annual subscriber, 7-day pass user or 1-day pass user).
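To make these variables concrete, here is a minimal Python sketch that parses trip rows into typed records. The column names (`tripduration`, `starttime`, `birth year`, and so on) and the timestamp format are assumptions about the CSV export and should be checked against the actual file. The two rows are made up for illustration.

```python
import csv
import io
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

# Assumed timestamp layout; adjust if the real file differs.
TIME_FORMAT = "%m/%d/%Y %H:%M:%S"


@dataclass
class Trip:
    duration_s: int            # trip duration in seconds
    started: datetime
    stopped: datetime
    start_station: str
    end_station: str
    user_type: str             # rider's plan, as stored in the file
    birth_year: Optional[int]  # may be missing for some riders
    gender: str                # raw code, as stored in the file


def parse_trip(row: dict) -> Trip:
    """Convert one CSV row (all strings) into a typed Trip."""
    birth = row["birth year"].strip()
    return Trip(
        duration_s=int(row["tripduration"]),
        started=datetime.strptime(row["starttime"], TIME_FORMAT),
        stopped=datetime.strptime(row["stoptime"], TIME_FORMAT),
        start_station=row["start station name"],
        end_station=row["end station name"],
        user_type=row["usertype"],
        birth_year=int(float(birth)) if birth else None,
        gender=row["gender"],
    )


# Two made-up rows, not real Citi Bike records.
SAMPLE = """tripduration,starttime,stoptime,start station name,end station name,usertype,birth year,gender
600,5/2/2016 08:15:00,5/2/2016 08:25:00,Station A,Station B,Subscriber,1985,1
1500,5/7/2016 13:05:00,5/7/2016 13:30:00,Station B,Station C,Customer,,0
"""

trips = [parse_trip(row) for row in csv.DictReader(io.StringIO(SAMPLE))]
```

Keeping the birth year optional matters because a missing value would otherwise break every age calculation further down.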

 

Data on Riders per Age Group

Before moving ahead with building the app, I was interested in exploring the data and identifying patterns in relation to gender, age and day of the week. Answering the following questions helped identify which variables influence how riders use the Citi Bike system and form better features for the app:

  • Who are the primary users of Citi Bike?
  • What is the median age per Citi Bike station?
  • How do the days of the week impact biking behaviours?

As I expected, based on my daily rides from Queens to Manhattan, 75% of the Citi Bike trips are taken by males. The primary users are 25 to 34 years old.


Riders per Age Group
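A breakdown like this one can be sketched in a few lines of Python. This is not the code behind the chart: the sample trips are made up, and the age-bucket edges outside 25-34 and 35-44 are my assumption.

```python
from collections import Counter

DATA_YEAR = 2016  # the trips in this post are from May 2016

# Made-up sample: (gender, birth year) per trip. Not real Citi Bike records.
trips = [
    ("male", 1988), ("male", 1975), ("female", 1990),
    ("male", 1983), ("male", 1995), ("female", 1969),
    ("male", 1986), ("male", 1979),
]


def age_group(age: int) -> str:
    """Map an age to a bucket; the bucket edges are an assumption."""
    if age < 25:
        return "under 25"
    if age < 35:
        return "25-34"
    if age < 45:
        return "35-44"
    if age < 55:
        return "45-54"
    return "55+"


def shares(labels):
    """Return each label's share of the total, as a fraction between 0 and 1."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


gender_share = shares(gender for gender, _ in trips)
age_share = shares(age_group(DATA_YEAR - born) for _, born in trips)
```

Note that the shares are per trip, not per rider: a rider who takes ten trips counts ten times.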

 

Data Distribution of Riders per Hour of the Day (weekdays)

However, we might expect these young professionals to be the primary users on weekdays between 8am-9am and 5pm-6pm (when they commute to and from work) and the older users to take over the Citi Bike system at midday. This hypothesis proved to be wrong. Also, tourists seem to have little impact on usage, as short-term customers represent only 10% of the dataset.


Distribution of Riders per Hour of the Day (weekdays only)

 

Median Age per Departure Station

Looking at the median age of the riders at each departure station, we see the youngest riders in the East Village, while older riders start their commute from Lower Manhattan (as shown in the map below). The age trends disappear when mapping the arrival stations, above all in the Financial District (in Lower Manhattan), which is populated by the young wolves of Wall Street (map not shown).

The map also confirms that the Citi Bike riders are mostly between 30 and 45 years old.


Median Age per Departure Station
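The per-station figure behind a map like this is a grouped median. Here is a minimal sketch, assuming each trip has already been reduced to a departure station and a birth year. The station names and values are placeholders, not real data.

```python
from collections import defaultdict
from statistics import median

DATA_YEAR = 2016  # the trips in this post are from May 2016

# Made-up sample: (departure station, birth year).
trips = [
    ("Station A", 1990), ("Station A", 1988), ("Station A", 1984),
    ("Station B", 1970), ("Station B", 1976),
]


def median_age_by_station(rows):
    """Group rider ages by departure station and take the median of each group."""
    ages = defaultdict(list)
    for station, born in rows:
        ages[station].append(DATA_YEAR - born)
    return {station: median(values) for station, values in ages.items()}


result = median_age_by_station(trips)
```

Swapping the departure station for the arrival station in the input gives the second map mentioned above.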

 

 

Rides by Hour of the Day

Finally, when analyzing how the days of the week impacted biking behaviours, I was surprised to see that Citi Bike users didn’t ride for a longer period of time during the weekend: the median trip duration is 19 minutes for each day of the week.


Trip Duration per Gender and Age Group
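The day-of-week comparison can be sketched the same way: group trip durations by weekday and take the median of each group. The sample trips below are made up, so the numbers they produce say nothing about the real dataset.

```python
from collections import defaultdict
from datetime import datetime
from statistics import median

DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
        "Friday", "Saturday", "Sunday"]

# Made-up sample: (start time, trip duration in seconds).
trips = [
    (datetime(2016, 5, 2, 8, 15), 600),    # Monday
    (datetime(2016, 5, 2, 17, 40), 900),   # Monday
    (datetime(2016, 5, 7, 13, 5), 1500),   # Saturday
    (datetime(2016, 5, 7, 14, 30), 1200),  # Saturday
    (datetime(2016, 5, 7, 11, 0), 1800),   # Saturday
]


def median_minutes_by_weekday(rows):
    """Return the median trip duration in minutes for each weekday present."""
    by_day = defaultdict(list)
    for started, seconds in rows:
        by_day[DAYS[started.weekday()]].append(seconds / 60)
    return {day: median(minutes) for day, minutes in by_day.items()}


durations = median_minutes_by_weekday(trips)
```

The median is a sensible choice here because a handful of very long rentals would pull a mean upward.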

 

However, as illustrated below, there is a difference in peak hours. While the peak hours during the weekdays are around 8am-9am and 5pm-7pm when riders commute to and from work, on the weekends, riders hop on a bike later during the day, with most of the rides happening midday.

 


Number of Riders per Hour of the Day (weekdays vs. weekends)
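Counting rides per hour, split into weekdays and weekends, is enough to find the peak hours. The sketch below uses made-up start times and treats Saturday and Sunday as the weekend; it illustrates the idea and is not the code used for the chart.

```python
from collections import Counter
from datetime import datetime

# Made-up sample of trip start times.
start_times = [
    datetime(2016, 5, 2, 8, 15),   # Monday
    datetime(2016, 5, 2, 8, 50),   # Monday
    datetime(2016, 5, 3, 17, 30),  # Tuesday
    datetime(2016, 5, 7, 13, 5),   # Saturday
    datetime(2016, 5, 7, 11, 0),   # Saturday
    datetime(2016, 5, 8, 13, 40),  # Sunday
]


def rides_per_hour(times):
    """Count rides per start hour, separately for weekdays and weekends."""
    weekday, weekend = Counter(), Counter()
    for t in times:
        # datetime.weekday(): Monday is 0, so 5 and 6 are Saturday and Sunday.
        target = weekend if t.weekday() >= 5 else weekday
        target[t.hour] += 1
    return weekday, weekend


def peak_hour(counts):
    """Return the hour with the most rides."""
    return max(counts, key=counts.get)


weekday_counts, weekend_counts = rides_per_hour(start_times)
```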

 

 

The App

Where does this analysis leave us?

  • The day of the week and the hour of the day are meaningful variables that we need to take into account in the app.
  • Most of the users are between 30 and 45 years old. This means that the age groups 25-34 and 35-44 won’t be granular enough when app users need to filter their search. We will let them filter by age instead.

 

The Citi Tinder app in a few words and screenshots.

There are 3 steps to the app:

  • The "when": find the times and days when your ideal mate is most likely to ride.

[Screenshot: step 1, the "when"]

 

  • The "where": once you know the best times and days, filter the locations by day of the week, time of the day, gender and age. You can also select whether you want to spot where riders arrive or depart.

[Screenshot: step 2, the "where"]

 

  • The "how": the final step is to grab a Citi Bike and get to those hot spots. The app calls the Google Maps API to show the directions with a little extra: you can compare the time Google estimates to connect two stations with the average time it took Citi Bike users. I believe the latter is more accurate because it factors in the time of the day and day of the week (which the app lets you filter).

[Screenshot: step 3, the "how"]
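The comparison in the third step boils down to averaging the observed trip durations between two stations for a given day of the week and hour. Here is a minimal sketch under stated assumptions: the `Trip` fields, station names and sample values are hypothetical, and the routing estimate is a placeholder number because no Google Maps call is made.

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import mean
from typing import Optional


@dataclass
class Trip:
    start_station: str
    end_station: str
    started: datetime
    duration_s: int


def average_minutes(trips, origin, destination, weekday, hour) -> Optional[float]:
    """Mean duration in minutes of matching trips, or None if there are none.

    weekday follows datetime.weekday(): Monday is 0, Sunday is 6.
    """
    matches = [
        t.duration_s
        for t in trips
        if t.start_station == origin
        and t.end_station == destination
        and t.started.weekday() == weekday
        and t.started.hour == hour
    ]
    return mean(matches) / 60 if matches else None


# Made-up trips, not real Citi Bike records.
trips = [
    Trip("Station A", "Station B", datetime(2016, 5, 2, 8, 10), 600),
    Trip("Station A", "Station B", datetime(2016, 5, 9, 8, 45), 720),
    Trip("Station A", "Station B", datetime(2016, 5, 7, 13, 0), 900),
    Trip("Station B", "Station A", datetime(2016, 5, 2, 8, 20), 500),
]

observed = average_minutes(trips, "Station A", "Station B", weekday=0, hour=8)
routing_estimate = 9.0  # placeholder for an externally supplied estimate, in minutes
difference = observed - routing_estimate
```

Returning `None` when no trips match keeps an empty filter from being mistaken for a zero-minute ride.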

 

Although screenshots are nice, the interactive app is better so head to the first step of the app to get started!

 

 

Would Have, Should Have, Could Have

This is the first of the four projects from the NYC Data Science Academy Data Science Bootcamp program. With a two-week timeline and only 24 hours in a day, it was impossible to cover every data angle. Below is a quick list of the analyses I could have, would have and should have done if given more time and data:

  • Limited scope: I only took the data from May 2016. However, I expect the Citi Bike riders to behave differently depending on the season, temperature, etc. Obviously, the bigger the sample size, the more reliable the insights are.
  • Missing data: There was no data on the docks available per station that could be scraped from the Citi Bike website. The map would have been more complete if the availability of docks had been displayed.
  • Limited number of variables: I would have liked to have more demographic data (aside from gender and age); a dating app with only age and gender as filters is restrictive...
  • Incomplete filters: With more time, I'd have added a 'speed' filter to the second step of the app (the 'where' part) so that hard-core cyclists could find the fastest riders...
  • Sub-optimal visualization: I am aware that the map on the introduction page (with the dots displaying the median age per station) is hard to read. With more time, I'd have used polygons instead to group stations by neighbourhood.
  • Finally, I would have liked to track unique users. Although users don't have a unique identifier in the Citi Bike dataset, I could have identified unique users by looking at their gender, age, zip and usual start/end stations.
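The last idea can be sketched as grouping trips by a profile key. This is only a heuristic: the key below (gender, birth year, start station, end station) is my assumption, the zip code is left out because it is not among the variables listed earlier, and different riders can share a profile, so each group is a candidate rider at best. The rows are made up.

```python
from collections import defaultdict

# Made-up rows: (gender, birth year, start station, end station).
rows = [
    ("female", 1985, "Station A", "Station B"),
    ("male", 1985, "Station A", "Station B"),
    ("female", 1985, "Station A", "Station B"),
]


def group_by_profile(trip_rows):
    """Map each profile key to the indexes of the trips that share it."""
    groups = defaultdict(list)
    for index, profile in enumerate(trip_rows):
        groups[profile].append(index)
    return dict(groups)


candidates = group_by_profile(rows)
```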

About Author

Claire Keser

Claire Keser completed her MBA at the University of Victoria (Canada). Her work experience has been primarily in Conversion Optimization (A/B testing) where she built & led a team focused on turning data into products, actionable insights, and...