Data Study of Velocity of Citi Bikes
The skills the author demonstrates here can be learned through the Data Science with Machine Learning bootcamp at NYC Data Science Academy.
Contributed by Taraqur Rahman. He is currently in the NYC Data Science Academy 12-week full-time Data Science Bootcamp program, which runs from April 11th to July 1st, 2016. This post is based on his first class project, R visualization (due in the second week of the program).
Data shows that bikes have become an extremely popular mode of transportation. Riding a bike is healthy, it releases no pollution into the air, and rising oil prices are not a concern. Citigroup saw the opportunity in this, teamed up with Motivate, a global bike sharing company, and launched the Citi Bike program in May 2013. I thought it would be interesting to find out what can be learned from the Citi Bike program.
The dataset I worked with covers the winter months of December 2015, January 2016, and February 2016. Looking at the data set, I noticed that it included coordinates and trip durations. Right away I was curious to see what the velocity of the Citi Bikes could tell us.
Engineering the velocity feature was simple using the coordinates and the trip duration. A histogram was generated to show the frequency of the velocities. It covers trips with a tripduration of 45 minutes or less. The annual subscription allowed unlimited rides of up to 45 minutes each, so the analysis focuses on trips within that time limit.
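The velocity feature described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the post does not say how the distance between the two coordinates was computed, so the haversine (great-circle) formula is used here as one reasonable choice, the field names (`start_lat`, `tripduration`, and so on) are hypothetical, `tripduration` is assumed to be in seconds, and the sample rows are made up.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius
MAX_SECONDS = 45 * 60        # 45-minute limit discussed in the text


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points, in miles."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


def velocity_mph(trip):
    """Station-to-station displacement divided by trip duration, in mph."""
    miles = haversine_miles(trip["start_lat"], trip["start_lon"],
                            trip["end_lat"], trip["end_lon"])
    hours = trip["tripduration"] / 3600  # assumption: duration in seconds
    return miles / hours


# Made-up sample rows (not taken from the real data set).
trips = [
    {"start_lat": 40.7100, "start_lon": -73.9650,
     "end_lat": 40.7200, "end_lon": -73.9550, "tripduration": 600},
    # Round trip: same start and end station, so displacement is zero.
    {"start_lat": 40.7100, "start_lon": -73.9650,
     "end_lat": 40.7100, "end_lon": -73.9650, "tripduration": 1500},
    # Longer than 45 minutes, so it is filtered out below.
    {"start_lat": 40.7100, "start_lon": -73.9650,
     "end_lat": 40.7500, "end_lon": -73.9900, "tripduration": 3600},
]

short_trips = [t for t in trips if t["tripduration"] <= MAX_SECONDS]
velocities = [velocity_mph(t) for t in short_trips]
```

The list `velocities` is what would feed the histogram; a trip that starts and ends at the same station contributes a value of exactly zero.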
The graph above shows two interesting points. The first is that there are a few outliers on the right side of the graph (barely visible). The maximum velocity was about 36.9 mph, and the average velocity was about 5.752 mph. The second is the spike at zero, which was expected. A velocity of zero does not mean that nobody rode the bike; it means that the bike ended up where it started. Velocity here is the ratio of displacement to time. This is a crucial concept to understand, so I would like to take a little time to explain it.
The displacement is the shortest distance from the starting dock station to the ending dock station. For instance, in the map above, if a person picked up a bike at the South Williamsburg dock station, traveled over the Williamsburg Bridge, up a few Manhattan blocks, through the Midtown Tunnel, and dropped it off in North Williamsburg, then the distance traveled would be the black line. Velocity, however, is concerned with the displacement, which is the shortest path from South to North Williamsburg: the red line.
Now, if the same person decided to dock the bike in South Williamsburg (the same station he or she started from), what would the displacement be? It would be zero. Therefore, going back to the velocity histogram, the spike at zero is due to people who made round trips with the bikes. To confirm this, I made a graph showing the trip duration throughout the day for the bikes with zero velocity.
Even though these trips have zero velocity, the bikes were still ridden for a certain amount of time throughout the day. Between 10:00 and 16:00, people make round trips of more than twenty-five minutes. Unfortunately, figuring out the actual speed for this subset is impossible, because only the start and end stations are known and not the route taken. But we can infer that these people might use the bikes for a quick workout, to run errands, or just to take a break and enjoy a bike ride.
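As a rough sketch of how such a check could be done, the snippet below keeps only the zero-velocity trips and averages their duration by starting hour. The field names (`start_hour`, `duration_min`, `velocity_mph`) and the sample rows are hypothetical, and using the mean rather than some other summary is an assumption made for illustration.

```python
from collections import defaultdict
from statistics import mean

# Made-up sample rows (not taken from the real data set).
trips = [
    {"start_hour": 8, "duration_min": 12.0, "velocity_mph": 0.0},
    {"start_hour": 11, "duration_min": 30.0, "velocity_mph": 0.0},
    {"start_hour": 11, "duration_min": 26.0, "velocity_mph": 0.0},
    {"start_hour": 11, "duration_min": 9.0, "velocity_mph": 6.1},
    {"start_hour": 15, "duration_min": 28.0, "velocity_mph": 0.0},
]


def mean_round_trip_minutes_by_hour(trips):
    """Average duration of zero-velocity (round) trips for each start hour."""
    by_hour = defaultdict(list)
    for trip in trips:
        if trip["velocity_mph"] == 0:  # bike returned to its starting station
            by_hour[trip["start_hour"]].append(trip["duration_min"])
    return {hour: mean(durations) for hour, durations in sorted(by_hour.items())}


summary = mean_round_trip_minutes_by_hour(trips)
```

Hours with no round trips simply do not appear in `summary`, and the one-way trip at 11:00 is ignored because its velocity is not zero.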
Velocity by Time of Day
Speaking of time of day, the box plot below displays the velocity for three periods. Morning is from 4 am to 12 pm, afternoon is from 12 pm to 8 pm, and night is from 8 pm to 4 am.
The velocity is almost constant throughout the day. Morning has the highest average velocity, which might make sense since people are rushing to get to work. Night has the next highest velocity. I would assume that at night there are fewer people around and less traffic, so people tend to go faster.
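The three time-of-day periods can be expressed as a small helper, sketched here in Python. The post does not say which period the boundary hours belong to, so treating each period as including its first hour and excluding its last is an assumption, as are the function names and the made-up `(hour, velocity)` pairs.

```python
from collections import defaultdict
from statistics import mean


def time_of_day(hour):
    """Map an hour of the day (0-23) to Morning, Afternoon, or Night."""
    if 4 <= hour < 12:
        return "Morning"
    if 12 <= hour < 20:
        return "Afternoon"
    return "Night"  # 8 pm up to, but not including, 4 am


def mean_velocity_by_period(trips):
    """Average velocity for each period from (start_hour, velocity_mph) pairs."""
    groups = defaultdict(list)
    for hour, velocity in trips:
        groups[time_of_day(hour)].append(velocity)
    return {period: mean(values) for period, values in groups.items()}


# Made-up sample values (not taken from the real data set).
trips = [(7, 6.2), (9, 5.8), (14, 5.0), (18, 5.4), (22, 5.9), (2, 5.5)]
averages = mean_velocity_by_period(trips)
```

Grouping the raw velocities this way gives the same three buckets that a box plot by period would be drawn from.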
Velocity by Age
Next is the analysis by age. Looking at the ages, I noticed that some riders were listed as 105, 115, or 125 years old, and that many ages were missing. I doubt that these ages are correct. It seems that people simply entered a random birth year in the dropdown list when signing up. These age imposters still rode the bikes, though, so their data can be relevant.
To fix this issue, I set 80 as the cut-off age. For ages greater than 80, I entered the mean age (computed excluding those over 80), which is 38.6. Mean imputation was used because the mean velocity of the total population is similar to the mean velocity of those who are 80 and under. For the missing ages, random imputation was used. The mean velocity for trips with a missing age was 3.96 mph, which is considerably lower than the mean velocity of the whole population. The graph below shows the velocity by age after the imputation.
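The two imputation steps can be sketched as follows. This is a minimal illustration, not the original project code: "random imputation" is interpreted here as drawing a replacement at random from the observed ages of 80 and under, which is an assumption, and the function name, the seed parameter, and the sample ages are made up.

```python
import random
from statistics import mean

AGE_CUTOFF = 80  # cut-off age from the text


def impute_ages(ages, seed=0):
    """Replace ages above the cut-off with the mean valid age, and
    missing ages (None) with a random draw from the valid ages."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    valid = [a for a in ages if a is not None and a <= AGE_CUTOFF]
    mean_valid = mean(valid)
    imputed = []
    for age in ages:
        if age is None:
            imputed.append(rng.choice(valid))  # random imputation (assumed form)
        elif age > AGE_CUTOFF:
            imputed.append(mean_valid)         # mean imputation
        else:
            imputed.append(age)
    return imputed


# Made-up sample ages (not taken from the real data set).
ages = [25, 40, None, 115, 31, None, 52]
imputed = impute_ages(ages)
```

Sampling from the observed ages keeps the shape of the age distribution for the missing values, whereas mean imputation would pile them all onto a single age.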
One application I can think of for the velocity feature of the Citi Bikes is to see where the "hotspots" for riders are.
If we use the average velocity as the standard and look at where along the routes people's velocity falls below that average, we can find the areas where people are taking their time and cruising through. This can have a marketing application. In areas where the velocity is low (people are most likely going for a casual ride), the advertising can focus on laid-back themes, such as drinking a Corona at the beach. On the other hand, where the velocity is high (people are riding fast to their destination), there can be fast-paced advertising, such as Usain Bolt running in Pumas.
For further analysis, I would like to consider a whole year's worth of data. This way we can see trends across the seasons. I would assume that velocity decreases during the summer, because more people ride casually to enjoy the weather, and that more people ride the bikes overall. During the winter, people try to get to their destination as soon as possible to avoid the cold, so the velocity would be higher.