Data Analysis on NYC Taxi Riders' Tipping Behavior
The skills demonstrated here can be learned in the Data Science with Machine Learning bootcamp at NYC Data Science Academy.
To Tip or Not to Tip, That is the Question!
In this blog post, we study the data from both NYC yellow and green taxis, focusing on New Yorkers' tipping habits. Tipping behavior reflects customer satisfaction or dissatisfaction with a ride. We contrast the tipping habits of green taxi and yellow taxi passengers.
In 2013, the TLC (Taxi and Limousine Commission), under then-Mayor Bloomberg, launched a program introducing a new class of taxis painted 'apple green'. Green taxis are the second-class citizens of the NYC taxi world: they cannot compete with the yellow taxis inside the protected 'yellow zone' (below E 96th Street and W 110th Street) and may pick up passengers only outside it. Green taxis serve New York City by going to places where yellow taxi drivers prefer not to go.
From the yellow taxi (shown above) and green taxi (shown below) origination plots, we observe that yellow taxi pickups concentrate in lower Manhattan and at the JFK and LaGuardia airports, while green taxi pickups concentrate in upper Manhattan and in dense pockets of Brooklyn and Queens.
Because the combined yellow/green taxi data set is quite large (~25 GB), we have to process the yellow taxi data in batch mode (it is too big to fit into the RAM of our laptop). At the end we link to some of our code to demonstrate our techniques. We use the green taxi data set, which is a little more than one-eighth the size of the yellow taxi data, as our yardstick: we develop our ideas and observations on the green taxi data first, which saves a lot of effort when wrangling the bigger yellow taxi data set.
Section 0. The R Packages and R Code in Our Data Analysis and Visualization
For those who are interested in the technical aspects of our study, we highlight our approach here. Because of the length limit of this blog, the main section presents less than 10 percent of our full study.
In this project we use the NYC yellow and green taxi datasets, which can be downloaded from the green and the yellow taxi data sources.
We use R in the RStudio environment, with various packages to handle data aggregation and geospatial analysis. 'dplyr' is our workhorse for data aggregation and manipulation, 'geojsonio' and 'sp' handle the conversion of the GPS location data, and 'ggplot2' and 'ggmap' handle plotting.
Technically, the taxi data has no missing values in its '.csv' files.
But the data is very dirty and needs a lot of cleaning. The sources of the dirtiness include:
- Unrealistic fares and travel speeds
- Negative durations or trip distances
- Wrong GPS coordinates
- Long-distance rides to other major cities disguised as taxi rides
There are other problems as well, so readers who start to study the taxi data must pay special attention to cleaning it.
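The checks listed above can be expressed as a single filter. The following is a minimal sketch in Python (the study itself was done in R, and this is not the authors' code); the field names, thresholds, and bounding box are hypothetical values chosen only for illustration.

```python
from datetime import datetime

# Hypothetical thresholds, chosen for illustration only.
MAX_SPEED_MPH = 80.0
MAX_DISTANCE_MILES = 100.0
MAX_FARE_USD = 500.0
# Rough, approximate bounding box around the five boroughs.
LAT_RANGE = (40.45, 41.0)
LON_RANGE = (-74.3, -73.65)


def is_clean(trip):
    """Return True if a trip record passes every sanity check."""
    hours = (trip["dropoff"] - trip["pickup"]).total_seconds() / 3600.0
    if hours <= 0 or trip["distance"] <= 0:
        return False  # negative (or zero) duration or distance
    if trip["distance"] > MAX_DISTANCE_MILES:
        return False  # long-distance ride to another city
    if not (0 < trip["fare"] <= MAX_FARE_USD):
        return False  # unrealistic fare
    if trip["distance"] / hours > MAX_SPEED_MPH:
        return False  # unrealistic travel speed
    if not (LAT_RANGE[0] <= trip["lat"] <= LAT_RANGE[1]):
        return False  # wrong GPS latitude
    if not (LON_RANGE[0] <= trip["lon"] <= LON_RANGE[1]):
        return False  # wrong GPS longitude
    return True


def make_trip(start, end, distance, fare, lat=40.75, lon=-73.98):
    """Build a synthetic trip record (times given as (hour, minute))."""
    return {
        "pickup": datetime(2015, 1, 5, *start),
        "dropoff": datetime(2015, 1, 5, *end),
        "distance": distance, "fare": fare, "lat": lat, "lon": lon,
    }


trips = [
    make_trip((8, 0), (8, 20), 3.0, 14.0),            # ordinary ride
    make_trip((8, 20), (8, 0), 3.0, 14.0),            # negative duration
    make_trip((8, 0), (8, 20), 3.0, 14.0, 0.0, 0.0),  # GPS reads (0, 0)
    make_trip((8, 0), (8, 5), 30.0, 90.0),            # 360 mph
    make_trip((8, 0), (12, 0), 220.0, 450.0),         # ride to another city
]
clean_trips = [t for t in trips if is_clean(t)]
```

Only the first synthetic record survives the filter. In practice the thresholds would be tuned by inspecting the distributions of each column.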
The main bottleneck in handling the two large NYC taxi data sets in RStudio is the RAM of the MacBook Pro laptop we use. All operations on R data frames assume that the data is loaded into memory in one shot, and we need additional space to manipulate the raw data.
Yet the combined size of the 2015 yellow taxi files is clearly beyond the capacity of a typical laptop. People often compromise either by shortening the time period of the sample or by selecting a shorter list of columns to study, both of which cut down the memory requirement. Both approaches have negative side effects. The former shrinks the sample and hence weakens the statistical significance of the study; the latter can miss interesting patterns across columns, because some of the columns were removed at the very beginning.
We remark that the TLC's version of the yellow taxi data is broken down into 12 monthly files, each about 2 GB in size. Instead of handling the data in one shot, we process the monthly files iteratively in batch mode. The green taxi data plays a key role here. Because of its smaller size, we can load it into RStudio in one shot. As the two data sets are similar in many respects, the insights we gain by studying the green taxi data first give us important clues about how to design the batch-processing code for the yellow taxi data.
A reader who does not intend to include the green taxi data in the study can sample about 10% (or less) of the yellow taxi monthly data as a toy example to start the project. Although random sampling introduces additional statistical noise, it still provides useful insights about the full data set.
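A roughly 10% sample can be drawn in a single pass without ever loading the whole file. The sketch below is in Python rather than R and is only an illustration, not the authors' code; the column names and the in-memory stand-in for a monthly file are made up.

```python
import csv
import io
import random


def sample_rows(lines, fraction, seed=0):
    """Yield the header, then each data row with probability `fraction`.

    Rows are streamed one at a time, so memory use stays constant
    no matter how large the input file is.
    """
    rng = random.Random(seed)  # fixed seed makes the sample reproducible
    reader = csv.reader(lines)
    yield next(reader)  # always keep the header row
    for row in reader:
        if rng.random() < fraction:
            yield row


# Synthetic stand-in for one monthly file (hypothetical columns).
fake_file = io.StringIO(
    "trip_id,fare,tip\n" + "".join(f"{i},10.0,1.5\n" for i in range(10000))
)
sample = list(sample_rows(fake_file, fraction=0.10))
```

With a real file, `fake_file` would be replaced by an open file handle, and the sampled rows could be written straight to a smaller CSV.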
To process the monthly yellow taxi files, we use the small R package 'saves' and the dump function in base R. The 'saves' and 'loads' functions allow us to split a multi-column data frame into many one-column data frames and store them in a tape archive of native R binary files.
This lets us quickly load selected columns of a huge on-disk data frame into memory. The dump function lets us save each month's aggregation results, in append mode, to the same .R file. We can then repopulate the global environment with those monthly aggregates by a simple 'source(filename)' command.
Finally, we remark that we view each monthly yellow taxi file as a virtual block and write operations that work on the list of twelve blocks, (semi-)mimicking the map-reduce operation common in big data processing. We first aggregate the computations to the monthly level and write everything to disk. In analysis mode we 'source' the previous computations and perform the second layer of reduction.
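The two-pass pattern described here can be sketched as follows. This is an illustrative Python sketch, not the authors' R code: JSON files stand in for the dumped .R file, and the month names, field names, and tiny synthetic blocks are all hypothetical.

```python
import json
import os
import tempfile


def map_month(rows):
    """Map step: shrink one month's rows to a few summary statistics."""
    return {
        "n_trips": len(rows),
        "tip_sum": sum(r["tip"] for r in rows),
        "fare_sum": sum(r["fare"] for r in rows),
    }


def reduce_months(summaries):
    """Reduce step: add the monthly summaries into one yearly summary."""
    total = {"n_trips": 0, "tip_sum": 0.0, "fare_sum": 0.0}
    for summary in summaries:
        for key in total:
            total[key] += summary[key]
    return total


# Synthetic stand-ins for the monthly blocks; in a real run each block
# would be read from its own file so only one month is in memory at a time.
months = {
    "2015-01": [{"fare": 10.0, "tip": 2.0}, {"fare": 20.0, "tip": 3.0}],
    "2015-02": [{"fare": 30.0, "tip": 6.0}],
}

with tempfile.TemporaryDirectory() as workdir:
    # Processing pass: aggregate each block and write the result to disk.
    for name, rows in months.items():
        with open(os.path.join(workdir, name + ".json"), "w") as f:
            json.dump(map_month(rows), f)

    # Analysis pass (possibly a fresh session): read the small summaries back.
    summaries = []
    for filename in sorted(os.listdir(workdir)):
        with open(os.path.join(workdir, filename)) as f:
            summaries.append(json.load(f))

year = reduce_months(summaries)
```

Because only sums and counts are stored, the second pass touches a few kilobytes instead of the raw monthly files.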
We guide the reader through several key portions of our R code. Parts of the green taxi and yellow taxi analyses are completely parallel, so we show only one of them to avoid redundancy.
To process the raw green taxi data and to plot the 2D heat maps of the green taxi speed,
To analyze the traffic within and across boroughs, we have
cross borough traffic analysis
To analyze the tips and the change w.r.t. speeds and the other factors, we have
For the yellow taxi, the hard part is processing data that does not fit into memory.
We use a for loop to process the files month by month,
yellow taxi data processing loop
Once the data has been processed, we can start a new session or refresh the existing session to perform the analysis,
The analysis after the data is stored on the hard disk
The above analysis relies on several lower-level UDFs (user-defined functions), which we show in the following,
The preprocessing of the yellow taxi data converts the data columns into the formats we want, cleans the outliers, and removes the long-duration trips to other cities.
The companion function performs the data aggregation and dumps the results into a single R file,
To read the data on the hard disk back into memory for further analysis, we need two utility functions,
Aggregate the monthly statistics into yearly statistics
Convert the monthly sum statistics into whole-year mean statistics
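The reason for storing monthly sums rather than monthly means is that sums and counts can be added across months, while means cannot simply be averaged. The Python sketch below illustrates this with made-up numbers; it is not the authors' utility function, and the dictionary keys are hypothetical.

```python
def yearly_mean(monthly):
    """Combine per-month (sum, count) pairs into one whole-year mean."""
    total = sum(m["sum"] for m in monthly)
    count = sum(m["count"] for m in monthly)
    return total / count


# Two synthetic months with very different trip counts.
monthly = [
    {"sum": 30.0, "count": 2},  # monthly mean 15.0
    {"sum": 40.0, "count": 8},  # monthly mean 5.0
]

mean_from_sums = yearly_mean(monthly)  # 70.0 / 10 trips
mean_of_means = sum(m["sum"] / m["count"] for m in monthly) / len(monthly)
```

Here `mean_from_sums` is 7.0 while `mean_of_means` is 10.0: averaging the monthly means gives the quiet month the same weight as the busy one and overstates the yearly figure.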
Section 1. To Tip or Not to Tip, That is the Question!
In this section, we study taxi passengers' tipping behavior, which is a proxy for customer satisfaction or dissatisfaction with their rides. We will show the reader that tipping involves interesting and unusual human behavior and thinking.
First, we display the 2D heat maps for both yellow and green taxis, which show that the tipping percentages fluctuate non-trivially across hours of the day and days of the week. The color scale of the yellow cab tipping percentages ranges from 10% to 15%, while the green one ranges from 6% to 10%.
We adopt the following convention for displaying our diagrams: when we display a pair of diagrams consecutively for the yellow and the green taxis, the yellow taxi plot always comes first and the green taxi plot follows.
Contrary to what one might naively expect, passenger tipping behavior displays strong intraday ups and downs. Passengers do not routinely give the same average percentage of tips from morning to midnight.
Instead, both green cab and yellow cab passengers tip more generously during the morning rush hours, the evening rush hours, and late into the night. We also see a spillover effect from the previous night into the hours after midnight. This is particularly significant from Tuesday to Friday. One major distinction between green cab and yellow cab passengers is that lower Manhattan yellow taxi passengers pay relatively low tips during the weekends, whereas green taxi passengers do not treat the weekends nearly as differently.
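The numbers behind such a heat map are just average tip percentages grouped by weekday and hour. The sketch below is an illustrative Python version, not the authors' R code; it assumes the tip percentage is the tip divided by the fare and averaged per trip, and the field names and sample trips are made up.

```python
from collections import defaultdict
from datetime import datetime


def tip_percent_grid(trips):
    """Average tip percentage for each (weekday, hour) cell.

    Weekdays follow Python's convention: Monday is 0, Sunday is 6.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for t in trips:
        cell = (t["pickup"].weekday(), t["pickup"].hour)
        sums[cell] += 100.0 * t["tip"] / t["fare"]
        counts[cell] += 1
    return {cell: sums[cell] / counts[cell] for cell in sums}


# Synthetic trips: two on a Monday morning, one late on a Saturday night.
trips = [
    {"pickup": datetime(2015, 1, 5, 8, 15), "fare": 10.0, "tip": 2.0},
    {"pickup": datetime(2015, 1, 5, 8, 45), "fare": 20.0, "tip": 2.0},
    {"pickup": datetime(2015, 1, 10, 23, 5), "fare": 10.0, "tip": 1.0},
]
grid = tip_percent_grid(trips)
```

Each key of `grid` is one cell of the heat map, so the dictionary can be handed to any plotting routine that draws a 7 by 24 grid.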
In the above 2D heat maps, we report the average tip percentages across all passengers, after cleaning the data (e.g. removing unusually long trips and input errors). We need to point out that within both passenger groups, slightly more than 50% of passengers pay by cash and slightly less than 50% pay by credit card.
Because tip income is taxable, taxi drivers have no incentive to enter cash tips voluntarily when the electronic system does not record them automatically. No wonder most of the recorded cash tips are zero, and they average around 2% or less among all cash payers. To make our analysis more meaningful, we need to treat the cash payers' tips as 'missing values'.
From now on, we focus on the tipping behavior of credit card payers for both yellow and green taxi passengers.
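Treating cash tips as missing rather than as zeros changes the averages substantially. The following Python sketch illustrates the effect on made-up trips; it is not the authors' code, and the `payment` labels and field names are hypothetical.

```python
def card_tip_percent(trip):
    """Tip percentage for a card trip; None (missing) for a cash trip."""
    if trip["payment"] != "card":
        return None  # cash tips are mostly unrecorded, so treat as missing
    return 100.0 * trip["tip"] / trip["fare"]


# Synthetic trips: cash rides show a recorded tip of zero.
trips = [
    {"payment": "card", "fare": 10.0, "tip": 2.0},
    {"payment": "cash", "fare": 10.0, "tip": 0.0},
    {"payment": "card", "fare": 20.0, "tip": 2.0},
    {"payment": "cash", "fare": 30.0, "tip": 0.0},
]

observed = [p for p in map(card_tip_percent, trips) if p is not None]
card_mean = sum(observed) / len(observed)

# Naive average that wrongly counts unrecorded cash tips as real zeros.
naive_mean = sum(100.0 * t["tip"] / t["fare"] for t in trips) / len(trips)
```

In this toy data the card-only average is 15%, while the naive average over all trips is 7.5%.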
Question: What factor drives the typical passengers to pay higher tips or lower tips?
In both of the tip percentage heat maps, we notice that the tipping percentages depend strongly on the time of day. Passengers seem to acknowledge that it is more difficult to drive during the rush hours and adjust their tipping percentages upward. Lower Manhattan passengers, at least, recognize the ease of driving during the weekend and discount their tip percentages downward then. Beyond these simple observations, what can we say about the passengers' tipping behavior?
As it turns out, we find to our surprise that the tipping percentages are strongly related to the travel speeds the passengers experience. To demonstrate this, we classify the multi-million samples of yellow cab and green cab rides into speed bins 0.5 mph wide. Then we compute the average tipping percentage in each bin and plot speed against tip percentage as scatter plots. The average tip percentage within each bin can be interpreted as the tipping consensus of all the passengers who experienced that particular ride speed.
We have to remove the statistical outliers to prevent them from introducing too much noise into our analysis. In our data sets, there are overly generous people who pay 200% of their fares as tips. We salute the good luck of their drivers, but we do not want these samples to pollute our analysis. Because the standard deviation of the tip percentages is about 12%, we remove the samples whose tip percentage exceeds a 50% threshold in both data sets, in order to focus on the more typical tipping behavior.
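The binning and outlier cutoff can be sketched as follows. This is an illustrative Python version, not the authors' R code: the 0.5 mph bin width and the 50% cutoff come from the text, while the field names, the tip-over-fare definition of tip percentage, and the sample trips are assumptions.

```python
import math
from collections import defaultdict

BIN_WIDTH_MPH = 0.5     # bin width stated in the text
MAX_TIP_PERCENT = 50.0  # outlier threshold stated in the text


def tip_by_speed_bin(trips):
    """Return {bin_lower_edge_mph: (mean_tip_percent, n_trips)}."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for t in trips:
        tip_pct = 100.0 * t["tip"] / t["fare"]
        if tip_pct > MAX_TIP_PERCENT:
            continue  # drop overly generous outliers
        speed = t["miles"] / t["hours"]
        edge = math.floor(speed / BIN_WIDTH_MPH) * BIN_WIDTH_MPH
        sums[edge] += tip_pct
        counts[edge] += 1
    return {e: (sums[e] / counts[e], counts[e]) for e in sorted(sums)}


# Synthetic card-paying trips (hypothetical field names).
trips = [
    {"miles": 3.0, "hours": 0.25, "fare": 10.0, "tip": 2.0},   # 12.0 mph
    {"miles": 3.1, "hours": 0.25, "fare": 20.0, "tip": 2.0},   # 12.4 mph
    {"miles": 5.0, "hours": 0.25, "fare": 10.0, "tip": 1.0},   # 20.0 mph
    {"miles": 3.0, "hours": 0.25, "fare": 10.0, "tip": 20.0},  # 200% tip
]
bins = tip_by_speed_bin(trips)
```

Keeping the trip count next to each bin average matters, because bins at the extremes of the speed range hold far fewer rides than bins near the peak.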
These scatter plots are shockingly good in that they do not look like 'scatter' plots at all.
Instead, the points more or less fall onto non-linear curves that can be identified by the naked eye. This is a strong indication that the relationship between speed and consensus tipping percentage is strong and that the statistical noise is minimal. We remark on the major characteristics of the two plots above. The lower (green) tip plot displays a pronounced inverted-V shape, which also appears to a lesser extent in the upper (yellow) tip plot. Besides the main branch, which declines at high speeds, the yellow cab plot has a minor branch that increases monotonically with speed.
We remind the reader that each point in the above tip-percent vs speed-bin plots aggregates a different number of rides, and we color the points accordingly in the above plots. The bar plots below tell us that the yellow cab ride count distribution peaks below 10 mph, while the green cab ride count distribution peaks within the 10-15 mph range; both distributions are skewed to the right. These are the speed ranges that most passengers experience.
To gain the key insight from the plots, we need a rational interpretation of the passengers' behavior. When the speed is below 10 mph, the passenger consensus shows a certain degree of dissatisfaction. The average tipping percentage starts low and increases monotonically as the speed increases. It peaks when the speed nears the 'optimal' value, which is about 12.5 mph for card-paying yellow cab riders and 14 mph for card-paying green cab riders.
This is completely rational because all taxi passengers on earth want to get to their destinations quickly. What seems surprising is that beyond the 'optimal' driving speed, the passengers' tip percentages start to drop! What on earth is the rational explanation for such behavior? Shouldn't we reward higher speeds that get us to our destinations sooner?
After some reflection, we realize that typical passengers take the local traffic conditions into account and reward taxi drivers when the traffic is jammed. But where in the five boroughs of NYC can a taxi drive faster than 20 mph? Mostly on the expressways and highways. For example, the green cab rides with speeds in the [20, 50] mph range are often (but not always) rides to LaGuardia airport. For these rides, the passengers' consensus seems to discount the tip because driving in that kind of traffic is easy.
We break the average tipping percentage of all card payers down further into two components. Among the card payers, not everyone tips on every trip, so the final average tip percentage is influenced by two key factors. The first is the tip percentage among those who actually pay tips.
The second is the rate of card-paying passengers who do NOT tip on a trip. We do not know for sure whether non-tipping is a long-term habit, so we assume the decision is made on a per-trip basis. The average tip percentages we have observed are shaped by both factors. The second one is very important in that it is a proxy for the passengers who are unhappy about the service they have received. There is often a psychological threshold at which a non-tipper flips into a tipper or vice versa. Studying both factors gives us important clues about the financial feedback passengers give their drivers.
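The two factors combine multiplicatively: the overall average tip percentage equals the share of trips with a tip times the average tip percentage among tippers. The Python sketch below checks this identity on made-up numbers; it is an illustration, not part of the authors' analysis code.

```python
def decompose(tip_percents):
    """Split card-payer tips into the non-tipping rate and the tippers' mean."""
    tippers = [p for p in tip_percents if p > 0]
    non_tip_rate = 1.0 - len(tippers) / len(tip_percents)
    tipper_mean = sum(tippers) / len(tippers)
    return non_tip_rate, tipper_mean


# Synthetic tip percentages for four card trips; one rider did not tip.
tip_percents = [20.0, 20.0, 10.0, 0.0]

non_tip_rate, tipper_mean = decompose(tip_percents)
overall_mean = sum(tip_percents) / len(tip_percents)

# Identity: overall mean = (1 - non-tipping rate) * mean among tippers.
reconstructed = (1.0 - non_tip_rate) * tipper_mean
```

A drop in the overall average can therefore come from tippers giving less, from more riders giving nothing, or from both, which is why the two factors are worth tracking separately.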
New York yellow taxi data has been the subject of many news reports and blogs since it was opened to the public a few years ago. Its relatively large size has also attracted the attention of the big data community, which uses it as a toy example for testing parallel algorithms. In comparison, the smaller green taxi data does not seem to attract an equal amount of attention. This is evident from the relatively low number of downloads from the NYC open data web site. In this blog article, we demonstrate that the less well-known green taxi data also contains interesting information about human behavior.
We find that NYC taxi riders are influenced mostly by the speed, the hour of travel, and the duration of the trip when deciding whether to tip and how much. It is intriguing to see the diversity of tipping behavior among NYC residents living side by side. We hope that this research on taxi traffic and tipping behavior contributes to the community of data enthusiasts working on the taxi data sets and, more generally, to the residents of New York City.