Rideshare Market Data Analysis in Boston for Uber vs. Lyft
The skills I demoed here can be learned by taking the Data Science with Machine Learning bootcamp at NYC Data Science Academy.
What is this project about?
The seemingly overnight success of Uber in the early 2010s marked the inception of the rideshare market, which soon grew into a billion-dollar global industry and, more noticeably, became ingrained in the transportation infrastructure of many countries.
Throughout the last 10+ years, market share has been a head-to-head battle between the leading players, Uber and Lyft, and with such a universally used service, the choice between the two frequently comes up in conversation among users. For such a young industry, it came as a surprise that when COVID-19 spread across the globe in Spring 2020, the rideshare market was among the first to come to mind when people were asked about the hardest-hit businesses.
Mass transition to working from home
The mass transition to working from home, paired with the temporary (and sometimes permanent) closure of many restaurants, bars, gyms, events, etc., resulted in an increasingly static society and a decrease in transportation demand. Drops of 60 to 70 percent in the total number of trips in many major cities left drivers who had previously depended on as many as 20 to 30 fares per day essentially empty-handed and looking for steady streams of income elsewhere.
Fast-forward a year and a half, and strict shelter-in-place policies along with aggressive vaccination rollout initiatives culminated in the first move towards normalcy: the lifting of the mask mandate in many states. The return to in-person working environments, the reopening of dining, and increased levels of travel quickly accelerated, but were hindered by the absence of a pre-COVID convenience, raising the question:
How do I get there? It has become apparent in many major urban areas across the US that the rate of reopening has outpaced the rate at which drivers are returning to an already severely depleted pool. Residents of Boston, MA have been especially vocal about the issue, citing extremely long wait times, drastically fluctuating fares, and a general lack of reliability across both leading rideshare platforms.
The current heightened inconsistency of rideshare offerings inspired my idea to analyze both Uber and Lyft from a user's perspective in an attempt to uncover the 'better' platform based on variables including price, location, and time.
Due to the short timeline of the project, I decided to streamline the data sourcing process by searching for an existing dataset online and to skip scraping, under the assumption that scraping would have left little time for the analysis itself. Luckily, I was able to find a dataset containing information on roughly 700,000 rides in Boston from both Uber and Lyft.
The downside of sourcing data from an existing set showed in its relevance: the rides took place over a 22-day window in 2018, between November 26th and December 18th. This information obviously wouldn't be telling for current rideshare conditions, although it could act as an interesting benchmark against which to compare post-pandemic Uber/Lyft trends in future work.
Observations about Data
What originally caught my eye about the dataset were the variables included. Since the focus of the analysis was comparing the leading rideshare players, it was extremely convenient that it specified which rides were Uber and which were Lyft. From a high level, the additional variables of interest included the date and time of the ride, ride source (pick-up location), destination (drop-off location), name of the ride type (e.g. UberPool, UberX, Lyft Lux), price of the ride, total distance, surge multiplier, and longitude/latitude.
To supplement these variables, the dataset also included detailed descriptors of the weather at the time of pick-up/drop-off. The first step of the data cleaning process was to discard these columns, as none of them correlated strongly with any of the variables of interest and, since the data was collected during winter in New England, there wasn't a lot of variation in weather.
Narrowing down the relevant columns proved to be the largest part of the cleaning, as the data was otherwise very usable: null values existed only in the 'price' column and were easy to handle. Since the price of the ride acted as the cornerstone variable of the analysis, rides without a price wouldn't offer much value.
Fortunately, due to the high number of observations (~700,000) relative to those with NA values (~60,000), I was able to completely discard those readings while maintaining equal representation and sufficient rides (~640,000 left).
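The cleaning steps described above can be sketched in a few lines of Python. This is a minimal, dependency-free illustration rather than the code used for the project; the column names (`cab_type`, `price`, `temperature`, `humidity`) and the toy rows are assumptions made for the example, and a dataframe library would do the same job more conveniently on the full dataset.

```python
# Hypothetical weather columns to discard; the real dataset's names may differ.
WEATHER_COLUMNS = {"temperature", "humidity"}


def clean_rides(rows):
    """Drop weather columns, then drop rides with no recorded price."""
    cleaned = []
    for row in rows:
        # Keep only the non-weather columns.
        kept = {k: v for k, v in row.items() if k not in WEATHER_COLUMNS}
        # Skip rides whose price is missing (None or empty string).
        if kept.get("price") in (None, ""):
            continue
        kept["price"] = float(kept["price"])
        cleaned.append(kept)
    return cleaned


# Toy rows for illustration only (not taken from the dataset).
rides = [
    {"cab_type": "Uber", "price": "9.5", "temperature": "38.1", "humidity": "0.7"},
    {"cab_type": "Lyft", "price": "", "temperature": "40.2", "humidity": "0.8"},
    {"cab_type": "Lyft", "price": "11.0", "temperature": "39.0", "humidity": "0.9"},
]
cleaned = clean_rides(rides)
print(len(rides) - len(cleaned), "ride(s) dropped for missing price")
```

Counting the remaining rows per `cab_type` afterwards is a quick way to confirm that dropping the null prices did not skew the Uber/Lyft balance.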
While contemplating the types of questions I'd be interested in exploring before settling on a dataset, I found they often included subjective comparison words such as 'best' or 'better'. After committing to the dataset, and given the variables included, I decided to use price as the primary indicator when considering one rideshare service over the other. The following questions then became the focus of the analysis:
- What is the overall distribution of the amount of rides per neighborhood (Uber vs. Lyft)?
- Which neighborhoods in Boston have the highest and lowest rideshare costs (Uber vs. Lyft)?
- What time of day are rides the most and least expensive (Uber vs. Lyft)?
- How do rideshare costs compare based on the day of the week (Uber vs. Lyft)?
- How do prices vary for the various types of ride offerings that Uber and Lyft offer?
Before diving into the findings of the analysis, I'd like to note that, besides price, other considerations that could differentiate one platform from another could include driver assignment time, wait times for pick-up, user-interface of corresponding app, etc.
With cost/price as the underlying measure for favoring one platform over the other, comparing the raw price of each Uber and Lyft ride was not sufficient, as the rides' distance and duration varied greatly. Without data documenting the duration (total time) of each ride, the logical approach was to create a variable, price per distance ($/mile), to evaluate the price of each ride regardless of the distance traveled. Price per Distance, or PPD, was the performance metric I used for the majority of my analysis.
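As a small illustration of the metric, PPD is simply the ride price divided by the ride distance. The sketch below is an assumption about how one might compute it, not the project's actual code; the function name and the toy values are hypothetical.

```python
def price_per_distance(price, distance_miles):
    """Return price per mile, or None when the distance is not positive."""
    if distance_miles <= 0:
        return None  # avoid dividing by zero on malformed rows
    return price / distance_miles


# Toy values for illustration only (not taken from the dataset).
short_ride = price_per_distance(price=9.0, distance_miles=1.5)   # 6.0 $/mile
long_ride = price_per_distance(price=18.0, distance_miles=4.5)   # 4.0 $/mile
print(short_ride, long_ride)
```

Here the longer ride costs twice as much in total but is cheaper per mile, which is exactly the kind of difference the raw price hides.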
Going into the analysis, the underlying assumption was that neither Uber nor Lyft specifically targeted rides based on overall distance or price.
The figures above show each of the rides given by both Uber (blue) and Lyft (purple) over the 22-day span. The two plots highlight the first inconsistency in the data. It quickly became clear that the Lyft rides had surge multipliers ranging from 1 to 2, whereas every Uber ride had a surge multiplier of 1.
An internet search covering the period in which the data was collected didn't yield any results suggesting Uber had temporarily suspended surge pricing, which left two hypotheses: (1) the total price of each Uber ride already accounted for surge pricing, or (2) the surge multipliers were gathered incorrectly and the corresponding total prices were lower than they would have been with correct surge pricing applied. For the rest of the analysis, I took the former as true (Figure 1).
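A check like the one that surfaced this inconsistency can be sketched as follows: group the rides by company and look at the range of surge multipliers in each group. The field names (`cab_type`, `surge_multiplier`) and the toy rows are assumptions for illustration.

```python
from collections import defaultdict


def surge_range_by_company(rides):
    """Map each company to the (min, max) surge multiplier seen in its rides."""
    values = defaultdict(list)
    for ride in rides:
        values[ride["cab_type"]].append(ride["surge_multiplier"])
    return {company: (min(v), max(v)) for company, v in values.items()}


# Toy rows for illustration only (not taken from the dataset).
rides = [
    {"cab_type": "Uber", "surge_multiplier": 1.0},
    {"cab_type": "Uber", "surge_multiplier": 1.0},
    {"cab_type": "Lyft", "surge_multiplier": 1.0},
    {"cab_type": "Lyft", "surge_multiplier": 1.5},
    {"cab_type": "Lyft", "surge_multiplier": 2.0},
]
ranges = surge_range_by_company(rides)
# A company whose minimum equals its maximum never shows a surge in the data.
flat = [company for company, (low, high) in ranges.items() if low == high]
print(ranges, flat)
```

A company that appears in `flat` is a signal to question how the surge column was collected before relying on it.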
To reinforce what can be gathered from Figure 1 from a pricing perspective, Figure 3 confirms that throughout the period of analysis, Uber on average had lower overall pricing (ignoring distance) in comparison to Lyft.
Taking a closer look at the central tendencies of ride distance for each platform, I found that the takeaways from Figure 5 seemingly contradict what can be observed from Figure 1, suggesting that Uber rides were on average shorter than Lyft rides. Plotting price against the distance of all Uber and Lyft rides, while very high-level, raises some important questions and considerations. Is Uber always a cheaper option when compared to Lyft? Does the unequal representation of Uber vs. Lyft in rides above 6 miles hold true city-wide? Country-wide? Even world-wide?
With a better understanding of both Lyft and Uber's pricing and distance metrics, my analysis transitioned to uncovering the factors that influence price per distance starting with time. Should one choose Uber vs. Lyft depending on the general time of the day (morning, afternoon, evening, night), hour of the day, or day of the week?
Figures 6 and 7 suggest that Uber and Lyft pricing per distance remains very similar across days of the week and general times of day, yielding nearly parallel plots. The two services diverge with respect to time only when rides are categorized by the specific hour of the day (Figure 8). One would imagine significant changes in prices across all players in the industry depending on the overall amount of movement.
For instance, during the week, you'd expect prices to spike during morning commutes to work and evening commutes back home as companies capitalize on the varying demand for rides. On weekend nights, it would be reasonable to expect prices to rise when crowds travel to dinner plans, stay moderately high throughout the night, and then rise again late at night as crowds return home.
The discrepancy in Figure 8 demonstrates a potential difference in hourly pricing between the two companies. Uber, with much more variance in its prices (~$1 between the average high and average low), aligns more closely with the expected demand trends, with pricing jumps at 5-6am (commute to work), 11am (lunch), 4pm and 6pm (commute home), and 9pm (return home from dinner/bar).
By strategically raising its prices during peak demand, Uber is able to undercut the competition during what it would consider the slow hours of the day. Conversely, Lyft's pricing strategy favors consistency, with less than 30 cents separating its highest and lowest hourly averages.
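The hourly comparison above boils down to averaging PPD per company and hour, then measuring the gap between each company's most and least expensive hour. Here is a minimal sketch under assumed field names (`cab_type`, `hour`, `price`, `distance`); the toy rows are illustrative and do not reproduce the figures discussed.

```python
from collections import defaultdict
from statistics import mean


def hourly_mean_ppd(rides):
    """Return {company: {hour: mean price per mile}}."""
    buckets = defaultdict(lambda: defaultdict(list))
    for ride in rides:
        ppd = ride["price"] / ride["distance"]
        buckets[ride["cab_type"]][ride["hour"]].append(ppd)
    return {
        company: {hour: mean(vals) for hour, vals in hours.items()}
        for company, hours in buckets.items()
    }


def hourly_spread(hourly_means):
    """Gap between the highest and lowest hourly average, per company."""
    return {c: max(h.values()) - min(h.values()) for c, h in hourly_means.items()}


# Toy rows for illustration only (not taken from the dataset).
rides = [
    {"cab_type": "Uber", "hour": 8, "price": 8.0, "distance": 2.0},
    {"cab_type": "Uber", "hour": 8, "price": 10.0, "distance": 2.0},
    {"cab_type": "Uber", "hour": 14, "price": 7.0, "distance": 2.0},
    {"cab_type": "Lyft", "hour": 8, "price": 7.0, "distance": 2.0},
    {"cab_type": "Lyft", "hour": 14, "price": 6.5, "distance": 2.0},
]
means = hourly_mean_ppd(rides)
spread = hourly_spread(means)
print(spread)
```

A larger spread indicates pricing that reacts more strongly to the hour of the day, while a smaller one indicates a flatter, more consistent fare.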
Transitioning away from time, the next thing I compared the average PPD against was location.
Similar to the issue I ran into earlier in the analysis concerning surge pricing, location presented its own problems, as the longitude and latitude coordinates provided for each ride (a) didn't indicate whether they referred to the pick-up or drop-off location and (b) were simply inaccurate. This became apparent after plugging a few into Google Maps and finding that they didn't match the ride source or destination associated with each ride.
To adjust, I accepted the separate pick-up and drop-off locations as true and replaced the provided longitude and latitude coordinates with those found on a supplementary GeoJSON file of Boston neighborhoods. Since the neighborhoods on the GeoJSON file slightly generalized the original list, certain neighborhoods on the original dataset were merged into others reducing the total count from 12 to 6 (Fenway, Back Bay, Beacon Hill, West End, North End, and Downtown).
Figures 9 and 10 above display the total number of rides per neighborhood for Uber and Lyft. Other than the disparity in the total number of rides across the two rideshare apps, the distribution is nearly identical. Rather than illuminating the differences between Uber and Lyft, the heat maps are valuable for showing the high and low ride densities throughout the major areas of the city, with Downtown having the most demand, followed by the Fenway region.
With very comparable price per distance metrics in the time analysis, going into the location portion of the project, I wasn't expecting any compelling results. Continuing with the theme of higher ride pricing for areas with a high ride demand, one would expect Downtown and Fenway to have the highest mean PPD values. Consistent for both Uber and Lyft prices, we discover that the projection is partially true.
Downtown is the clear leader, averaging roughly $12/mile, but the real inconsistency arises in the values found for rides originating in Fenway. Although Fenway has the second-most departing rides, its rides are clearly the cheapest for both Uber and Lyft, averaging roughly $7/mile.
On a separate note, I do think it's worth mentioning the differences between the two platforms in PPD by source. If the results showed consistency in pricing across all neighborhoods (i.e. one of Uber or Lyft being cheaper in every area), one could infer that location does not play a major role in predicting rideshare prices. Instead, we notice that Uber rides are on average cheaper in Back Bay, Beacon Hill, and Fenway, with Lyft claiming the North End and the West End. While the differences are minor, further investigation could reveal larger gaps.
Returning to the paradox between ride count and PPD, with Downtown having high PPD and high demand and Fenway having low PPD and high demand, the explanation most likely comes from the average ride distance by source in Figures 13 and 14 above. For both Uber and Lyft, rides from Fenway have the longest average distance of any neighborhood, supporting the idea that as the distance of a ride increases, the PPD decreases.
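One way to see why longer rides would tend to have a lower PPD: if a fare contains a fixed component plus a per-mile component, the fixed part gets spread over more miles as the trip gets longer. The sketch below assumes such a two-part fare purely for illustration; the base fare and per-mile rate are hypothetical values, not Uber's or Lyft's actual pricing.

```python
def ppd_under_two_part_fare(distance_miles, base_fare=5.0, per_mile=2.0):
    """Price per mile when fare = base_fare + per_mile * distance (assumed model)."""
    fare = base_fare + per_mile * distance_miles
    return fare / distance_miles


# Hypothetical distances in miles; PPD falls as the distance grows.
ppd_by_distance = {d: ppd_under_two_part_fare(d) for d in (1.0, 2.0, 5.0)}
print(ppd_by_distance)  # {1.0: 7.0, 2.0: 4.5, 5.0: 3.0}
```

Under this assumed model, PPD approaches the per-mile rate as distance grows, so a neighborhood with longer typical rides can show a low PPD even when demand is high.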
Due to the short timeline of the project, the depth of the analysis is limited, but I do imagine there are some valuable insights that could be applied immediately from the Boston rideshare user's angle and investigated further from the perspective of the leading rideshare players such as Lyft and Uber.
I can confidently recommend that the average rider avoid Uber during peak demand hours and opt for Lyft instead. Conversely, during non-peak hours, Uber emerges as the better option, sinking its prices below the more consistent fares from Lyft. Additionally, despite minor margins in PPD per pick-up neighborhood, choosing the cheaper option based on the above heat maps could result in savings of a couple of dollars.
From the corporate angle, stemming from the same observation cited for rideshare users, companies such as Lyft and Uber could utilize the patterns of relative lower pricing and implement them in an advertising campaign. I'd imagine Lyft would have the higher ground in this instance being able to publicize lower rates during peak hours.
The second idea stems from understanding high-demand areas with traditionally low pricing. Areas such as Fenway would be an ideal place to differentiate your service from your competitors, as the market there is large enough to benefit from. Loading the Fenway neighborhoods with additional drivers, ultimately cutting wait times, would be a great way to sway a customer base from one product to another without increasing pricing.
There are certainly many different ways to extend the surface-level insights gathered from this limited, outdated dataset. Given the current climate of the rideshare market outlined in the introduction, an analysis of trends has never been more relevant, with demand returning to pre-pandemic levels and a severe lack of drivers to capture the business. In my opinion, the best way forward would be to find a very recent (within the last year) existing dataset and/or scrape your own data.
Recency combined with a longer timeframe would result in relevant, reliable, and ultimately actionable insights. A major variable that I often found myself wishing was included in the data was time: both the duration of the trip and the wait time for a driver. Uber and Lyft keep their user and driver pricing structures close to the vest. While this analysis focused on price per distance, it is no secret that Uber and Lyft also pre-calculate their rates based on time to account for rides in busy urban areas where traffic plays a major role.
Wait time, especially now, would be very interesting to investigate, with many users settling on one platform over another strictly based on reliability, not price. Lastly, although surge was included in the dataset, I am extremely skeptical of the values provided. The underlying cost of Uber/Lyft rides can sometimes be dwarfed by extra charges that double, or even triple, the initially stated cost. Slight margins in PPD could mean next to nothing in comparison to surge prices, which I wouldn't be surprised to find are the be-all and end-all when analyzing user cost.