Uber vs. Lyft price prediction with machine learning
Introduction
Ever wonder why an Uber ride home costs less than a Lyft ride over the same distance? Many factors determine the price of a ride. Beyond supply and demand, the time of day and weather conditions can bump up the cost. What ultimately defines the price of your ride?
I analyzed Uber and Lyft rides in Boston, MA, using a dataset of 693,071 rows and 57 features collected from late November through mid-December 2018. I predicted and compared Uber and Lyft ride prices based on predictors such as distance, hour of the day, day of the month, day of the week, surge multiplier (demand-based pricing), and weather features.
Project description
This project analyzes and predicts Uber and Lyft ride prices, applying data science and machine learning techniques to a real-world problem.
Source of the data: https://www.kaggle.com/datasets/brllrb/uber-and-lyft-dataset-boston-ma
Because no entity shares ride and price data publicly, this dataset was built from real-time Uber and Lyft API queries at a few popular locations in Boston, together with the corresponding weather conditions. The apps were queried every 5 minutes for 22 days from late November through mid-December 2018.
I was inspired to work on this project because of its relevance to almost everyone, especially in big cities. People often use taxi services, primarily Uber and Lyft, to get around the city, either because they don't own a car or because parking makes driving themselves too costly. Understanding how these pricing models work and vary by circumstance is practical: it gives us better insight into which service to use.
Target audience
The target audience is everyone, since almost everyone is a taxi customer at one time or another.
Dataset
We use the Kaggle dataset rideshare_kaggle.csv, which has 693,071 records and 57 attributes. You can download it from Kaggle.
Link for the dataset: https://www.kaggle.com/datasets/brllrb/uber-and-lyft-dataset-boston-ma
This is a sample dataset of Uber and Lyft rides in Boston, MA. It covers the various types of cab rides offered by both services and their prices for a given location, and it shows whether surge pricing applied at the time. It also contains the corresponding weather features for that hour, including a short and a long weather summary, for all the locations considered.
Data difficulties
The dataset has its limits, and its questionable quality can affect the results of my analysis.
The daily data has many gaps: only 17 days in November and December are recorded.
Observations are spread almost equally across the values of Product ID, Source, and Destination.
Each source (Back Bay, Beacon Hill, Boston University, etc.) has about 53 thousand records.
Similarly, each destination (Back Bay, Beacon Hill, Boston University, etc.) has about 50 thousand records.
We have 55,095 missing values pertaining to the variable price.
All Uber rides have a surge_multiplier equal to 1, while the Lyft surge_multiplier ranges from 1 to 3. It is hard to believe that Uber had no surge-priced rides during busy hours. The surge_multiplier variable was removed because its variance was almost zero.
All those inconsistencies mentioned above create doubt about data completeness.
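The checks above (counting missing prices and spotting a variable that never changes) can be sketched in a few lines. The original analysis was done in R; the snippet below uses plain Python purely for illustration, and the five rows, the helper names count_missing and variance_by_group, and the column layout are made-up assumptions, not the real data.

```python
from statistics import pvariance

# Hypothetical miniature of the ride table (the real file is rideshare_kaggle.csv).
rides = [
    {"cab_type": "Uber", "price": 9.5, "surge_multiplier": 1.0},
    {"cab_type": "Uber", "price": None, "surge_multiplier": 1.0},
    {"cab_type": "Lyft", "price": 22.5, "surge_multiplier": 1.5},
    {"cab_type": "Lyft", "price": 11.0, "surge_multiplier": 1.0},
    {"cab_type": "Uber", "price": 16.0, "surge_multiplier": 1.0},
]


def count_missing(rows, column):
    """Count the rows whose value in `column` is missing (None)."""
    return sum(1 for row in rows if row[column] is None)


def variance_by_group(rows, group_col, value_col):
    """Population variance of `value_col` within each value of `group_col`."""
    groups = {}
    for row in rows:
        groups.setdefault(row[group_col], []).append(row[value_col])
    return {name: pvariance(values) for name, values in groups.items()}


missing_prices = count_missing(rides, "price")
surge_variance = variance_by_group(rides, "cab_type", "surge_multiplier")

print("missing prices:", missing_prices)
print("surge variance by cab type:", surge_variance)
```

A variance of zero within a group, as for the toy Uber rows here, is the kind of signal that would flag a column as uninformative for that group.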
Research question
- What defines the cab price of Uber vs. Lyft?
Completed steps
- Preprocess/ Clean/ Create new variables
- EDA of the Kaggle rideshare data set, using R
- Visualization of data analysis using tidyverse, ggplot2, dplyr, car, readr, mlbench, etc.
Exploratory data analysis
Rows with the cab type Taxi had no associated prices. After removing missing values, 637,976 rides remain. The analyzed period covers 17 days between November and December 2018. The average ride price is $16.55, the average distance is 2.189 miles, and the average temperature is 39.58 F.
In this dataset, Uber has more rides than Lyft: 51.82% of the recorded rides were Uber and 48.18% were Lyft. The difference is small; each cab type has about 300,000 records.
We can see that Lyft prices are slightly higher than Uber prices. The average Uber price is $15.80, while the average Lyft price is $17.35.
Also, we can see that Uber rides are longer on average than Lyft rides.
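A per-service comparison like the one above comes down to a grouped mean. The sketch below shows the idea in plain Python (the project itself used R); the four rows and the helper name mean_by_group are illustrative assumptions and do not reproduce the reported averages.

```python
from collections import defaultdict

# Hypothetical rows standing in for the cleaned ride table.
rides = [
    {"cab_type": "Uber", "price": 12.0, "distance": 2.5},
    {"cab_type": "Uber", "price": 18.0, "distance": 3.5},
    {"cab_type": "Lyft", "price": 16.0, "distance": 2.0},
    {"cab_type": "Lyft", "price": 20.0, "distance": 3.0},
]


def mean_by_group(rows, group_col, value_col):
    """Average `value_col` separately for each value of `group_col`."""
    totals = defaultdict(lambda: [0.0, 0])  # group -> [running sum, row count]
    for row in rows:
        bucket = totals[row[group_col]]
        bucket[0] += row[value_col]
        bucket[1] += 1
    return {name: total / count for name, (total, count) in totals.items()}


avg_price = mean_by_group(rides, "cab_type", "price")
avg_distance = mean_by_group(rides, "cab_type", "distance")

print("average price:", avg_price)
print("average distance:", avg_distance)
```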
The date columns contain composite information such as day, day of the week, month, and time. Extracting these parts gives us more granular information to explore. The records cover almost every hour of the day, and Uber accounts for the larger share of rides.
The number of rides differs only slightly between sources and destinations. Each pickup point accounts for more than 8 percent of the total rides.
After removing variables that were unusable or irrelevant to my analysis (DateTime, product_id, etc.) and creating a few new ones (Hour, Day of the week), the dataset contains 16 variables and 637,976 observations.
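Deriving variables such as Hour and Day of the week from a timestamp can be sketched as follows. This is a Python illustration rather than the R code used in the project; the 'YYYY-MM-DD HH:MM:SS' format and the function name time_features are assumptions made for the example.

```python
from datetime import datetime

# datetime.weekday() returns 0 for Monday through 6 for Sunday.
WEEKDAY_NAMES = [
    "Monday", "Tuesday", "Wednesday", "Thursday",
    "Friday", "Saturday", "Sunday",
]


def time_features(timestamp_text):
    """Split a 'YYYY-MM-DD HH:MM:SS' string (assumed format) into parts."""
    moment = datetime.strptime(timestamp_text, "%Y-%m-%d %H:%M:%S")
    return {
        "hour": moment.hour,
        "day": moment.day,
        "month": moment.month,
        "weekday": WEEKDAY_NAMES[moment.weekday()],
    }


features = time_features("2018-12-16 09:30:07")
print(features)
```

Once the parts are extracted, the original composite timestamp column can be dropped, which matches the cleanup step described above.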
From a correlogram of the dataset we can see that:
- Strongly correlated pairs: humidity and visibility (-0.7), PrecipIntensity and visibility (-0.6), day and month (-0.9)
- Price correlates with distance (0.3) and with surge multiplier (0.2).
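Each cell of a correlogram is a correlation coefficient between two variables. Assuming the usual Pearson coefficient (the write-up does not say which one was used), a single cell can be computed as in this Python sketch; the distance and price values are invented for illustration.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / sqrt(spread_x * spread_y)


# Hypothetical values, not taken from the dataset.
distance = [0.5, 1.2, 2.0, 3.1, 4.4]
price = [7.0, 9.5, 16.0, 13.5, 27.0]

r = pearson(distance, price)
print(round(r, 3))
```

Values near 1 or -1 indicate a strong linear relationship, and values near 0 a weak one, which is how the coefficients listed above should be read.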
Modeling
My goal was to fit a model that predicts Uber and Lyft ride prices. Because of memory limits and to keep processing time in R manageable, I randomly sampled 2,000 observations from the dataset. After fitting several models (LM, GLM, GLMNET, CART, SVM, KNN, Cubist, and GBM) and comparing them on the main evaluation metrics, I determined that Cubist is the best model for Uber and Lyft price prediction.
Uber:
- RMSE (train 2.153322, test 1.879026)
- R-squared (train 0.9385098, test 0.9519049)
Lyft:
- RMSE (train 2.952509, test 3.135741)
- R-squared (train 0.912892, test 0.909245)
Cubist performed better on the Uber sample than on the Lyft sample.
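For readers unfamiliar with the two metrics reported above, here is a minimal Python sketch of how RMSE and R-squared can be computed from actual and predicted prices. The numbers are invented, and R-squared is taken here as 1 - SS_res / SS_tot; that is an assumption, since some R tooling reports the squared correlation between predictions and observations instead, so the values may not be directly comparable.

```python
from math import sqrt


def rmse(actual, predicted):
    """Root mean squared error: typical size of a prediction error."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return sqrt(sum(squared_errors) / len(squared_errors))


def r_squared(actual, predicted):
    """Share of variance explained, defined as 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


# Hypothetical prices in dollars, not model output from the project.
actual = [10.0, 12.0, 20.0, 30.0]
predicted = [11.0, 11.0, 22.0, 28.0]

rmse_value = rmse(actual, predicted)
r2_value = r_squared(actual, predicted)
print(rmse_value, r2_value)
```

RMSE is expressed in the same unit as the price, so a lower value means predictions land closer to the real fare, while an R-squared closer to 1 means more of the price variation is explained.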
The Cubist model identified Distance, Product, Hour, Day, Source, Destination, and Cloudy Weather as important price features.
The next steps for this machine learning model project can be:
- Develop a Shiny app for the EDA part of the project to make exploring it fast and user-friendly.
- Refine the model to improve accuracy, possibly by considering interactions between the variables and by stacking models.
- Incorporate additional features, for example, traffic conditions.
Suggestions for taxi customers
When choosing between an Uber and a Lyft ride, consider the most important drivers of cab price:
Distance is the primary driver of the ride price. Customers also pay more for higher-quality products.
Source and destination matter too, and cloudy weather also appears to be an important driver of price.
Day and hour matter as well. Demand is higher than average during holidays and on Mondays, and peak hours are between 6 pm and 8 pm, when the post-work crowd travels.
In this dataset, Uber prices are slightly lower than Lyft prices for Lux products, but Lyft's Shared and SUV products are cheaper than the comparable Uber products.