Machine learning Uber vs. Lyft price prediction modeling

Posted on Apr 11, 2023

Introduction

Ever wonder why an Uber ride can cost less than a Lyft ride over the same distance, say from the airport to your home? Several factors determine the price of a ride. Beyond supply and demand, the time of day and weather conditions can push up the cost. What, ultimately, defines the price of your ride?

I analyzed Uber and Lyft rides in Boston, MA, using a dataset of 693,071 rows and 57 features collected from late November through mid-December 2018. In the analysis, I predicted and compared Uber and Lyft prices based on predictors such as distance, hour of the day, day of the month, day of the week, surge multiplier (demand-based pricing), and weather features.


Project description

This project analyzes and predicts Uber and Lyft ride prices, applying data science and machine learning techniques to a real-world problem.

Source of the data: https://www.kaggle.com/datasets/brllrb/uber-and-lyft-dataset-boston-ma

Because no entity shares ride or price data publicly, this dataset was collected by querying the Uber and Lyft APIs at a few busy locations in Boston and pairing the results with the corresponding weather conditions. The queries ran every 5 minutes for 22 days from late November through mid-December 2018.

I was inspired to work on this project because of its relevance to almost everyone, especially in big cities. People often use taxi services, primarily Uber and Lyft, to get around the city, either because they don't own a car or because parking makes driving themselves too costly. Understanding how these pricing models work and vary by circumstance is practical: it gives us better insight into which service to use.

 

Target audience

The target audience is broad, because almost everyone is a taxi customer at one time or another.

 

Dataset

We use the Kaggle dataset rideshare_kaggle.csv, which has 693,071 records and 57 attributes. You can download it from Kaggle.

Link for the dataset: https://www.kaggle.com/datasets/brllrb/uber-and-lyft-dataset-boston-ma

This dataset is a sample dataset for Uber & Lyft rides in Boston, MA. The rideshare data covers various types of cab rides for Uber & Lyft and their price for the given location. We can find out if there was any surge in the price during that time. The dataset contains the corresponding weather data features for that hour including a short and long summary of the weather, for all the locations considered.

 

Data difficulties

The dataset has its limits, and its questionable quality can affect the results of my analysis.

The daily data has many gaps: observations were recorded on only 17 days across November and December.

Observations are spread almost equally across the levels of variables such as Product ID, Source, and Destination.

Each source (Back Bay, Beacon Hill, Boston University, etc.) has about 53 thousand observations.

Similarly, each destination (Back Bay, Beacon Hill, Boston University, etc.) has about 50 thousand observations.

The price variable has 55,095 missing values.

All Uber rides have a surge_multiplier of 1, while the Lyft surge_multiplier ranges from 1 to 3. It is hard to believe that Uber applied no surge pricing during busy hours. I removed surge_multiplier as a near-zero-variance variable.
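
As an illustration of what a near-zero-variance check can look like, here is a minimal sketch in Python (the project itself was done in R). It flags a column when one value dominates and there are few distinct values. The function name, the thresholds, and the toy columns are assumptions made for this example; the post does not state which rule was actually applied.

```python
from collections import Counter


def near_zero_variance(values, freq_cut=95 / 5, unique_cut=10.0):
    """Flag a column whose values are dominated by a single level.

    freq_cut and unique_cut are illustrative thresholds (assumptions),
    not values taken from the original analysis.
    """
    counts = Counter(values).most_common()
    if len(counts) < 2:
        return True  # a constant (or empty) column carries no variance
    # Ratio of the most common value's count to the second most common.
    freq_ratio = counts[0][1] / counts[1][1]
    # Share of distinct values among all observations, in percent.
    percent_unique = 100.0 * len(counts) / len(values)
    return freq_ratio > freq_cut and percent_unique < unique_cut


# Made-up toy columns, not the real rideshare data.
uber_surge = [1.0] * 1000
lyft_surge = [1.0] * 970 + [1.25] * 20 + [1.5] * 10
distance = [0.5 + 0.01 * i for i in range(1000)]

flags = {
    "uber_surge": near_zero_variance(uber_surge),
    "lyft_surge": near_zero_variance(lyft_surge),
    "distance": near_zero_variance(distance),
}
print(flags)
```

In this toy setup both surge columns are flagged, while the distance column, where every value is distinct, is not.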

 

All the inconsistencies mentioned above cast doubt on the completeness of the data.

 

Research question

  • What defines the cab price of Uber vs. Lyft?

 

Completed steps

  • Preprocess, clean, and create new variables
  • EDA of the Kaggle rideshare data set, using R
  • Visualization of data analysis using tidyverse, ggplot2, dplyr, car, readr, mlbench, etc.

 

Exploratory data analysis

Rows with the cab type Taxi had no associated prices. After removing missing values, 637,976 rides remain. The analyzed period covers 17 days between November and December 2018. The average ride price is $16.55, the average distance is 2.189 miles, and the average temperature is 39.58 °F.

In this dataset, Uber has more rides than Lyft. Just over half (51.82%) of the recorded rides were for Uber, and 48.18% were for Lyft. The difference is small; each cab type has roughly 300,000 observations.

We can see that Lyft prices are slightly higher than Uber prices. The average Uber price is $15.80, while the average Lyft price is $17.35.
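
As a quick consistency check, the per-service averages and ride shares reported here should reproduce the overall average price of $16.55. The short Python sketch below uses only figures quoted in this post; the variable names are my own.

```python
# Shares of rides and average prices as reported in this post.
uber_share, lyft_share = 0.5182, 0.4818
uber_mean, lyft_mean = 15.80, 17.35

# The overall mean is the share-weighted mean of the two group means.
overall_mean = uber_share * uber_mean + lyft_share * lyft_mean
print(round(overall_mean, 2))
```

The weighted mean comes out at about 16.55, in line with the overall average reported above.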

Also, we can see that Uber rides are longer on average than Lyft rides.

The date columns contain composite information such as day, day of the week, month, and time. Extracting these components gives us more granular information to explore. Rides were recorded in almost every hour of the day, and Uber has more bookings than Lyft throughout.
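
Splitting a timestamp into its components can be sketched as follows in Python (the project did this step in R). The timestamp format "YYYY-MM-DD HH:MM:SS", the function name, and the example value are assumptions for illustration only.

```python
from datetime import datetime

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]


def extract_time_features(timestamp_text):
    """Split a timestamp string into month, day, hour, and weekday name."""
    # Assumed format; the real column may be formatted differently.
    parsed = datetime.strptime(timestamp_text, "%Y-%m-%d %H:%M:%S")
    return {
        "month": parsed.month,
        "day": parsed.day,
        "hour": parsed.hour,
        # datetime.weekday() returns 0 for Monday through 6 for Sunday.
        "weekday": WEEKDAYS[parsed.weekday()],
    }


# Hypothetical timestamp inside the period covered by the dataset.
example = extract_time_features("2018-12-16 09:30:07")
print(example)
```

Each of the extracted fields can then serve as its own predictor, which is how variables like Hour and Day of the week enter the analysis.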

The number of rides differs only slightly between sources and destinations. Every pickup point accounts for more than 8 percent of total rides.

After removing variables that were unusable or irrelevant to my analysis (DateTime, product_id, etc.) and creating a few new ones (Hour, Day of the week), the dataset contains 16 variables and 637,976 observations.

From a correlogram of the dataset we can see that:

  • Highly correlated pairs: Humidity and Visibility (-0.7), PrecipIntensity and Visibility (-0.6), Day and Month (-0.9)
  • Price correlates with Distance (0.3) and with Surge Multiplier (0.2)
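
Each cell of a correlogram is a pairwise correlation coefficient. Assuming the Pearson coefficient (the post does not say which coefficient was used), it can be computed as in this minimal Python sketch. The toy distance and price values are made up for the example and are not taken from the dataset.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-deviations and sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Made-up toy values, not the real rideshare data.
distance = [0.5, 1.0, 1.5, 2.0, 3.0, 4.5]
price = [7.0, 9.5, 8.0, 13.5, 16.0, 22.5]

r = pearson(distance, price)
print(round(r, 3))
```

Values near 1 or -1 indicate a strong linear relationship, and values near 0 a weak one.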

Modeling

My goal was to fit a model that predicts ride prices for Uber and Lyft. Because of memory limits and to keep processing time in R manageable, I randomly sampled 2,000 observations from the dataset. After fitting several models, including LM, GLM, GLMNET, CART, SVM, KNN, Cubist, and GBM, and comparing them on the main performance metrics, I determined that Cubist is the best model for Uber and Lyft price prediction.

Uber:

  • RMSE (train 2.153322, test 1.879026)
  • R-squared (train 0.9385098, test 0.9519049)

Lyft:

  • RMSE (train 2.952509, test 3.135741)
  • R-squared (train 0.912892, test 0.909245)

On these samples, Cubist performed better for Uber than for Lyft.
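
For readers less familiar with the two metrics, the Python sketch below shows one common way to compute them. It treats R-squared as the coefficient of determination (1 minus the ratio of residual to total sum of squares). Some toolkits report the squared correlation between observed and predicted values instead, and the post does not say which definition produced the figures above. The toy prices are made up for the example.

```python
from math import sqrt


def rmse(actual, predicted):
    """Root mean squared error: typical size of a prediction error."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return sqrt(sum(squared_errors) / len(actual))


def r_squared(actual, predicted):
    """Coefficient of determination: share of variance explained."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


# Made-up toy prices, not model output from the project.
actual = [10.0, 12.0, 15.0, 20.0]
predicted = [11.0, 11.0, 16.0, 19.0]

toy_rmse = rmse(actual, predicted)
toy_r2 = r_squared(actual, predicted)
print(toy_rmse, round(toy_r2, 4))
```

RMSE is expressed in the same units as the price, so the Uber test RMSE of about 1.88 means predictions were typically off by a little under two dollars.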

 

The Cubist model identified several important price features: Distance, Product, Hour, Day, Source, Destination, and Cloudy Weather.

The next steps for this machine learning model project can be:

  • Develop a Shiny app for the EDA part of the project to make it user-friendly.
  • Refine the model to improve accuracy, possibly by considering interactions between variables and by model stacking.
  • Incorporate additional features, for example, traffic conditions.

 

Suggestions for taxi customers

When choosing a ride from Uber or Lyft, consider the most important drivers of cab price:

Taxi customers pay more for higher-quality products. Distance is the primary driver of the ride price.

Source and Destination matter too. Cloudy weather also appears to be an important driver of price.

Day and Hour matter as well. Demand for taxis is higher than average during holidays and on Mondays. The peak hours are between 6 pm and 8 pm, driven by the post-work crowd.

Based on this dataset, Uber prices are slightly lower than Lyft prices for Lux products, but Lyft's Shared and SUV products are cheaper than the comparable Uber products.

 

About Author

Diana Dent

A Data Scientist with 15 years of business experience in financial and project management. A proactive, detail-oriented data enthusiast with exceptional problem-solving skills. Interested in contributing SQL, Python, Excel, and Tableau mastery paired with advanced data analytics, machine...