How Much is My Used Car Worth?

Posted on May 21, 2018

Maybe you've tried to sell a car, only to find that you couldn't get nearly as much money as you thought! Perhaps you have tried to buy a used car, only to find that it cost much more than you could have imagined! Or you purchased a used car for a reasonable price only to find that the car had unexpected problems that were invisible on the surface!

In this web scraping project, I set out to see which factors affect the resale price of a used car, and by how much.

To do this, I web scraped Carfax:

Carfax is a website well known for checking the history and status of used cars, which helps protect buyers from being sold a car with problems they are unaware of.

Using Selenium, I extracted the URL of each car's webpage, applying the following search filters:

- Sold within a 50-mile radius of New York City

- Under $15,000
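The URL-collection step might look roughly like the sketch below. It is not the code used for this project: the function names, the search URL, and the CSS selector are hypothetical placeholders, and the two filters above are assumed to be already encoded in the search URL.

```python
from urllib.parse import urljoin

# Selenium's By.CSS_SELECTOR constant is the string "css selector";
# using the string directly keeps this helper importable without Selenium.
CSS_SELECTOR = "css selector"


def collect_listing_urls(driver, base_url, link_selector):
    """Return the unique listing URLs found on the page the driver is showing.

    `driver` is expected to behave like a Selenium WebDriver, and
    `link_selector` is a hypothetical CSS selector matching listing links.
    """
    urls = []
    seen = set()
    for element in driver.find_elements(CSS_SELECTOR, link_selector):
        href = element.get_attribute("href")
        if not href:
            continue  # skip anchors without a link target
        absolute = urljoin(base_url, href)
        if absolute not in seen:
            seen.add(absolute)
            urls.append(absolute)
    return urls


def scrape_search_page(search_url, link_selector):
    """Open one search-results page in Chrome and collect its listing URLs."""
    from selenium import webdriver  # imported lazily; assumes Selenium 4

    driver = webdriver.Chrome()
    try:
        driver.get(search_url)
        return collect_listing_urls(driver, search_url, link_selector)
    finally:
        driver.quit()
```

Keeping the link extraction separate from the browser setup makes it possible to check the de-duplication logic without launching a browser.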

Once all the URLs were collected, I extracted the detailed information for each car using Scrapy. Here is a sample webpage:

The prices collected are based on dealer selling prices, not the current market value.
Due to time constraints, I collected data for 6,747 used cars. The data was saved to a CSV file:

I ran several analyses of the data that was collected. First, a scatter plot of price vs. year:
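As explained in the comments at the end of this post, this scatter plot is a jitter plot: each point is nudged slightly left or right of its model year so the dots do not stack in vertical lines. A minimal sketch of that idea, using made-up model years and a hypothetical `jitter_years` helper rather than the project's plotting code:

```python
import random


def jitter_years(years, width=0.3, seed=0):
    """Shift each year by a random offset in [-width, +width]."""
    rng = random.Random(seed)  # fixed seed so the layout is reproducible
    return [year + rng.uniform(-width, width) for year in years]


# Made-up model years, for illustration only.
years = [2013, 2013, 2014, 2015, 2015, 2015]
x_positions = jitter_years(years)

# Each jittered x stays within `width` of its true year, so the
# points still read as belonging to that year on the plot.
for year, x in zip(years, x_positions):
    print(year, round(x, 3))
```

The jittered values would be used as x coordinates in place of the raw years; the prices on the y axis stay untouched.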

Boxplot of Price vs. Year:

Barchart of Price vs. Year:

I was curious why 2015 did not follow the trend of prices decreasing with the age of the car. I noticed something interesting when looking at the number of listings per year.

Because many cars are leased for 3 years and returned, the number of cars that are 3 years old being sold is significantly higher! This could be a factor as to why the sale price in 2015 was higher than in other years.
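A comparison like this can be tabulated by counting listings and averaging price per model year. The sketch below uses a handful of invented rows and hypothetical field names (`year`, `price`), not the scraped data:

```python
from collections import defaultdict
from statistics import mean

# Invented rows for illustration; the real CSV columns may be named differently.
listings = [
    {"year": 2014, "price": 9500},
    {"year": 2015, "price": 12800},
    {"year": 2015, "price": 13400},
    {"year": 2015, "price": 11900},
    {"year": 2016, "price": 12500},
]


def summarize_by_year(rows):
    """Return {year: (listing_count, mean_price)} ordered by year."""
    prices_by_year = defaultdict(list)
    for row in rows:
        prices_by_year[row["year"]].append(row["price"])
    return {
        year: (len(prices), mean(prices))
        for year, prices in sorted(prices_by_year.items())
    }


summary = summarize_by_year(listings)
for year, (count, avg_price) in summary.items():
    print(f"{year}: {count} listings, mean price ${avg_price:,.0f}")
```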

Another large factor in the resale price of a car is its mileage. The following is a histogram of the number of used cars sold by mileage:

The density plot:

The values for Mean and Median:

A hex chart of price vs. mileage:

A density plot of price vs. mileage:

When comparing by make, note that the under-$15,000 filter could skew the data, for luxury brands in particular. With that caveat, the following shows a general view of resale prices categorized by make:

The following is price categorized by the body style:

For those who are green, here is a breakdown of the resale price by energy source. (Note: there was only one result for "Alternative" energy source, so the data may not accurately reflect the category.)

Breakdown by Transmission:

I compared the automatic vs. manual transmission purchases with a two-sample t-test to see if the means were statistically different:

They were statistically different! A car with an automatic transmission sells for approximately $1300 more than a car with a manual transmission.
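For readers who want to reproduce this kind of comparison, here is a minimal sketch of a two-sample t-test in its Welch (unequal-variance) form. The post does not say whether the pooled or the Welch variant was used, so that choice is an assumption, and the prices below are invented. The function returns only the t statistic and degrees of freedom; getting a p-value requires a t-distribution, for example via `scipy.stats.ttest_ind(a, b, equal_var=False)`.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(sample_a, sample_b):
    """Return (t statistic, degrees of freedom) for Welch's two-sample t-test."""
    n_a, n_b = len(sample_a), len(sample_b)
    # Squared standard error contributed by each group (sample variance / n).
    se2_a = variance(sample_a) / n_a
    se2_b = variance(sample_b) / n_b
    t_stat = (mean(sample_a) - mean(sample_b)) / sqrt(se2_a + se2_b)
    # Welch-Satterthwaite approximation for the degrees of freedom.
    dof = (se2_a + se2_b) ** 2 / (
        se2_a ** 2 / (n_a - 1) + se2_b ** 2 / (n_b - 1)
    )
    return t_stat, dof


# Invented prices in dollars, for illustration only.
automatic = [11200, 12500, 10900, 13100, 11800]
manual = [9800, 10400, 11000, 9900, 10600]

t_stat, dof = welch_t(automatic, manual)
print(f"t = {t_stat:.2f}, df = {dof:.1f}")
```

A large absolute t value relative to the degrees of freedom indicates that the difference in means is unlikely to be due to chance alone.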

Many people are concerned with the title status of a used car. Here is the breakdown by title status:

An interesting comparison was between resale values of cars that had accidents, and cars that did not:

It is clear that No Accidents Reported had a significant impact on the resale price:

Ultimately the two-sample t-test showed that the means of these categories were different. The mean difference was $1150.

Finally, based on the data, here is a breakdown of the models with the top resale value:



1. An automatic transmission resells at about $1300 more than a manual transmission.

2. A used car without accidents reported resells at about $1150 more than a car with accidents reported.

3. Sedans resell better than other body types of used cars. (By observation, t-test not performed.)


These results are useful not only for those interested in purchasing a used car, but also for those:

- Considering purchasing a new car.
- Deciding whether to buy or lease a new car.
- Deciding whether to sell a currently owned used car or keep it.

About Author

Anthony Parrillo

A passionate, intuitive problem solver using critical thinking and creative strategies with data to find meaningful insights to deliver practical, profitable results.

Leave a Comment

Anthony Parrillo June 7, 2018
Thank you for your comment and question Bernardo. Nice graph also! If the purpose of the graph is to display exact precision of year vs. price, then you are correct. My purpose in creating this "jitter" plot was to allow the viewer to more naturally see the overall upward trend between year and price. A jitter plot takes plots for the years (ordinal data) and moves them slightly to the left or right so that all the dots are not lined up vertically. Another purpose in using the jitter plot was to represent the data in a way different from the following graphs which show the data aggregated by year, which results in a more discrete visualization (e.g. boxplot, bargraph).
Bernardo Lares June 7, 2018
Thanks for sharing... quite interesting. But why, in the scatter plot of price vs. year, do you have points everywhere instead of only at the years? Shouldn't you get something like ?
