Popularity and Price evaluation of used cars using web scraping

Posted on Aug 28, 2017

 

CarMax is the largest used-car retailer in the United States, with stores across the country. Thousands of cars are listed at these stores, and they are bought and sold every day. This project is meant to help anyone who plans to buy a used car and wants a realistic price range for it based on the car's mileage and model year. The price of cars with the same characteristics can vary considerably from location to location. The project can also be used to see which cars are most available at a location and which features help customers decide on a car.

Data Collection

The data was collected from the CarMax website using Scrapy and Selenium. So far, only stores in the state of Virginia are covered. All the sedans at each Virginia location are scraped and stored as CSV files. Scrapy was used to loop over each location and each car at that location. Selenium was used to perform the clicks: to open the used cars at each location, to open Type and filter by Sedan, and to restrict the results to cars within a 25-mile radius. Below are the screens for each click.

1. Click "Used cars at this location" for each of the 10 stores in Virginia.

screen1

2. Click on Type and then Sedan to show only the sedans at the location.

screen2

3. Use the distance filter and choose 25 miles, so that only the cars at the store are listed.

screen3

Duplicate links to the same car are filtered out by Scrapy.
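Scrapy's scheduler drops repeated requests by default (unless a request is sent with dont_filter=True), which is what removes the duplicate car links. The same idea can be sketched in plain Python. This is only an illustration: the URLs below are hypothetical, and treating links that differ only in query string or trailing slash as the same car is an assumption, not something the original scraper is known to do.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_link(url: str) -> str:
    """Drop query string, fragment and trailing slash so equal pages compare equal."""
    parts = urlsplit(url.strip())
    path = parts.path.rstrip("/")
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, "", ""))


def unique_links(urls):
    """Return normalized links in first-seen order with duplicates removed."""
    seen = set()
    result = []
    for url in urls:
        key = normalize_link(url)
        if key not in seen:
            seen.add(key)
            result.append(key)
    return result


# Hypothetical links; the real URL pattern on the site may differ.
links = [
    "https://www.example.com/car/1001",
    "https://www.example.com/car/1001?ref=store-a",
    "https://www.example.com/car/1002/",
    "https://WWW.example.com/car/1002",
]
print(unique_links(links))
```

Keeping first-seen order means each car is attributed to the first store listing in which it appeared.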

Data Analysis

a. Most available used car in Virginia

The data shows that Honda is the most available used sedan make at CarMax across Virginia. Nissan is the 4th most available and Toyota the 7th. Model years 2012-2014 are also the most available on CarMax lots. One possible reason is that the warranty on these cars has run out, by mileage or by age.
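A ranking like this comes down to counting rows per make and per model year. Below is a minimal sketch, assuming each scraped car is a record with "make" and "year" fields. Those field names are hypothetical, and the sample rows are made up for illustration; they are not the scraped data.

```python
from collections import Counter

# Made-up sample rows for illustration only; not the scraped CarMax data.
# The field names "make" and "year" are assumptions.
cars = [
    {"make": "Honda", "year": 2013},
    {"make": "Honda", "year": 2012},
    {"make": "Honda", "year": 2013},
    {"make": "Nissan", "year": 2013},
    {"make": "Nissan", "year": 2014},
    {"make": "Toyota", "year": 2012},
]


def rank_by_count(rows, field):
    """Return (value, count) pairs sorted from most to least frequent."""
    return Counter(row[field] for row in rows).most_common()


print(rank_by_count(cars, "make"))
print(rank_by_count(cars, "year"))
```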

popular_brand

b. Most popular features which attract customers

The word cloud displays popular features that may attract customers beyond price and mileage. As seen in the word cloud, "Auxiliary Audio Input" and "Cruise Control" are among the prominent features, along with leather seats and a rear-view camera.
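A word cloud of multi-word features needs a frequency per feature phrase rather than per single word. The sketch below counts phrases, assuming each car's features were saved as one "|"-separated string; that storage format and the sample strings are assumptions made for illustration. The resulting mapping is the kind of input that, for example, the wordcloud package's WordCloud.generate_from_frequencies accepts. The post does not say which library was actually used.

```python
from collections import Counter


def feature_frequencies(feature_strings, sep="|"):
    """Count in how many cars each feature phrase appears."""
    counts = Counter()
    for raw in feature_strings:
        # A set ensures a feature listed twice for one car is counted once.
        features = {f.strip().title() for f in raw.split(sep) if f.strip()}
        counts.update(features)
    return counts


# Made-up feature strings, one per car; the "|" separator is an assumption.
sample = [
    "Cruise Control|Auxiliary Audio Input|Leather Seats",
    "cruise control | Rear View Camera",
    "Auxiliary Audio Input|Cruise Control|Cruise Control",
]
frequencies = feature_frequencies(sample)
print(frequencies.most_common(2))
```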

wordcloud

c.  Price range of sedans in Virginia

Most used sedans at CarMax across Virginia are priced between $10K and $20K.
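One way to arrive at a statement like this is to place each price into a fixed-width bucket and count the cars per bucket. The sketch below uses $10,000-wide buckets; the bucket width and the sample prices are illustrative assumptions, not values taken from the scraped data.

```python
from collections import Counter


def price_bucket(price, width=10_000):
    """Label the half-open bucket [low, low + width) that contains price."""
    low = int(price // width) * width
    return f"${low // 1000}K-${(low + width) // 1000}K"


def bucket_counts(prices, width=10_000):
    """Count how many prices fall into each bucket."""
    return Counter(price_bucket(p, width) for p in prices)


# Made-up prices in dollars, for illustration only.
prices = [9_500, 12_998, 15_998, 19_599, 20_000, 27_998]
print(bucket_counts(prices).most_common())
```

Because the buckets are half-open, a car priced at exactly $20,000 falls into the $20K-$30K bucket.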

price_range

d. Price range for used sedan based on the year

As seen in the plot, the price range is wider for 2013 and 2016 used sedans. This wide range may be due to more luxury cars from those years being available. As luxury cars have only 4 years of warranty, 2013 cars may be traded in more often than others, which widens the gap between the lowest and highest prices.
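The range per year is the minimum and maximum price among the cars of that model year. A minimal sketch of the grouping follows, with hypothetical field names ("year", "price") and made-up rows that are not the scraped data.

```python
from collections import defaultdict


def price_range_by_year(rows):
    """Return min, max and spread of price for each model year."""
    prices = defaultdict(list)
    for row in rows:
        prices[row["year"]].append(row["price"])
    return {
        year: {"min": min(p), "max": max(p), "spread": max(p) - min(p)}
        for year, p in sorted(prices.items())
    }


# Made-up rows for illustration; field names are assumptions.
rows = [
    {"year": 2013, "price": 12_000},
    {"year": 2013, "price": 30_000},
    {"year": 2014, "price": 14_000},
    {"year": 2014, "price": 16_000},
]
ranges = price_range_by_year(rows)
print(ranges)
```

A single expensive car is enough to stretch the spread for a year, so a wide range does not by itself show that most cars of that year are expensive.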

max_min_price

e.  Heatmap to display price range

The heat map displays the average price of each make at each location. The year of the car, which can significantly affect the price, is not taken into account here. Even so, the average price of makes such as Audi varies across locations, for example between Gaithersburg and other locations, while makes such as Mazda have roughly the same average price at all locations.
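Each cell of such a heat map is the mean price of one make at one location. The sketch below computes those cell values, assuming records with "make", "location" and "price" fields. The field names, store labels and prices are all made up for illustration.

```python
from collections import defaultdict
from statistics import mean


def average_price_grid(rows):
    """Map (make, location) to the mean price of the matching cars."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[(row["make"], row["location"])].append(row["price"])
    return {key: mean(prices) for key, prices in grouped.items()}


# Made-up rows; "Store A" and "Store B" are placeholder location names.
rows = [
    {"make": "Audi", "location": "Store A", "price": 20_000},
    {"make": "Audi", "location": "Store A", "price": 30_000},
    {"make": "Audi", "location": "Store B", "price": 35_000},
    {"make": "Mazda", "location": "Store A", "price": 15_000},
    {"make": "Mazda", "location": "Store B", "price": 15_500},
]
grid = average_price_grid(rows)
print(grid[("Audi", "Store A")], grid[("Audi", "Store B")])
```

Adding the model year to the grouping key would be one way to control for the year effect that the averages above ignore.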

heatmap

f. Scatter plot for price vs mileage based on year of car

scatterplot_price

 

About Author


Annie George

Annie George has more than a decade of experience using mainframe technology and databases such as DB2 and SQL Server to achieve results for organizations in the private sector. Annie completed her bachelor's degree in Civil Engineering but she found...

