Popularity and Price Data evaluation of used cars

Posted on Aug 28, 2017
The skills the author demoed here can be learned through taking Data Science with Machine Learning bootcamp with NYC Data Science Academy.

CarMax is the largest used-car retailer in the United States, with stores across the country. Each of these stores holds thousands of cars, which buyers and sellers trade every day. This study is meant to help anyone who plans to buy a used car and wants a precise price range based on the car's mileage and year. The price of a car with the same characteristics can vary drastically from location to location. The project can also be used to see which cars are most available at a location and which features help customers decide on a car.

Data Collection

The data was collected from the CarMax website using Scrapy and Selenium. So far, only stores in the state of Virginia have been scraped. All the sedans at each Virginia location were scraped and stored as CSV files. Scrapy was used to loop over each location and each car at that location. Selenium was used to perform the clicks: to open the used cars at each location, to filter by type (sedan), and to restrict the results to a 25-mile radius. The clicks were as follows.

1. Click "Used cars at this location" for each of the 10 stores in Virginia.


2. Click on Type and then Sedan to show only the sedans at the location.


3. Filter by distance and choose 25 miles, so that only cars at the store are shown.


Scrapy filters out the duplicate links to each of these cars.
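Scrapy drops duplicate requests by default, so the spider does not need its own bookkeeping for this. The idea can still be illustrated in plain Python. The sketch below is not Scrapy's implementation and not the author's code: the URLs are placeholders, and treating links that differ only by query string, trailing slash, or host capitalization as the same car is an assumption made for the example.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize(url):
    """Reduce a link to scheme, lowercase host, and path without a trailing slash.

    Assumption for this sketch: the query string and fragment do not
    identify a different car, so they are dropped.
    """
    parts = urlsplit(url)
    path = parts.path.rstrip("/")
    return urlunsplit((parts.scheme, parts.netloc.lower(), path, "", ""))


def unique_links(links):
    """Return normalized links in first-seen order, skipping repeats."""
    seen = set()
    result = []
    for link in links:
        key = normalize(link)
        if key not in seen:
            seen.add(key)
            result.append(key)
    return result


# Placeholder links, not real listing URLs.
links = [
    "https://example.com/cars/111",
    "https://example.com/cars/111?ref=store-a",
    "https://example.com/cars/222/",
    "https://EXAMPLE.com/cars/222",
]
print(unique_links(links))
```

Keeping first-seen order means the output CSV lists each car where it was first encountered, which makes it easier to trace a row back to a store page.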

Analysis

a. Most available used car in Virginia

The data shows that Honda is the most available used sedan make at CarMax stores across Virginia. Nissan ranks 4th and Toyota 7th. Model years 2012-2014 are also the most common on the CarMax lots. One possible reason is that these cars have reached the end of their warranty, by either mileage or age.

popular_brand
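A ranking like the one above comes down to counting rows per make and per model year. The following is a minimal sketch using only the standard library. The rows are made up for illustration, and the column names `make` and `year` are assumptions about how the scraped CSV might be laid out.

```python
import csv
import io
from collections import Counter

# Made-up rows standing in for the scraped CSV; column names are assumed.
SAMPLE_CSV = """make,model,year,price,mileage
Honda,Accord,2013,15998,42000
Honda,Civic,2014,14599,38000
Nissan,Altima,2013,13998,51000
Toyota,Camry,2012,12998,60000
Honda,Civic,2012,12599,65000
"""


def count_by(rows, column):
    """Count how many rows share each value of the given column."""
    return Counter(row[column] for row in rows)


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
make_counts = count_by(rows, "make")
year_counts = count_by(rows, "year")

# most_common() sorts by count, highest first.
print(make_counts.most_common(3))
print(year_counts.most_common(3))
```

Note that `csv.DictReader` yields every value as a string, so the year keys here are strings such as `"2013"`.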

b. Most popular features which attract customers

The word cloud displays popular features that may attract customers beyond price and mileage. As it shows, "Auxiliary Audio Input" and "Cruise Control" stand out, along with leather seats and a rear view camera.

wordcloud
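A word cloud is drawn from frequencies, so the underlying step is counting how many cars list each feature. This sketch assumes each car's features were stored as a single string separated by `|`; that storage format and the sample strings are assumptions for illustration only.

```python
from collections import Counter


def feature_frequencies(feature_strings, sep="|"):
    """Count the number of cars that list each feature.

    Each input string holds one car's features joined by `sep`.
    A set is used per car so a feature repeated on one listing
    is counted once.
    """
    counts = Counter()
    for raw in feature_strings:
        features = {f.strip() for f in raw.split(sep) if f.strip()}
        counts.update(features)
    return counts


# Made-up listings using feature names mentioned in the text.
cars = [
    "Cruise Control|Auxiliary Audio Input|Leather Seats",
    "Cruise Control|Rear View Camera",
    "Auxiliary Audio Input|Cruise Control",
]
freqs = feature_frequencies(cars)
print(freqs.most_common())
```

Counting whole phrases rather than single words keeps "Cruise Control" together instead of splitting it into "Cruise" and "Control".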

c.  Price range of sedans in Virginia

Most used sedans at CarMax stores across Virginia are priced between $10K and $20K.

price_range
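One way to arrive at a statement like this is to sort prices into fixed-width buckets and count each bucket. The sketch below assumes 10K-wide buckets and uses made-up prices; neither is taken from the scraped data.

```python
from collections import Counter


def price_bucket(price, width=10_000):
    """Label the bucket a price falls in, e.g. 12998 -> '10K-20K'.

    The lower bound is inclusive and the upper bound exclusive.
    """
    low = (price // width) * width
    return f"{low // 1000}K-{(low + width) // 1000}K"


# Made-up prices for illustration.
prices = [8999, 12998, 15998, 19500, 24998, 31000]
buckets = Counter(price_bucket(p) for p in prices)
print(buckets.most_common())
```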

d. Price range for used sedan based on the year

As seen in the plot, used sedans from 2013 and 2016 have a wider price range than other years. This may be because more luxury cars from those years are on the market. Since luxury cars have only 4 years of warranty, 2013 cars may be traded in more often than others, which widens the gap between the lowest and highest prices.

max_min_price
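The range per model year is the gap between the lowest and highest price among cars of that year. A minimal sketch, assuming each car is available as a `(year, price)` pair; the sample values are made up.

```python
from collections import defaultdict


def price_range_by_year(cars):
    """Map each year to (lowest price, highest price, spread)."""
    by_year = defaultdict(list)
    for year, price in cars:
        by_year[year].append(price)
    return {
        year: (min(prices), max(prices), max(prices) - min(prices))
        for year, prices in sorted(by_year.items())
    }


# Made-up (year, price) pairs for illustration.
cars = [
    (2013, 11998), (2013, 34998),
    (2014, 13998), (2014, 17998),
    (2016, 15998), (2016, 41998),
]
ranges = price_range_by_year(cars)
for year, (low, high, spread) in ranges.items():
    print(year, low, high, spread)
```

A single expensive car is enough to stretch the range for a year, so the spread is sensitive to outliers such as luxury models.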

e.  Heatmap to display price range

The heat map displays the average price of each car brand at each location. The year of the car is not taken into account here, although it can significantly affect the price. Even so, we can see that the average price of a brand such as Audi varies across locations, for example between Gaithersburg and the other locations. By contrast, brands such as Mazda have roughly the same average price across all locations.

heatmap
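Each cell of such a heat map is the mean price for one brand at one location. The sketch below computes those cells with the standard library. The `(make, location, price)` layout, the store names, and the prices are all assumptions for illustration.

```python
from collections import defaultdict
from statistics import mean


def average_price_grid(cars):
    """Map each (make, location) pair to the mean price of its cars."""
    prices = defaultdict(list)
    for make, location, price in cars:
        prices[(make, location)].append(price)
    return {key: mean(values) for key, values in prices.items()}


# Made-up rows; "Store A" and "Store B" are hypothetical locations.
cars = [
    ("Audi", "Store A", 20000),
    ("Audi", "Store A", 30000),
    ("Audi", "Store B", 35000),
    ("Mazda", "Store A", 14000),
    ("Mazda", "Store B", 14000),
]
grid = average_price_grid(cars)
for (make, location), avg in sorted(grid.items()):
    print(make, location, avg)
```

Because model year is ignored, as the text notes, a location that happens to stock newer cars of a brand will show a higher average for that brand.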

f. Scatter plot for price vs mileage based on year of car

scatterplot_price

 

About Author

Annie George

Annie George has more than a decade of experience using mainframe technology and databases such as DB2 and SQLServer to achieve results for organizations in the private sectors. Annie completed her Bachelors in Civil Engineering but she found...

