Data Web Scraping Used Car For Sale

Posted on Aug 22, 2016


Data on 10,000 used cars were collected by searching within a 75-mile radius of 10 major U.S. cities: Atlanta, GA; Chicago, IL; Washington, DC; Denver, CO; Houston, TX; Los Angeles, CA; Miami, FL; New York, NY; Phoenix, AZ; and Seattle, WA. The cities were grouped into two clusters, northern and southern, for summarizing the variables.

The website being scraped is a commercial web-based service that supplies vehicle history reports on used cars and light trucks to individuals and businesses. The car variables include: Price, Mileage, City, State, Engine, Transmission, Drive Type, Fuel Type, MPG City, MPG Highway, Exterior Color, Year, Make, Model.





This project focuses on the following goals:

To study regional patterns in average price, mileage, MPG city/highway, engine type, drive type and exterior color.

To find the most popular car makes, models and colors in the 10 cities.

Lastly, to study the relationship between price and mileage by city or car make.


Data Web Scraping

The web scraping was conducted using Beautiful Soup, and the results were visualized in RStudio. The code can be found at:
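
For readers unfamiliar with Beautiful Soup, the following is a minimal sketch of how listing fields might be pulled out of a page. It is not the author's code: the HTML snippet, the tag and class names, and the helper functions to_int and parse_listings are all hypothetical and do not reflect the markup of the scraped site.

```python
from bs4 import BeautifulSoup

# Hypothetical listing markup, invented for illustration only.
SAMPLE_HTML = """
<div class="listing">
  <span class="title">2013 Toyota Camry</span>
  <span class="price">$12,500</span>
  <span class="mileage">45,200 miles</span>
</div>
<div class="listing">
  <span class="title">2011 Ford F-150</span>
  <span class="price">$18,900</span>
  <span class="mileage">78,010 miles</span>
</div>
"""


def to_int(text):
    """Keep only the digits of a string such as '$12,500' and convert to int."""
    return int("".join(ch for ch in text if ch.isdigit()))


def parse_listings(html):
    """Return one dict per listing with year, make, model, price and mileage."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for card in soup.find_all("div", class_="listing"):
        title = card.find("span", class_="title").get_text(strip=True)
        # Assumes titles look like "<year> <make> <model>".
        year, make, model = title.split(" ", 2)
        rows.append({
            "year": int(year),
            "make": make,
            "model": model,
            "price": to_int(card.find("span", class_="price").get_text()),
            "mileage": to_int(card.find("span", class_="mileage").get_text()),
        })
    return rows


cars = parse_listings(SAMPLE_HTML)
```

In a real scrape the HTML would come from an HTTP response rather than a string, and the selectors would have to match the actual page structure.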




Data Visualization and Analysis

Price vs. Mileage and MPG city vs. MPG highway

Comparing overall price with mileage shows that the two variables follow different distribution patterns by city and by region. The clearest mismatch is in New York and Miami, where price is high relative to mileage; overall mileage is also relatively low in these two cities. The plot summarized by region shows that price is higher and mileage is lower in northern cities than in southern cities. The same pattern can be seen in the boxplot. For MPG city versus MPG highway, there is no obvious variation between cities.
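
The per-city and per-region summaries described above amount to a group-and-average step. The sketch below shows one way to do that in plain Python. The records are toy values invented for illustration, not the scraped data, and the REGION mapping and the summarize_by helper are assumptions, since the post does not list the full assignment of cities to regions.

```python
from collections import defaultdict
from statistics import mean

# Toy records invented for illustration; they are not the scraped data.
cars = [
    {"city": "New York", "price": 21000, "mileage": 30000},
    {"city": "Chicago", "price": 19000, "mileage": 40000},
    {"city": "Houston", "price": 15000, "mileage": 60000},
    {"city": "Miami", "price": 17000, "mileage": 50000},
]

# Assumed assignment of cities to regions (hypothetical).
REGION = {
    "New York": "north",
    "Chicago": "north",
    "Houston": "south",
    "Miami": "south",
}


def summarize_by(records, key_func, fields=("price", "mileage")):
    """Group records by key_func(record) and average each numeric field."""
    groups = defaultdict(list)
    for rec in records:
        groups[key_func(rec)].append(rec)
    return {
        key: {f: mean(r[f] for r in recs) for f in fields}
        for key, recs in groups.items()
    }


by_region = summarize_by(cars, lambda r: REGION[r["city"]])
by_city = summarize_by(cars, lambda r: r["city"])
```

The same helper works for MPG city and MPG highway by passing different field names.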



Drive Type

In some northern cities, such as New York, Chicago and Seattle, people prefer all-wheel drive or four-wheel drive cars, whereas in the southern cities people prefer front-wheel drive or rear-wheel drive cars. One possible reason is that the north gets more snow in winter, and all-wheel drive and four-wheel drive cars handle such weather better.
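
A comparison like this rests on the share of each drive type within a city. A minimal sketch of that calculation follows. The listings are toy values invented for illustration, not the scraped data, and drive_type_shares is a hypothetical helper name.

```python
from collections import Counter, defaultdict

# Toy (city, drive type) pairs invented for illustration; not the scraped data.
listings = [
    ("Seattle", "AWD"), ("Seattle", "AWD"), ("Seattle", "FWD"), ("Seattle", "4WD"),
    ("Phoenix", "FWD"), ("Phoenix", "FWD"), ("Phoenix", "RWD"), ("Phoenix", "AWD"),
]


def drive_type_shares(pairs):
    """Return {city: {drive_type: fraction of that city's listings}}."""
    counts = defaultdict(Counter)
    for city, drive in pairs:
        counts[city][drive] += 1
    return {
        city: {drive: n / sum(c.values()) for drive, n in c.items()}
        for city, c in counts.items()
    }


shares = drive_type_shares(listings)

# Combined share of all-wheel and four-wheel drive in one city.
awd_4wd_seattle = shares["Seattle"].get("AWD", 0) + shares["Seattle"].get("4WD", 0)
```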


Other summaries

For engine type, 4-cylinder engines are the most common in all the cities. In some big cities, such as Chicago and New York, 10- or 12-cylinder cars are also found.

Taking New York, Los Angeles and Miami as examples of the three most popular car models, Japanese models (Nissan Altima, Toyota Camry and Toyota Corolla) are the most popular in New York, while American models (Chevrolet Silverado, Ford F-150 and Ford Mustang) are the top three in Miami. In Los Angeles, the most popular models are a mix of Japanese, Korean and German cars (Honda Accord, Kia Optima and Mercedes-Benz E350). As for colors, gray is the most popular in eight cities, followed by black and red.
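
Ranking the most popular models per city is a counting problem. The sketch below uses collections.Counter for it. The rows are toy values invented for illustration, not the scraped data, and top_models is a hypothetical helper name.

```python
from collections import Counter, defaultdict

# Toy (city, model) rows invented for illustration; not the scraped data.
rows = (
    [("New York", "Nissan Altima")] * 3
    + [("New York", "Toyota Camry")] * 2
    + [("New York", "Honda Civic")] * 1
    + [("Miami", "Ford F-150")] * 2
    + [("Miami", "Ford Mustang")] * 1
)


def top_models(pairs, n=3):
    """Return {city: list of the n most frequent models, most frequent first}."""
    per_city = defaultdict(Counter)
    for city, model in pairs:
        per_city[city][model] += 1
    return {
        city: [model for model, _ in counter.most_common(n)]
        for city, counter in per_city.items()
    }


top = top_models(rows, n=2)
```

The same function can rank exterior colors by passing (city, color) pairs instead.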





Correlation Analysis between Price and Mileage

Three cities, Atlanta, Los Angeles and New York, were selected to compare how the price-mileage scatter varies between locations. The mileage levels of the three cities are quite similar, but New York has the lowest overall price, followed by Atlanta and Los Angeles, which indicates that New York has the best price-performance ratio.

When the price-mileage relationship is plotted by manufacturer origin, Japanese cars are the cheapest at a similar mileage level and German cars are the most expensive. American cars are priced in between.
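
One way to compare groups "at a similar mileage level" is to fit a straight line of price against mileage for each group and evaluate the lines at the same mileage. The sketch below uses ordinary least squares for this. It is an assumption about how such a comparison could be done, not the author's method. The points are toy values invented for illustration, and fit_line, fit_by_group and predict are hypothetical helper names.

```python
from collections import defaultdict

# Toy (group, mileage, price) points invented for illustration; not the scraped data.
points = [
    ("Japan", 20000, 18000), ("Japan", 60000, 12000), ("Japan", 100000, 6000),
    ("Germany", 20000, 30000), ("Germany", 60000, 22000), ("Germany", 100000, 14000),
]


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def fit_by_group(data):
    """Fit one price-vs-mileage line per group (a city or a manufacturer origin)."""
    groups = defaultdict(lambda: ([], []))
    for group, mileage, price in data:
        groups[group][0].append(mileage)
        groups[group][1].append(price)
    return {g: fit_line(xs, ys) for g, (xs, ys) in groups.items()}


fits = fit_by_group(points)


def predict(group, mileage):
    """Price predicted by the fitted line of a group at the given mileage."""
    intercept, slope = fits[group]
    return intercept + slope * mileage
```

Calling predict with the same mileage for two groups gives a like-for-like price comparison under the straight-line assumption.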



About Author

Bin Fang

With a multi-disciplinary background in earth science, electrical engineering and satellite technology, Bin has spent more than ten years in scientific research and teaching in university and research institute. His previous study aimed to integrate and interpret remote...
