Web Scraping Used Cars for Sale

Bin Fang
Posted on Aug 22, 2016


A total of 10,000 used cars were collected by searching within a 75-mile radius of 10 major U.S. cities: Atlanta, GA; Chicago, IL; Washington, DC; Denver, CO; Houston, TX; Los Angeles, CA; Miami, FL; New York, NY; Phoenix, AZ; and Seattle, WA. The cities were grouped into two clusters, northern and southern, for summarizing the variables.

The website scraped is http://www.carfax.com/, a commercial web-based service that supplies vehicle history reports on used cars and light trucks to individuals and businesses. The variables collected for each car are: Price, Mileage, City, State, Engine, Transmission, Drive Type, Fuel Type, MPG City, MPG Highway, Exterior Color, Year, Make, Model.





This project focuses on the following goals:

To study regional patterns in average price, mileage, MPG city/highway, engine type, drive type and exterior color.

To find the most popular car makes, models and colors in the 10 cities.

To study the relationship between price and mileage by city or car make.


Web Scraping

The web scraping was conducted with the Beautiful Soup library in Python, and the results were visualized in RStudio. The code can be found at: https://github.com/nycdatasci/bootcamp006_project/tree/master/Project3-WebScraping/BinFang
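As a rough illustration of the scraping step, the sketch below parses a small inline HTML snippet with Beautiful Soup and extracts a few of the variables listed earlier. The markup, the class names (listing, title, price, mileage, location) and the helper functions are assumptions made up for this example. They do not reflect the real structure of carfax.com or the code in the repository.

```python
from bs4 import BeautifulSoup

# Hypothetical listing markup: the tag and class names below are assumptions
# for illustration, not the real structure of carfax.com.
SAMPLE_HTML = """
<div class="listing">
  <span class="title">2013 Toyota Camry</span>
  <span class="price">$12,500</span>
  <span class="mileage">45,210 miles</span>
  <span class="location">Atlanta, GA</span>
</div>
<div class="listing">
  <span class="title">2015 Ford F-150</span>
  <span class="price">$27,900</span>
  <span class="mileage">30,002 miles</span>
  <span class="location">Houston, TX</span>
</div>
"""


def to_int(text):
    """Keep only the digits of a string such as '$12,500' or '45,210 miles'."""
    return int("".join(ch for ch in text if ch.isdigit()))


def field(node, css_class):
    """Return the stripped text of the first <span> with the given class."""
    return node.find("span", class_=css_class).get_text(strip=True)


def parse_listings(html):
    """Return one dict per hypothetical listing block found in the HTML."""
    soup = BeautifulSoup(html, "html.parser")
    cars = []
    for node in soup.find_all("div", class_="listing"):
        # Title is assumed to look like "<year> <make> <model>".
        year, make, model = field(node, "title").split(" ", 2)
        # Location is assumed to look like "<city>, <state>".
        city, state = field(node, "location").split(", ")
        cars.append({
            "Year": int(year),
            "Make": make,
            "Model": model,
            "Price": to_int(field(node, "price")),
            "Mileage": to_int(field(node, "mileage")),
            "City": city,
            "State": state,
        })
    return cars


cars = parse_listings(SAMPLE_HTML)
for car in cars:
    print(car)
```

In a real run the HTML would come from the search result pages rather than a string, and the selectors would have to be adapted to whatever markup the site actually serves.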




Data Visualization and Analysis

Price vs. Mileage and MPG city vs. MPG highway

When comparing overall price with mileage, the two variables show different distribution patterns by city and by region. The clearest contrast is found in New York and Miami, where price is high relative to mileage; overall mileage is also relatively low in these two cities. The plot summarized by region shows that price is higher and mileage is lower in northern cities than in southern cities. The same contrast can be observed in the boxplot. As for MPG city and MPG highway, the relationship between the two shows no obvious variation across cities.
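The regional summary described above amounts to grouping listings by region and averaging each variable. A minimal sketch with the standard library is shown below. The rows are made-up numbers for illustration only, and the southern half of the city-to-region mapping is an assumption, since the post does not list the members of each cluster.

```python
from collections import defaultdict
from statistics import mean

# Assumed city-to-region mapping for a few cities. The post names New York,
# Chicago and Seattle as northern; the southern entries are an assumption.
REGION = {
    "New York": "north", "Chicago": "north", "Seattle": "north",
    "Houston": "south", "Atlanta": "south", "Phoenix": "south",
}

# Made-up rows for illustration only: (city, price in USD, mileage in miles).
rows = [
    ("New York", 21000, 30000),
    ("Chicago", 19000, 40000),
    ("Seattle", 20000, 35000),
    ("Houston", 15000, 60000),
    ("Atlanta", 15000, 54000),
    ("Phoenix", 18000, 42000),
]


def summarize_by_region(rows, region_of):
    """Return the mean price and mean mileage for each region."""
    grouped = defaultdict(lambda: {"price": [], "mileage": []})
    for city, price, mileage in rows:
        region = region_of[city]
        grouped[region]["price"].append(price)
        grouped[region]["mileage"].append(mileage)
    return {
        region: {
            "mean_price": mean(values["price"]),
            "mean_mileage": mean(values["mileage"]),
        }
        for region, values in grouped.items()
    }


summary = summarize_by_region(rows, REGION)
for region, stats in summary.items():
    print(region, stats)
```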



Drive Type

In some northern cities, such as New York, Chicago and Seattle, people prefer all-wheel drive or four-wheel drive cars, whereas in the southern cities people prefer front-wheel drive or rear-wheel drive cars. One possible reason is that northern winters bring more snow, and all-wheel drive or four-wheel drive cars handle such weather better.
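One way to make this comparison concrete is to compute the share of each drive type per city. The sketch below does so on a handful of made-up listings. The abbreviations AWD, 4WD, FWD and RWD and the helper names are choices made for this example.

```python
from collections import Counter, defaultdict

# Made-up (city, drive type) pairs for illustration only.
listings = [
    ("Chicago", "AWD"), ("Chicago", "4WD"), ("Chicago", "FWD"), ("Chicago", "AWD"),
    ("Houston", "FWD"), ("Houston", "RWD"), ("Houston", "FWD"), ("Houston", "AWD"),
]


def drive_type_share(listings):
    """Return, for each city, the fraction of listings with each drive type."""
    counts = defaultdict(Counter)
    for city, drive in listings:
        counts[city][drive] += 1
    return {
        city: {drive: n / sum(c.values()) for drive, n in c.items()}
        for city, c in counts.items()
    }


def awd_4wd_share(shares, city):
    """Combined share of all-wheel and four-wheel drive listings in a city."""
    return shares[city].get("AWD", 0) + shares[city].get("4WD", 0)


shares = drive_type_share(listings)
for city in shares:
    print(city, awd_4wd_share(shares, city))
```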


Other summaries

As for engine type, 4-cylinder engines are the most common in all the cities. It can also be noted that in some big cities, such as Chicago and New York, some people drive 10- or 12-cylinder cars.

For the three most popular car models, take New York, Los Angeles and Miami as examples. Japanese models (Nissan Altima, Toyota Camry and Toyota Corolla) are the most popular in New York, while American models (Chevrolet Silverado, Ford F-150 and Ford Mustang) are the top three in Miami. In Los Angeles, the top three are a mix of Japanese, Korean and German models (Honda Accord, Kia Optima and Mercedes-Benz E350). As for color, gray is the most popular in eight of the cities; the other most popular colors are black and red.
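Ranking the most popular models per city is a counting exercise. The sketch below uses collections.Counter on made-up listings. The counts are illustrative only and do not reproduce the scraped data.

```python
from collections import Counter, defaultdict

# Made-up (city, model) pairs for illustration only.
listings = [
    ("New York", "Nissan Altima"), ("New York", "Nissan Altima"),
    ("New York", "Nissan Altima"), ("New York", "Toyota Camry"),
    ("New York", "Toyota Camry"), ("New York", "Honda Civic"),
    ("Miami", "Ford F-150"), ("Miami", "Ford F-150"),
    ("Miami", "Ford Mustang"),
]


def top_models(listings, n=3):
    """Return the n most frequent models in each city, most common first."""
    per_city = defaultdict(Counter)
    for city, model in listings:
        per_city[city][model] += 1
    return {
        city: [model for model, _ in counter.most_common(n)]
        for city, counter in per_city.items()
    }


top = top_models(listings, n=2)
for city, models in top.items():
    print(city, models)
```

The same function applies unchanged to makes or exterior colors if the second element of each pair is swapped for that variable.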





Correlation Analysis between Price and Mileage

Three cities, Atlanta, Los Angeles and New York, were selected to compare how the price-versus-mileage scatter plots differ by location. Mileage levels are quite similar across the three cities, but New York has the lowest overall price, followed by Atlanta and Los Angeles, which suggests that New York has the best price-performance ratio.
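To put a number on the price-mileage relationship in each city, one option is the Pearson correlation coefficient. The sketch below computes it by hand on made-up data. Both the values and the choice of Pearson correlation are assumptions for illustration, since the post does not state which statistic was used.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Made-up (mileage list, price list) per city, for illustration only.
by_city = {
    "Atlanta": ([20000, 50000, 80000, 110000], [21000, 17000, 12500, 9000]),
    "Los Angeles": ([20000, 50000, 80000, 110000], [24000, 18500, 15000, 10000]),
    "New York": ([20000, 50000, 80000, 110000], [19000, 15500, 11000, 8500]),
}

r_by_city = {
    city: pearson_r(mileage, price)
    for city, (mileage, price) in by_city.items()
}
for city, r in r_by_city.items():
    print(city, round(r, 3))
```

A value close to -1 means price falls almost linearly as mileage rises.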

When the price-mileage relationship is plotted by car manufacturer, it can be seen that at similar mileage levels Japanese cars are the cheapest and German cars are the most expensive, with American cars priced in between.
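Comparing groups "at similar mileage" can be sketched by fitting a straight line of price against mileage for each group and evaluating every line at the same mileage. The example below uses ordinary least squares on made-up points. The data, the linear model and the 50,000-mile reference point are all assumptions for illustration, not the method behind the plots.

```python
def fit_line(points):
    """Ordinary least squares fit of price = intercept + slope * mileage.

    points is a list of (mileage, price) pairs; returns (intercept, slope).
    """
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def predict(line, mileage):
    """Price predicted by a fitted (intercept, slope) line at a given mileage."""
    intercept, slope = line
    return intercept + slope * mileage


# Made-up (mileage, price) points per group, for illustration only.
data = {
    "Japanese": [(20000, 18000), (40000, 16000), (60000, 14000)],
    "American": [(20000, 22000), (40000, 19000), (60000, 16000)],
    "German": [(20000, 30000), (40000, 26000), (60000, 22000)],
}

REFERENCE_MILEAGE = 50000  # arbitrary comparison point
predicted = {
    group: predict(fit_line(points), REFERENCE_MILEAGE)
    for group, points in data.items()
}
ranking = sorted(predicted, key=predicted.get)  # cheapest first
print(ranking)
```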



About Author

Bin Fang


With a multi-disciplinary background in earth science, electrical engineering and satellite technology, Bin has spent more than ten years in scientific research and teaching in university and research institute. His previous study aimed to integrate and interpret remote...