Web Scraping Used Cars for Sale
Listings for 10,000 used cars were searched within a 75-mile radius of 10 major U.S. cities: Atlanta, GA; Chicago, IL; Washington, DC; Denver, CO; Houston, TX; Los Angeles, CA; Miami, FL; New York, NY; Phoenix, AZ; and Seattle, WA. The cities were grouped into two clusters, northern and southern, for summarizing the variables.
The website scraped is http://www.carfax.com/, a commercial web-based service that supplies vehicle history reports on used cars and light trucks to individuals and businesses. The variables collected for each car are: Price, Mileage, City, State, Engine, Transmission, Drive Type, Fuel Type, MPG City, MPG Highway, Exterior Color, Year, Make, Model.
This project focuses on the following goals:
To study regional patterns in average price, mileage, MPG city/highway, engine type, drive type and exterior color.
To find the most popular car makes, models and colors in the 10 cities.
To study the relationship between price and mileage by city and by car make.
The web scraping was done with Beautiful Soup, and the results were visualized in RStudio. The code can be found at: https://github.com/nycdatasci/bootcamp006_project/tree/master/Project3-WebScraping/BinFang
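As a rough illustration of the scraping step, the sketch below uses Beautiful Soup to pull a few of the variables out of a results page. The HTML snippet, the class names (listing, title, price, mileage) and the helper parse_listings are hypothetical stand-ins; the real Carfax markup is different and is not reproduced here, and the author's actual code is in the repository linked above.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for a search-results page (not real Carfax HTML).
SAMPLE_HTML = """
<div class="listing">
  <span class="title">2014 Toyota Camry</span>
  <span class="price">$12,500</span>
  <span class="mileage">45,210 miles</span>
</div>
<div class="listing">
  <span class="title">2012 Ford F-150</span>
  <span class="price">$18,900</span>
  <span class="mileage">72,030 miles</span>
</div>
"""

def to_int(text):
    """Keep only the digits, e.g. '$12,500' -> 12500."""
    return int("".join(ch for ch in text if ch.isdigit()))

def parse_listings(html):
    """Turn each listing block into a dict of car variables."""
    soup = BeautifulSoup(html, "html.parser")
    cars = []
    for node in soup.select("div.listing"):
        title = node.select_one(".title").get_text(strip=True)
        # Assumes titles of the form "<year> <make> <model>".
        year, make, model = title.split(" ", 2)
        cars.append({
            "Year": int(year),
            "Make": make,
            "Model": model,
            "Price": to_int(node.select_one(".price").get_text()),
            "Mileage": to_int(node.select_one(".mileage").get_text()),
        })
    return cars

cars = parse_listings(SAMPLE_HTML)
for car in cars:
    print(car)
```

In a real run the HTML would come from the downloaded results pages, and the selectors would have to be adapted to whatever markup the site serves at the time.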
Data Visualization and Analysis
Price vs. Mileage and MPG city vs. MPG highway
When overall price is compared with mileage, the two variables show different distribution patterns by city and by region. The clearest divergence is in New York and Miami, where prices are high relative to mileage, and overall mileage is also comparatively low in these two cities. The plot summarized by region shows that prices are higher and mileage is lower in northern cities than in southern cities. The same divergence can be seen in the boxplot. As for MPG city versus MPG highway, there is no obvious variation between the two across cities.
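The regional comparison amounts to a group-by average over the scraped records. A minimal sketch in plain Python follows; the records are made-up placeholders rather than scraped values, and the assignment of cities to the northern or southern cluster in REGION is an assumption, since the exact split is not listed in this post.

```python
from collections import defaultdict
from statistics import mean

# Assumed city-to-cluster mapping (illustrative only).
REGION = {
    "New York": "north",
    "Chicago": "north",
    "Miami": "south",
    "Houston": "south",
}

# Placeholder records, not actual scraped data.
records = [
    {"City": "New York", "Price": 20000, "Mileage": 30000},
    {"City": "Chicago", "Price": 18000, "Mileage": 40000},
    {"City": "Miami", "Price": 16000, "Mileage": 50000},
    {"City": "Houston", "Price": 14000, "Mileage": 60000},
]

def summarize_by_region(rows):
    """Average price and mileage for each regional cluster."""
    groups = defaultdict(list)
    for row in rows:
        groups[REGION[row["City"]]].append(row)
    return {
        region: {
            "avg_price": mean(r["Price"] for r in cars),
            "avg_mileage": mean(r["Mileage"] for r in cars),
        }
        for region, cars in groups.items()
    }

summary = summarize_by_region(records)
for region, stats in summary.items():
    print(region, stats)
```

Grouping by row["City"] instead of the mapped region gives the per-city version of the same summary.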
In some northern cities, such as New York, Chicago and Seattle, people tend to drive all-wheel-drive or four-wheel-drive cars, whereas in the southern cities front-wheel-drive or rear-wheel-drive cars are more common. One possible reason is that the north gets more snow in winter, and all-wheel-drive and four-wheel-drive cars handle such weather better.
As for engine type, 4-cylinder engines appear to be the most common in all cities. In some big cities, such as Chicago and New York, 10- and 12-cylinder cars also appear.
Taking New York, Los Angeles and Miami as examples of the three most popular car models, Japanese models (Nissan Altima, Toyota Camry and Toyota Corolla) are the most popular in New York, while American models (Chevrolet Silverado, Ford F-150 and Ford Mustang) are the top three in Miami. In Los Angeles, the top three are a mix of Japanese, Korean and German models (Honda Accord, Kia Optima and Mercedes-Benz E350). As for color, gray is the most popular in eight cities, followed by black and red.
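Ranking the most popular models per city is a frequency count. The sketch below shows one way to do it with collections.Counter; the listings are a tiny made-up sample, and top_models is a hypothetical helper name, not part of the original code.

```python
from collections import Counter, defaultdict

# Made-up (city, model) pairs for illustration, not scraped data.
listings = [
    ("New York", "Nissan Altima"),
    ("New York", "Nissan Altima"),
    ("New York", "Toyota Camry"),
    ("Miami", "Ford F-150"),
    ("Miami", "Ford F-150"),
    ("Miami", "Ford Mustang"),
]

def top_models(rows, n=3):
    """Return the n most frequent models for each city."""
    counts = defaultdict(Counter)
    for city, model in rows:
        counts[city][model] += 1
    return {
        city: [model for model, _ in counter.most_common(n)]
        for city, counter in counts.items()
    }

popular = top_models(listings, n=1)
print(popular)
```

The same function works for makes or exterior colors by swapping the second element of each pair.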
Correlation Analysis between Price and Mileage
Three cities, Atlanta, Los Angeles and New York, were selected to compare how the price-mileage scatter differs by city. The mileage levels of the three cities are quite similar, but New York has the lowest overall price, followed by Atlanta and Los Angeles, which suggests that New York has the best price-performance ratio.
When the price-mileage relationship is plotted by manufacturer, Japanese cars are the cheapest at similar mileage levels and German cars are the most expensive, with American cars priced in between.
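The price-mileage relationship for each group can be quantified with a Pearson correlation and a least-squares slope. The sketch below computes both in plain Python; the numbers are placeholders chosen to be perfectly linear, the group labels are generic, and nothing here reproduces the author's R analysis.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def slope(xs, ys):
    """Least-squares slope of y on x (price change per mile)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return cov / sum((x - mx) ** 2 for x in xs)

# Placeholder (mileage, price) samples per group, not scraped data.
groups = {
    "Group A": ([20000, 40000, 60000], [22000, 18000, 14000]),
    "Group B": ([20000, 40000, 60000], [30000, 27000, 24000]),
}

results = {
    name: {"r": pearson(miles, prices), "slope": slope(miles, prices)}
    for name, (miles, prices) in groups.items()
}
for name, stats in results.items():
    print(name, stats)
```

A negative slope means price drops as mileage rises; comparing slopes and intercepts across cities or makes is one way to put numbers on what the scatter plots show.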