Used car price prediction with R

Posted on Jan 25, 2022

The skills the author demonstrated here can be learned through the Data Science with Machine Learning bootcamp at NYC Data Science Academy.

Objective

  • The goal of this project was to fit a multiple linear regression model in R to a given used car price dataset and to predict used car prices on the test data. The data came from one of Kaggle's datasets and is available here. The selling price was the target variable, and the other variables were used as features.

Data

  • Exploratory data analysis (EDA) was applied to the data to check for outliers and missing values. Then, preprocessing was performed on the data.

    figure 1: preprocessed data
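
The post does not list the exact EDA steps, so the following is only a minimal sketch of how missing values and outliers could be checked in base R. The toy data frame `cars`, its values, and the 1.5 * IQR rule are assumptions made for illustration, not the author's actual data or method.

```r
# Hypothetical toy data standing in for the Kaggle used-car file.
cars <- data.frame(
  selling_price = c(450000, 370000, NA, 225000, 130000, 9500000),
  km_driven     = c(145500, 120000, 140000, 127000, NA, 5000),
  year          = c(2014, 2014, 2006, 2010, 2007, 2019)
)

# Count missing values per column.
na_counts <- colSums(is.na(cars))

# Flag outliers with the 1.5 * IQR rule (one common convention, assumed here).
is_outlier <- function(x) {
  q <- unname(quantile(x, c(0.25, 0.75), na.rm = TRUE))
  fence <- 1.5 * (q[2] - q[1])
  !is.na(x) & (x < q[1] - fence | x > q[2] + fence)
}
price_outliers <- is_outlier(cars$selling_price)

# One possible preprocessing step: keep only complete rows.
cars_clean <- cars[complete.cases(cars), ]

print(na_counts)
print(which(price_outliers))
```

Whether to drop, cap, or transform flagged rows is a judgment call that depends on the data; the sketch only shows how such rows could be found.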

  • After preprocessing the data, I studied visualizations of the correlation between 'selling price' and the other features. This link here also shows the visualization graphs.
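
As a rough numeric companion to the plots, the correlation of each numeric feature with the selling price could be tabulated as in the sketch below. The simulated data frame `toy` and its coefficients are made up for illustration; only the column names echo the features named in this post.

```r
# Simulated stand-in data; the relationships below are invented for the sketch.
set.seed(1)
n <- 200
toy <- data.frame(
  year      = sample(2005:2020, n, replace = TRUE),
  km_driven = runif(n, 5000, 200000)
)
toy$selling_price <- 20000 * (toy$year - 2000) - 1 * toy$km_driven +
  rnorm(n, sd = 20000)

# Pearson correlation of every feature with the target.
cors <- cor(toy)[, "selling_price"]
cors <- cors[names(cors) != "selling_price"]

# Strongest positive relationship first.
print(sort(cors, decreasing = TRUE))
```

Calling `pairs(toy)` on the same data frame gives a quick scatterplot matrix for a visual check.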

Modeling

  • Before fitting a model, the nominal features had to be dummified. Then, the data were split into training and test sets with a 70%-30% ratio to fit multiple regression models. The first model, which contained all features, had VIF scores larger than 5 for some variables, as shown in figure 2. The variables with high VIF scores were eng_cc, fuel_Diesel, and fuel_Petrol. There are various ways to deal with high-VIF features, such as removing them from the model. However, fuel type is one of the most important factors when selling or buying a used car. Since figure 3 indicates that fuel_Diesel is strongly correlated with fuel_Petrol and eng_cc, I replaced fuel_Diesel with fuel_CNG in the dummifying step and removed eng_cc.

    figure 2: VIF scores with all features

figure 3: correlation of fuel_Diesel
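
The dummifying, splitting, and VIF steps described above can be sketched in base R as follows. This is an illustration under stated assumptions, not the author's code: the data are simulated, the fuel proportions are invented, and the helper `vif_scores` is a hypothetical name that computes VIF from its definition, 1 / (1 - R²), where R² comes from regressing one feature on the others. The post does not say which function produced its VIF scores; `vif()` from the `car` package is a commonly used alternative.

```r
# Simulated stand-in data; fuel proportions are invented for the sketch.
set.seed(42)
n <- 2000
fuel <- factor(sample(c("Diesel", "Petrol", "CNG", "LPG"), n, replace = TRUE,
                      prob = c(0.54, 0.44, 0.01, 0.01)))
max_pow <- rnorm(n, mean = 90, sd = 30)

# Dummify: one 0/1 column per fuel level, renamed to the fuel_<level> style.
dummies <- model.matrix(~ fuel - 1)
colnames(dummies) <- sub("^fuel", "fuel_", colnames(dummies))

# 70%-30% train-test split by row index.
train_idx <- sample(seq_len(n), size = floor(0.7 * n))

# VIF of each column: 1 / (1 - R^2) from regressing it on the other columns.
vif_scores <- function(X) {
  sapply(colnames(X), function(v) {
    others <- X[, colnames(X) != v, drop = FALSE]
    r2 <- summary(lm(X[, v] ~ others))$r.squared
    1 / (1 - r2)
  })
}

# Design with fuel_Diesel and fuel_Petrol: nearly complementary dummies.
X_diesel <- cbind(max_pow = max_pow,
                  dummies[, c("fuel_Diesel", "fuel_Petrol")])[train_idx, ]
# Design with fuel_CNG in place of fuel_Diesel.
X_cng <- cbind(max_pow = max_pow,
               dummies[, c("fuel_CNG", "fuel_Petrol")])[train_idx, ]

vif_diesel <- vif_scores(X_diesel)
vif_cng <- vif_scores(X_cng)
print(round(vif_diesel, 2))
print(round(vif_cng, 2))
```

In this toy setup, almost every car is either Diesel or Petrol, so the two dummies nearly mirror each other and their VIF scores are inflated. Pairing fuel_Petrol with a rare level such as fuel_CNG removes that near-duplication while keeping fuel type in the model.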

  • As a result, the second model has no features with a high VIF score, as shown in figure 4. However, the model still contains weak features with p-values higher than 0.05, as shown in figure 5. Therefore, the feature 'mil_km' was removed from the model. The final model can be found in figure 6.

figure 4: VIF scores with fuel CNG

figure 5: summary for new dummified model

figure 6: summary for the final model
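
The p-value screening step can be sketched as below. The data are simulated, the 0.05 cutoff follows the text, and the `selling_price` scale and coefficients are invented; which features end up flagged depends on the data, so this is not a reproduction of figures 5 and 6.

```r
# Simulated stand-in data: mil_km has no built-in effect on the target.
set.seed(7)
n <- 300
d <- data.frame(
  year    = runif(n, 2005, 2020),
  max_pow = rnorm(n, mean = 90, sd = 30),
  mil_km  = rnorm(n, mean = 19, sd = 4)
)
d$selling_price <- 0.1 * (d$year - 2005) + 0.01 * d$max_pow +
  rnorm(n, sd = 0.3)

# Fit the model with all candidate features.
fit_all <- lm(selling_price ~ year + max_pow + mil_km, data = d)

# Extract the p-value of each feature (dropping the intercept row).
p_values <- summary(fit_all)$coefficients[-1, "Pr(>|t|)"]

# Keep the features at or below the 0.05 cutoff and refit without the rest.
keep <- names(p_values)[p_values <= 0.05]
fit_final <- lm(reformulate(keep, response = "selling_price"), data = d)

print(round(p_values, 4))
print(summary(fit_final)$coefficients)
```

Dropping features on p-values alone is a simple heuristic; refitting after each removal, as done here, matters because the remaining p-values can shift.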

Result

  • After an AIC/BIC comparison, the final model was confirmed as a good model to use. So, the selling price could be predicted from the features year, km_driven, seats, max_pow, fuel_CNG, fuel_Petrol, seller_type_Individual, transmission_Manual, and the owner types. The next step was to see how well the final model predicted the test data: the predicted R-squared score was 0.868, with a standard error of 0.379. This means that the model explains 86.8% of the variance in the test data, and the standard error relative to the actual values was 0.379. Some helpful figures are shown below.
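
For readers who want to try a similar evaluation, here is a minimal sketch of an AIC/BIC comparison followed by a test-set check. It rests on several assumptions: the data are simulated, the column `noise` is a deliberately useless feature, and the "standard error" is computed as the root mean squared error of the test predictions, which is only one plausible reading of the figure quoted above.

```r
# Simulated stand-in data on an arbitrary scale.
set.seed(3)
n <- 500
d <- data.frame(
  year    = runif(n, 2005, 2020),
  max_pow = rnorm(n, mean = 90, sd = 30),
  noise   = rnorm(n)
)
d$selling_price <- 0.1 * (d$year - 2005) + 0.01 * d$max_pow +
  rnorm(n, sd = 0.3)

# 70%-30% train-test split.
train_idx <- sample(seq_len(n), size = floor(0.7 * n))
train <- d[train_idx, ]
test  <- d[-train_idx, ]

# Two candidate models fitted on the training data only.
full    <- lm(selling_price ~ year + max_pow + noise, data = train)
reduced <- lm(selling_price ~ year + max_pow, data = train)

# Lower AIC/BIC indicates a better trade-off between fit and complexity.
ic <- data.frame(
  model = c("full", "reduced"),
  AIC   = c(AIC(full), AIC(reduced)),
  BIC   = c(BIC(full), BIC(reduced))
)
print(ic)

# Evaluate the reduced model on the held-out test data.
pred <- predict(reduced, newdata = test)
resid_test <- test$selling_price - pred
rmse <- sqrt(mean(resid_test^2))
r2_test <- 1 - sum(resid_test^2) /
  sum((test$selling_price - mean(test$selling_price))^2)
print(c(rmse = rmse, r2_test = r2_test))
```

Computing R-squared on the test set rather than the training set is what makes it a measure of predictive performance instead of in-sample fit.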

Future

  • Even though a predicted R-squared value of 86.8% is not bad, it is still difficult to say that the model gives reliable predictions, and a better prediction model may exist. Therefore, using a different model such as a Random Forest Regressor may produce an improved result.

About Author

Jungu Kang

Passionate to challenge problems with certification as a Data Scientist and with experience in engineering background and project management in the food industry. Detail-oriented, eclectic, industrious, easy-going.