Used car price prediction with R

Jungu Kang

Posted on Jan 25, 2022

The skills the authors demonstrated here can be learned through taking Data Science with Machine Learning bootcamp with NYC Data Science Academy.

Objective

The goal of this project was to fit a multiple linear regression model in R to a used car price dataset and to use it to predict prices on held-out test data. The data came from one of Kaggle's datasets and is available here. The selling price was the target variable, and the other variables were used as features.

Data

Exploratory data analysis (EDA) was applied to the data to check for outliers and missing values. Then the data was preprocessed.
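The checks described above can be sketched in base R. This is a minimal illustration on toy data, not the author's actual code: the column names, the values, the 1.5 * IQR outlier rule, and the choice to drop incomplete rows are all assumptions made for the example.

```r
# Toy data standing in for the Kaggle file; names and values are made up.
cars <- data.frame(
  selling_price = c(450000, 370000, 158000, 225000, 130000, 9500000, 440000, NA),
  km_driven     = c(145500, 120000, 140000, 127000, 120000, 30000, 45000, 80000),
  seats         = c(5, 5, 5, NA, 5, 4, 5, 7)
)

# Count missing values per column.
missing_per_column <- colSums(is.na(cars))
print(missing_per_column)

# Flag outliers in selling_price with the 1.5 * IQR rule (one common choice).
q   <- unname(quantile(cars$selling_price, c(0.25, 0.75), na.rm = TRUE))
iqr <- q[2] - q[1]
is_outlier <- !is.na(cars$selling_price) &
  (cars$selling_price < q[1] - 1.5 * iqr |
   cars$selling_price > q[2] + 1.5 * iqr)
print(cars[is_outlier, ])

# One simple preprocessing choice: keep only complete rows.
cars_clean <- na.omit(cars)
```

Whether to drop, cap, or keep flagged rows depends on the data; the post does not say which approach was taken.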

figure 1: preprocessed data

After preprocessing the data, I examined visualizations of the correlation between 'selling price' and the other features. The link here also shows these graphs.

Modeling

Before fitting a model, the nominal features had to be dummified (converted to dummy variables). The data was then split into training and test sets at a 70%-30% ratio to build multiple regression models. The first model, which contained all features, had VIF scores above 5 for some variables, as shown in figure 2. The variables with high VIF scores were eng_cc, fuel_Diesel, and fuel_Petrol. There are various ways to deal with high-VIF features, such as removing them from the model. However, fuel type is one of the most important factors when selling or buying a used car. Since figure 3 indicates that fuel_Diesel is strongly correlated with fuel_Petrol and eng_cc, I replaced fuel_Diesel with fuel_CNG in the dummifying step and removed eng_cc.
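The dummifying and splitting steps can be sketched as follows. This is a hedged example on simulated data, using base R's model.matrix() for the dummy columns; the column names, factor levels, and seed are assumptions, and the post does not state which functions were actually used.

```r
set.seed(42)  # arbitrary seed so the split is reproducible

# Simulated data; column names and levels are assumptions for illustration.
n <- 200
cars <- data.frame(
  selling_price = rlnorm(n, meanlog = 13, sdlog = 0.5),
  year          = sample(2005:2020, n, replace = TRUE),
  fuel          = factor(sample(c("CNG", "Diesel", "Petrol"), n, replace = TRUE)),
  transmission  = factor(sample(c("Automatic", "Manual"), n, replace = TRUE))
)

# model.matrix() expands each factor into 0/1 columns and drops one
# reference level per factor (the first level by default).
dummies    <- model.matrix(selling_price ~ ., data = cars)[, -1]  # drop intercept
cars_dummy <- data.frame(selling_price = cars$selling_price, dummies)
print(names(cars_dummy))

# 70/30 train-test split on row indices.
train_idx <- sample(seq_len(n), size = floor(0.7 * n))
train <- cars_dummy[train_idx, ]
test  <- cars_dummy[-train_idx, ]
```

Note that model.matrix() names the columns fuelDiesel and fuelPetrol rather than fuel_Diesel and fuel_Petrol as in the post, so the author likely used a different helper or renamed the columns.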

figure 2: VIF scores with all features

figure 3: correlation of fuel_Diesel
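One plausible reason that swapping the dummy columns helps is sketched below. It assumes that CNG cars are rare in the data, so the fuel_Diesel and fuel_Petrol columns are close to mirror images of each other; the class proportions here are invented, and vif_manual is a hypothetical helper that computes VIF from its definition rather than a function from the post.

```r
set.seed(1)

n <- 2000
# Assumed class balance: CNG is rare, Diesel and Petrol dominate.
fuel <- factor(sample(c("CNG", "Diesel", "Petrol"), n, replace = TRUE,
                      prob = c(0.02, 0.54, 0.44)))
max_pow <- rnorm(n, mean = 90, sd = 25)

# VIF_j = 1 / (1 - R^2_j), where R^2_j comes from regressing
# predictor j on all the other predictors.
vif_manual <- function(X) {
  X <- as.data.frame(X)
  sapply(names(X), function(v) {
    others <- setdiff(names(X), v)
    r2 <- summary(lm(reformulate(others, response = v), data = X))$r.squared
    1 / (1 - r2)
  })
}

# Reference level CNG: dummy columns for Diesel and Petrol.
X_ref_cng <- data.frame(
  fuel_Diesel = as.numeric(fuel == "Diesel"),
  fuel_Petrol = as.numeric(fuel == "Petrol"),
  max_pow     = max_pow
)

# Reference level Diesel: dummy columns for CNG and Petrol.
X_ref_diesel <- data.frame(
  fuel_CNG    = as.numeric(fuel == "CNG"),
  fuel_Petrol = as.numeric(fuel == "Petrol"),
  max_pow     = max_pow
)

vif_cng    <- vif_manual(X_ref_cng)
vif_diesel <- vif_manual(X_ref_diesel)
print(round(vif_cng, 2))
print(round(vif_diesel, 2))
```

Under these assumptions, using a common category as the reference level keeps fuel type in the model while removing most of the collinearity between the fuel dummies.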

As a result, the second model has no features with a high VIF score, as shown in figure 4. However, the model still contains weak features with p-values higher than 0.05, as shown in figure 5. Therefore, the feature 'mil_km' was removed from the model. The final model can be found in figure 6.

figure 4: VIF scores with fuel CNG

figure 5: summary for new dummified model

figure 6: summary for the final model
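The p-value check behind the removal of mil_km can be sketched like this. The data is simulated so that mil_km has no real effect on price; the coefficients, noise level, and three-feature formula are assumptions for the example, not the author's model.

```r
set.seed(7)

n <- 300
train <- data.frame(
  year    = sample(2005:2020, n, replace = TRUE),
  max_pow = rnorm(n, mean = 90, sd = 25),
  mil_km  = rnorm(n, mean = 19, sd = 4)  # no true effect in this toy data
)
# Simulated target in arbitrary units.
train$selling_price <- 0.1 * (train$year - 2005) +
  0.01 * train$max_pow + rnorm(n, sd = 0.3)

fit <- lm(selling_price ~ year + max_pow + mil_km, data = train)

# Pull the coefficient table and list features with p-value above 0.05.
coefs    <- summary(fit)$coefficients
p_values <- coefs[-1, "Pr(>|t|)"]          # drop the intercept row
weak     <- names(p_values)[p_values > 0.05]
print(round(p_values, 4))
print(weak)

# Refit without the feature that was judged uninformative.
fit_reduced <- update(fit, . ~ . - mil_km)
print(names(coef(fit_reduced)))
```

Dropping features purely on a 0.05 threshold is a convention rather than a rule, so it is worth confirming the reduced model with another criterion such as AIC or BIC.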

Result

After an AIC/BIC comparison, the final model was confirmed as a good model to use. The trained model therefore predicts the selling price from the features year, km_driven, seats, max_pow, fuel_CNG, fuel_Petrol, seller_type_Individual, transmission_Manual, and the owner types. The next step was to see how well the final model predicted the test data: the predicted R-squared score was 0.868, with a standard error of 0.379. This means that the model explains 86.8% of the variance in the test data, and the standard error relative to the actual values was 0.379. Some helpful figures are shown below.
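The comparison and test-set evaluation can be sketched as below. This uses simulated data and two small candidate models, and it assumes that the reported standard error corresponds to root-mean-square error on the test set; none of the numbers it produces relate to the results above.

```r
set.seed(123)

n <- 400
cars <- data.frame(
  year    = sample(2005:2020, n, replace = TRUE),
  max_pow = rnorm(n, mean = 90, sd = 25),
  mil_km  = rnorm(n, mean = 19, sd = 4)
)
# Simulated target in arbitrary units.
cars$selling_price <- 0.1 * (cars$year - 2005) +
  0.01 * cars$max_pow + rnorm(n, sd = 0.3)

train_idx <- sample(seq_len(n), size = floor(0.7 * n))
train <- cars[train_idx, ]
test  <- cars[-train_idx, ]

fit_full  <- lm(selling_price ~ year + max_pow + mil_km, data = train)
fit_final <- lm(selling_price ~ year + max_pow, data = train)

# Lower AIC/BIC indicates a better trade-off between fit and complexity.
comparison <- data.frame(
  model = c("full", "final"),
  AIC   = c(AIC(fit_full), AIC(fit_final)),
  BIC   = c(BIC(fit_full), BIC(fit_final))
)
print(comparison)

# Evaluate the final model on the held-out test set.
pred    <- predict(fit_final, newdata = test)
resid   <- test$selling_price - pred
rmse    <- sqrt(mean(resid^2))
r2_test <- 1 - sum(resid^2) /
  sum((test$selling_price - mean(test$selling_price))^2)
print(c(rmse = rmse, r2_test = r2_test))
```

Computing R-squared on the test set rather than reading it from summary() keeps the score honest, because summary() reports fit on the training data only.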

Future

Even though a predicted R-squared value of 86.8% is not bad, it is still difficult to say that the model gives reliable predictions, and a better prediction model may exist. Therefore, a different model such as a Random Forest regressor may produce an improved result.

About Author

Jungu Kang

Passionate to challenge problems with certification as a Data Scientist and with experience in engineering background and project management in the food industry. Detail-oriented, eclectic, industrious, easy-going.
