Used car price prediction with R
The skills the authors demonstrated here can be learned through the Data Science with Machine Learning bootcamp at NYC Data Science Academy.
- The goal of this project was to fit a multiple linear regression model in R on a used car price dataset and use it to predict used car prices for the test data. The data came from one of Kaggle's datasets and is available here. The selling price was the target variable, and the other variables were used as features.
- Exploratory data analysis (EDA) was used to check the data for outliers and missing values. The data were then preprocessed.
- After preprocessing the data, I visualized the correlation between 'selling price' and the other features. This link here also shows the visualization graphs.
- Before fitting a model, the nominal features had to be dummified. The data were then split into training and test sets with a 70%-30% ratio to build multiple regression models. The first model, which contained all features, had VIF scores larger than 5 for some variables, as shown in figure 2. The high-VIF variables were eng_cc, fuel_Diesel, and fuel_Petrol. There are various ways to deal with high-VIF features, such as removing them from the model. However, fuel type is one of the most important factors when buying or selling a used car. Since figure 3 indicates that fuel_Diesel is strongly correlated with fuel_Petrol and eng_cc, I replaced fuel_Diesel with fuel_CNG in the dummifying step and removed eng_cc.
figure 3: correlation of fuel_Diesel
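The dummifying, 70%-30% split, and VIF check described above can be sketched roughly as follows. This is a minimal illustration and not the original project code: the toy data frame, the helper names `dummify_fuel` and `vif_scores`, and the simulated coefficients are all assumptions made for the example. VIF is computed by hand in base R as 1 / (1 - R²) from regressing each feature on the others; `car::vif` is a common alternative.

```r
set.seed(42)

# Hypothetical toy data standing in for the preprocessed used car data.
n <- 300
cars <- data.frame(
  year      = sample(2005:2020, n, replace = TRUE),
  km_driven = round(runif(n, 5000, 200000)),
  max_pow   = round(runif(n, 40, 200), 1),
  fuel      = factor(sample(c("CNG", "Diesel", "Petrol"), n, replace = TRUE,
                            prob = c(0.1, 0.5, 0.4)))
)
# Assumption: engine size tracks power closely, which creates collinearity.
cars$eng_cc <- round(8 * cars$max_pow + rnorm(n, 600, 60))
cars$selling_price <- 10 + 0.05 * (cars$year - 2005) - 2e-6 * cars$km_driven +
  0.005 * cars$max_pow + 0.2 * (cars$fuel == "Diesel") + rnorm(n, sd = 0.3)

# Dummify the nominal feature; the level given as `ref` is the dropped baseline.
dummify_fuel <- function(df, ref) {
  df$fuel <- relevel(df$fuel, ref = ref)
  dummies <- model.matrix(~ fuel, data = df)[, -1, drop = FALSE]
  colnames(dummies) <- sub("^fuel", "fuel_", colnames(dummies))
  cbind(df[, setdiff(names(df), "fuel")], dummies)
}

# VIF of each feature: regress it on all other features, then 1 / (1 - R^2).
vif_scores <- function(df, target) {
  preds <- setdiff(names(df), target)
  sapply(preds, function(p) {
    others <- setdiff(preds, p)
    r2 <- summary(lm(reformulate(others, response = p), data = df))$r.squared
    1 / (1 - r2)
  })
}

# 70%-30% train-test split on row indices.
train_idx <- sample(seq_len(n), size = round(0.7 * n))

# First model: all features, with fuel_Diesel and fuel_Petrol as dummies.
data1  <- dummify_fuel(cars, ref = "CNG")
train1 <- data1[train_idx, ]
vif1   <- vif_scores(train1, "selling_price")

# Second model: fuel_CNG used instead of fuel_Diesel, and eng_cc removed.
data2  <- dummify_fuel(cars, ref = "Diesel")
data2$eng_cc <- NULL
train2 <- data2[train_idx, ]
vif2   <- vif_scores(train2, "selling_price")

print(round(vif1, 2))
print(round(vif2, 2))
```

Changing the reference level only changes which fuel dummies enter the model, so fuel type stays in the model while the strongly correlated pair is broken up.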
- As a result, the second model had no features with a high VIF score, as shown in figure 4. However, the model still contained features with a p-value higher than 0.05, as shown in figure 5. Therefore, the feature 'mil_km' was removed from the model. The final model can be found in figure 6.
- After an AIC/BIC comparison, the final model was confirmed as a good model to use. The trained model therefore predicts selling price from the features year, km_driven, seats, max_pow, fuel_CNG, fuel_Petrol, seller_type_Individual, transmission_Manual, and the owner types. The next step was to see how well the final model predicted the test data: the predicted R-squared was 0.868, with a standard error of 0.379. This means that the model explains 86.8% of the variance in the test data, and the standard error relative to the actual values was 0.379. Some helpful figures are shown below.
- Although a predicted R-squared of 86.8% is not bad, it is still hard to claim that the model gives reliable predictions, and a better prediction model likely exists. A different model, such as a random forest regressor, may produce an improved result.
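The feature selection and evaluation steps above (dropping features with p-values above 0.05, comparing models by AIC/BIC, and scoring the final model on the test data) can be sketched as below. This is a hedged illustration on simulated data, not the original analysis: the data frame, the column `noise_feat`, and the coefficients are hypothetical. It also assumes that the predicted R-squared is computed as 1 - SSE/SST on the test set and that the standard error is the root mean squared error of the test predictions, since the post does not state the exact formulas.

```r
set.seed(1)

# Hypothetical toy data; `noise_feat` plays the role of an uninformative feature.
n <- 400
cars <- data.frame(
  year       = sample(2005:2020, n, replace = TRUE),
  km_driven  = runif(n, 5000, 200000),
  max_pow    = runif(n, 40, 200),
  noise_feat = rnorm(n)
)
cars$selling_price <- 10 + 0.08 * (cars$year - 2005) - 3e-6 * cars$km_driven +
  0.01 * cars$max_pow + rnorm(n, sd = 0.35)

# 70%-30% train-test split.
train_idx <- sample(seq_len(n), size = round(0.7 * n))
train <- cars[train_idx, ]
test  <- cars[-train_idx, ]

# Fit the model with all features and read off the coefficient p-values.
full_fit <- lm(selling_price ~ ., data = train)
p_values <- summary(full_fit)$coefficients[-1, "Pr(>|t|)"]

# Keep only the features whose p-value is at most 0.05 and refit.
weak <- names(p_values)[p_values > 0.05]
keep <- setdiff(names(p_values), weak)
reduced_fit <- lm(reformulate(keep, response = "selling_price"), data = train)

# Compare the two candidate models; lower AIC/BIC is preferred.
print(AIC(full_fit, reduced_fit))
print(BIC(full_fit, reduced_fit))

# Evaluate the reduced model on the held-out test data.
pred <- predict(reduced_fit, newdata = test)
sse  <- sum((test$selling_price - pred)^2)
sst  <- sum((test$selling_price - mean(test$selling_price))^2)
test_r2   <- 1 - sse / sst                              # assumed definition
test_rmse <- sqrt(mean((test$selling_price - pred)^2))  # assumed definition

cat("Test R-squared:", round(test_r2, 3), "\n")
cat("Test RMSE:", round(test_rmse, 3), "\n")
```

Scoring on the held-out 30% rather than on the training data is what makes the R-squared a measure of predictive performance and not just of fit.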