Data Study on Car Brand Preferences

Posted on Sep 13, 2017

Buying a new car is a big and exciting step, especially when it is your first car. Research (a new study to be published in the Journal of Industrial Economics) has shown that young people tend to be biased in favor of the car brand that their parents own.

This tendency might be caused by brand loyalty or by the positive experience people had with the car brand; however, it could also be that they choose the familiar simply because they do not know how to decide among brands. Even those who actively consider buying a different brand and model would find most car purchasing websites unhelpful, as these require that a search be based on a particular brand and model.

In order to remedy the problem of making a truly informed decision about buying one’s first car, the following research question was proposed: “To what extent can the car brand and model be recommended based on people’s car preferences?”



In order to answer this research question and assist people in choosing their dream car, there is a need for data that identifies the car brand and model that meet an individual’s specifications.


As such data is not publicly available, it had to be extracted from existing car sales websites that represent the cars currently on the market. Consequently, data on 13,000 cars, covering 20 brands and 37 car characteristics, was scraped from a popular car sales website using BeautifulSoup in Python. As the cars on this website can be considered new cars, the recommendations of the platform are limited to the cars in this sample.
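The scraping step can be sketched roughly as follows. This is a minimal illustration, not the author's scraper: it uses Python's built-in html.parser instead of BeautifulSoup so that it runs without extra packages, and the page structure (a div with class "car" containing spans for brand, model and price) and all values are hypothetical.

```python
from html.parser import HTMLParser

class CarListingParser(HTMLParser):
    """Collect one dict per listing from a hypothetical page layout."""

    FIELDS = ("brand", "model", "price")  # assumed span classes

    def __init__(self):
        super().__init__()
        self.cars = []
        self._field = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div" and attrs.get("class") == "car":
            self.cars.append({})  # a new listing starts
        elif tag == "span" and self.cars and attrs.get("class") in self.FIELDS:
            self._field = attrs["class"]

    def handle_data(self, data):
        if self._field and data.strip():
            self.cars[-1][self._field] = data.strip()

    def handle_endtag(self, tag):
        if tag == "span":
            self._field = None

# Made-up sample page; a real scraper would download the HTML first.
page = """
<div class="car"><span class="brand">BrandA</span>
  <span class="model">ModelX</span><span class="price">24995</span></div>
<div class="car"><span class="brand">BrandB</span>
  <span class="model">ModelY</span><span class="price">31500</span></div>
"""

parser = CarListingParser()
parser.feed(page)
print(parser.cars)
```

With BeautifulSoup the same extraction would typically be a few find_all calls; the structure of the resulting records is the same.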

To ensure that the collected data could be used for further analysis, it was processed in R, and missing values were imputed using the k-nearest neighbors (KNN) machine learning algorithm. Most of the missing values were found for fuel consumption (8.6% missing) and for the acceleration of the car (6.9% missing).


As mentioned earlier, these missing values were imputed using KNN, based on the car price, car brand and car type, using the Euclidean distance and a K equal to the square root of n, resulting in a dataset without missing values. As the true values of the missing entries are unknown, the accuracy of the imputation was not quantified, especially given the multidimensional solution space.
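The imputation idea can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the R code used in the study: only numeric features are used (the categorical brand and type would need encoding), features are not rescaled, n is taken to be the number of complete rows, and the field names and values are made up.

```python
import math

def knn_impute(rows, target, features):
    """Fill missing `target` values (None) with the mean over the k nearest complete rows."""
    complete = [r for r in rows if r[target] is not None]
    # K = sqrt(n); n is taken as the number of complete rows (assumption)
    k = max(1, round(math.sqrt(len(complete))))
    filled = []
    for row in rows:
        if row[target] is not None:
            filled.append(dict(row))
            continue
        point = [row[f] for f in features]
        # Euclidean distance between the incomplete row and every complete row
        nearest = sorted(
            complete,
            key=lambda c: math.dist(point, [c[f] for f in features]),
        )[:k]
        filled.append({**row, target: sum(c[target] for c in nearest) / len(nearest)})
    return filled

# Hypothetical records: price (thousands), power (hp), fuel consumption (l/100 km)
cars = [
    {"price": 10.0, "power": 60.0, "fuel": 4.0},
    {"price": 12.0, "power": 70.0, "fuel": 5.0},
    {"price": 50.0, "power": 300.0, "fuel": 10.0},
    {"price": 52.0, "power": 310.0, "fuel": 12.0},
    {"price": 11.0, "power": 65.0, "fuel": None},
]
imputed = knn_impute(cars, target="fuel", features=["price", "power"])
print(imputed[-1])
```

In practice the features would usually be standardized first, so that a variable with a large range such as price does not dominate the distance.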

Analysis of the data reveals interesting relationships. For example, there is a strong relationship between the car price and the engine size: cars with a higher price tend to have larger engines. Additionally, the data indicates that expensive cars tend to have worse fuel mileage. Overall, many of the variables are strongly correlated with one another, which allows the use of various machine learning algorithms.


Cross-correlation plot for numeric variables
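Each cell of such a plot is a pairwise Pearson correlation coefficient. A minimal sketch of that calculation follows; the numbers are made up for illustration and are not taken from the scraped dataset.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equally long numeric lists."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical values: engine size (l), price (thousands), mileage (km/l)
engine_size = [1.0, 1.5, 2.0, 3.0]
price = [10.0, 15.0, 20.0, 30.0]
mileage = [20.0, 18.0, 15.0, 10.0]

price_vs_engine = pearson(price, engine_size)
price_vs_mileage = pearson(price, mileage)
print(price_vs_engine, price_vs_mileage)
```

A value near 1 corresponds to the price and engine size pattern described above, and a negative value to the price and fuel mileage pattern.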


Machine Learning Algorithms

In order to predict features that might be of importance to potential car buyers and to make the recommendation platform independent of external data sources, specific car features will be predicted using machine learning algorithms. This holds most true for the car price, which is dependent on multiple variables, as indicated in the previous section.

Since the cars were scraped in order of brand, cars with the same make and model were located in the same section of the dataset, which introduced serial correlation. In order to eliminate this serial correlation, the order of the rows was randomized before the data was used in machine learning algorithms. Additionally, for validation purposes, the data was split into a training and a testing data set using a 4:1 ratio.
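The shuffle and the 4:1 split can be sketched as follows. The function name, the seed, and the toy rows are assumptions for illustration.

```python
import random

def shuffle_split(rows, train_fraction=0.8, seed=42):
    """Randomize the row order, then split into training and testing sets."""
    rng = random.Random(seed)   # fixed seed so the split is reproducible
    shuffled = rows[:]          # copy, so the original order is left untouched
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Toy data ordered by brand, mimicking the order in which the cars were scraped
rows = [("brand_a", i) for i in range(50)] + [("brand_b", i) for i in range(50)]
train, test = shuffle_split(rows)
print(len(train), len(test))
```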

Linear Regression Model

First, a multiple linear regression model (first model) was developed using the features that exhibited high correlation with the car price, as found in the EDA stage and presented in the previous section. An adjusted R² (calculated using the test data) of 0.899 was achieved. When the Breusch-Godfrey test was performed on the estimated model, serial correlation was detected, due to the fact that the observations in the dataset were listed by car make and model.

In order to resolve this issue, the dataset was randomized by the method mentioned in the introduction of this section. In addition to serial correlation, the residuals were found to be heteroscedastic, implying that the variance of the residuals is unequal across car prices. Although this is a violation of one of the Gauss-Markov assumptions for best linear unbiased estimators, a review of the residual plot of the model did not identify high levels of heteroscedasticity, and the model is assumed to be appropriate.
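For reference, the adjusted R² penalizes the ordinary R² for the number of features in the model. A small sketch of the standard formula, with made-up prices and predictions:

```python
def r_squared(actual, predicted):
    """Share of the variance in `actual` explained by `predicted`."""
    mean = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

def adjusted_r_squared(r2, n_obs, n_features):
    """Adjust R² for model size: 1 - (1 - R²)(n - 1)/(n - p - 1)."""
    return 1 - (1 - r2) * (n_obs - 1) / (n_obs - n_features - 1)

# Hypothetical test-set prices and model predictions (thousands)
actual = [10.0, 20.0, 30.0, 40.0, 50.0]
predicted = [12.0, 18.0, 33.0, 39.0, 48.0]

r2 = r_squared(actual, predicted)
adj_r2 = adjusted_r_squared(r2, n_obs=len(actual), n_features=1)
print(r2, adj_r2)
```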

Multiple Linear Regression

Second, a multiple linear regression model with forward stepwise selection was developed. This algorithm starts from the mean (a model with no independent variables) and adds one feature at a time, up to a model that contains all the features, and selects the combination of features that returns the lowest Bayesian Information Criterion (BIC). An adjusted R² (calculated using test data) of 0.914 was achieved.
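For a linear model with Gaussian errors, the BIC is commonly written, up to an additive constant, as shown below. Here n is the number of observations, k the number of estimated parameters, and RSS the residual sum of squares; whether the software used in this study reports exactly this form is an assumption.

```latex
\mathrm{BIC} = n \,\ln\!\left(\frac{\mathrm{RSS}}{n}\right) + k \,\ln(n)
```

The first term rewards a better fit, while the second term grows with every added feature, so a feature is only kept when it reduces the RSS enough to offset its penalty of ln(n).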

In order to further improve the linear models, the Box-Cox transformation was applied to the variable that represents the car price, using a pre-determined lambda. This transformation reshapes the data so that its distribution approaches a normal distribution. This model returned an adjusted R² of 0.9, which is essentially no improvement over the first, manually fitted multiple linear regression model (0.899).
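The Box-Cox transformation itself is a one-line formula. The sketch below shows it together with its inverse, which is needed to turn predictions back into prices; the lambda of 0.5 and the sample values are purely illustrative, as the lambda used in the study is not stated.

```python
import math

def box_cox(values, lam):
    """Box-Cox transform: (y^lam - 1)/lam, or ln(y) when lam is 0."""
    if any(v <= 0 for v in values):
        raise ValueError("Box-Cox requires strictly positive values")
    if lam == 0:
        return [math.log(v) for v in values]
    return [(v ** lam - 1) / lam for v in values]

def inverse_box_cox(values, lam):
    """Map transformed values back to the original scale."""
    if lam == 0:
        return [math.exp(v) for v in values]
    return [(lam * v + 1) ** (1 / lam) for v in values]

prices = [1.0, 4.0, 9.0]              # illustrative values only
transformed = box_cox(prices, lam=0.5)
restored = inverse_box_cox(transformed, lam=0.5)
print(transformed, restored)
```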

Shrinkage/Regularization Method

Third, a shrinkage/regularization method was deployed to improve upon the results of the previously presented models. Traditional shrinkage methods (Lasso and Ridge regression) have drawbacks with regard to datasets that contain multiple highly correlated variables. If Lasso regression were deployed, many of the coefficients could be reduced to zero even though the corresponding variables have high explanatory power. The Elastic Net model can overcome this issue by introducing an additional hyperparameter that balances between Ridge and Lasso regression.

Additionally, a second hyperparameter was used that determines the balance between the reduction of the MSE and the complexity of the model. In order to determine the hyperparameters that minimize the MSE while retaining a simple model, 10-fold cross validation was performed. The adjusted R² obtained from this model is 0.9218.
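The two hyperparameters can be made explicit by writing out the Elastic Net objective. The form below follows the parameterization used by the glmnet package; that the study used this exact parameterization is an assumption. Here α balances Lasso (α = 1) against Ridge (α = 0), and λ sets the overall strength of the penalty relative to the squared error.

```latex
\hat{\beta} = \arg\min_{\beta_0,\,\beta}\;
  \frac{1}{2n}\sum_{i=1}^{n}\left(y_i - \beta_0 - x_i^{\top}\beta\right)^2
  + \lambda\left[\alpha\,\lVert\beta\rVert_1
  + \frac{1-\alpha}{2}\,\lVert\beta\rVert_2^2\right]
```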

Gradient Boosting Model

Fourth and last, a gradient boosting model (GBM) was used to further improve the prediction accuracy. The model was constructed with 7000 decision trees, an interaction depth of 4, and a shrinkage factor of 0.1, all of which were determined by means of cross validation of the hyperparameters. With an adjusted R² of 0.9569, the GBM appeared to have the highest predictive power of all the models mentioned.

Overall, it is possible to conclude that even though the GBM outperforms all other models, there is always a preference for simple models. Therefore, since the multiple linear regression model performs very well, it will be deployed in the car recommendation platform.
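To illustrate how the number of trees and the shrinkage factor interact, here is a deliberately simplified gradient boosting sketch for squared error. It is not the model from this study: it uses a single feature, depth-1 trees (stumps) instead of an interaction depth of 4, only 100 trees, and made-up data.

```python
def fit_stump(x, residuals):
    """Find the single threshold split that minimizes the squared error."""
    best = None
    for threshold in sorted(set(x))[:-1]:  # keeps both sides non-empty
        left = [r for xi, r in zip(x, residuals) if xi <= threshold]
        right = [r for xi, r in zip(x, residuals) if xi > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    return best[1:]

def fit_gbm(x, y, n_trees=100, shrinkage=0.1):
    """Start from the mean, then repeatedly fit a stump to the residuals."""
    base = sum(y) / len(y)
    predictions = [base] * len(y)
    stumps = []
    for _ in range(n_trees):
        residuals = [yi - pi for yi, pi in zip(y, predictions)]
        threshold, left_mean, right_mean = fit_stump(x, residuals)
        stumps.append((threshold, left_mean, right_mean))
        # Each tree only contributes a fraction (the shrinkage) of its fit
        predictions = [
            p + shrinkage * (left_mean if xi <= threshold else right_mean)
            for xi, p in zip(x, predictions)
        ]
    return base, shrinkage, stumps

def predict_gbm(model, xi):
    base, shrinkage, stumps = model
    return base + sum(shrinkage * (l if xi <= t else r) for t, l, r in stumps)

# Hypothetical data: engine size (l) and price (thousands)
engine_size = [1.0, 1.2, 1.6, 2.0, 3.0, 4.0]
price = [10.0, 12.0, 18.0, 25.0, 45.0, 70.0]
model = fit_gbm(engine_size, price)
fitted = [predict_gbm(model, x) for x in engine_size]
print(fitted)
```

A smaller shrinkage factor makes each tree contribute less, which generally requires more trees but tends to generalize better; this is why the two values are usually tuned together.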


Car Recommendations

Being able to predict the car price and other interesting features of the car creates the opportunity to present more information to the user when recommendations are made. However, at this point we have not yet explored how car models are recommended, which is the topic of this section.

Recommendations in general are based on the assumption that users with the same preferences will rate items similarly. This suggests that if someone likes all the features of a specific car, he or she will most likely like the car as a whole. In order to find a car that is in line with the desires of a particular user, one has to find users with the same desires and aggregate the types of cars these users have.

The comparison between these users and the user for whom the recommendation is made is based on the k nearest users, where the distance is determined using the Pearson correlation coefficient or the cosine similarity. Once these users have been identified, their ratings are aggregated, and the recommendation is made for the new user. This technique is referred to as User Based Collaborative Filtering.
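The steps above can be sketched as follows. This is a minimal illustration with the cosine similarity, not the implementation behind the application: the user names, car names and ratings are hypothetical, and similarity is computed over co-rated items only.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two rating dicts, over co-rated items."""
    shared = [item for item in a if item in b]
    if not shared:
        return 0.0
    dot = sum(a[i] * b[i] for i in shared)
    norm_a = math.sqrt(sum(a[i] ** 2 for i in shared))
    norm_b = math.sqrt(sum(b[i] ** 2 for i in shared))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def recommend(target, users, k=2, n=10):
    """Recommend up to n unseen items based on the k most similar users."""
    neighbours = sorted(
        users.values(),
        key=lambda u: cosine_similarity(target, u),
        reverse=True,
    )[:k]
    totals, weights = {}, {}
    for user in neighbours:
        sim = cosine_similarity(target, user)
        for item, rating in user.items():
            if item in target:
                continue  # only score items the target has not rated
            totals[item] = totals.get(item, 0.0) + sim * rating
            weights[item] = weights.get(item, 0.0) + sim
    # Similarity-weighted average rating per candidate item
    scores = {i: totals[i] / weights[i] for i in totals if weights[i] > 0}
    return sorted(scores, key=scores.get, reverse=True)[:n]

# Hypothetical rating matrix (1 = dislike, 5 = like)
ratings = {
    "anna": {"hatchback_a": 5, "sedan_b": 4, "suv_c": 1, "coupe_d": 5},
    "ben": {"hatchback_a": 4, "sedan_b": 5, "suv_c": 1, "wagon_e": 4},
    "carl": {"hatchback_a": 1, "sedan_b": 1, "suv_c": 5, "pickup_f": 5},
}
new_user = {"hatchback_a": 5, "sedan_b": 5, "suv_c": 1}
recommendations = recommend(new_user, ratings, k=2, n=2)
print(recommendations)
```

The new user's tastes resemble those of the first two users, so only cars rated by those two are considered; the car liked by the dissimilar third user is never suggested.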


User Based Collaborative Filtering is a semi-supervised learning technique that uses specific entries in a training rating matrix to determine unspecified entries in a testing rating matrix. Consequently, it is possible to determine the accuracy of the recommendations. In order to evaluate the recommendations, 10-fold cross-validation with 7 given items was performed, and the performance was investigated by means of the ROC and Precision/Recall graphs.

The ROC curve shows the relation between the true positive rate (y axis) and the false positive rate (x axis). The results indicate that the majority of the recommendations are correct, with only a small number of false positives (wrong predictions). Overall, the ROC curve indicates that the model predicts accurately, as the curve approaches the top left corner; a perfect model would pass exactly through the top left corner.

ROC Plot

The Precision/Recall curve shows the relation between the precision (how useful the results are) and the recall (how complete the results are). It is possible to infer from the graph that for a small number of recommendations the precision is very high; however, with increasing recall, the precision tends to decline towards zero. Overall, it can be concluded that the model performs well, as the curve tends towards the top right corner of the graph, which represents a perfect model. Furthermore, the drop in precision is limited by capping the number of recommendations at 10.


Precision Recall Plot
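Each point on the ROC and Precision/Recall curves corresponds to one list length n. A small sketch of how such a point can be computed for a single user follows; the ranked list, the set of relevant items and the catalogue size are made up.

```python
def evaluate_top_n(ranked, relevant, catalogue_size, n):
    """Precision, recall (= true positive rate) and false positive rate at n."""
    recommended = ranked[:n]
    tp = len(set(recommended) & set(relevant))   # relevant and recommended
    fp = len(recommended) - tp                   # recommended but not relevant
    fn = len(relevant) - tp                      # relevant but not recommended
    tn = catalogue_size - tp - fp - fn           # everything else
    return {
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
        "fpr": fp / (fp + tn),
    }

# Hypothetical ranked recommendations and the items the user actually likes
ranked = ["a", "b", "c", "d"]
relevant = {"a", "c", "e"}

at_1 = evaluate_top_n(ranked, relevant, catalogue_size=10, n=1)
at_4 = evaluate_top_n(ranked, relevant, catalogue_size=10, n=4)
print(at_1, at_4)
```

In this toy example the precision falls from 1.0 to 0.5 as the list grows from one to four items while the recall rises, which is the trade-off shown in the Precision/Recall plot.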

Based on this model and the desired car specifications, 10 car brands, including the car model, are recommended to the user.


Shiny Data Application

By combining the recommendation model with the machine learning models, it is possible to build a user interface that presents car recommendations based on user inputs, such as the desired engine horsepower or car type. These user inputs can be set manually through drop-down menus and checkboxes for specific car options. Alternatively, it is possible to describe your dream car (limited to certain characteristics) in one sentence and allow the recommendation model to determine 10 possible cars. The recommended cars are presented one by one with a car price, emission level, fuel consumption indication, and image for easy comparison.

When a user likes a car, he or she can use the like button, which redirects the user to a website where the car can be purchased. Additionally, the information that was used for the recommendation and the “liked” car are stored outside the application and used for future recommendations. The application can be found here:


In conclusion, the recommendation application employs user based collaborative filtering and regression techniques in order to recommend the car brand and model to users, based on specific car characteristics. Performance verification indicates that both the recommendation algorithm and the regression models perform well, resulting in a trustworthy recommendation platform.


About Authors

Steven Jongerden

Steven graduated summa cum laude from the Delft University of Technology with a Masters degree in Engineering and Policy Analysis and a Bachelors degree in Aerospace Engineering. He is currently a Data Science Consultant employed by Capgemini Netherlands....
