Using Machine Learning to Build a Predictive Model for House Prices


The purpose of this project was to build a model that predicts housing prices in Ames, Iowa from a dataset of features such as total above-ground living area, neighborhood, and number of bathrooms. We began by familiarizing ourselves with the dataset, addressing missing values, and performing exploratory data analysis and visualization. We then implemented several machine learning techniques, including simple linear regression, lasso and ridge regularization, and tree-based methods such as random forest and XGBoost. We evaluated our models by calculating the root mean squared error (RMSE) between the predicted and observed sale prices to see which model performed best. A linear regression model with lasso regularization yielded the best result, with a root mean squared logarithmic error (RMSLE) of 0.12116 as calculated by Kaggle upon submission. This placed us in the top 18th percentile at the time of submission.

Data and Feature Engineering

The data for this project was provided by Kaggle and was pre-split into training and test sets. The two sets are nearly identical in structure, with 1,460 observations in the training set, 1,459 in the test set, and 79 housing features available to predict the sale price. Our first thought was that location would have a heavy impact on sale price. The sale price per neighborhood is plotted below; we found that the neighborhoods at both extremes were highly important variables.

In moving forward with a linear model, the first assumption to check is normality. Plotting the distribution of the target variable, sale price, we discovered that it was positively skewed, with a skewness of 1.88. After applying a log transformation to allow for higher predictive power, the skewness dropped to 0.12, as shown below.
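The target transformation above can be sketched as follows. This is a minimal, self-contained example: the lognormal draw stands in for the actual `SalePrice` column from the Kaggle data, which is not reproduced here.

```python
import numpy as np
from scipy.stats import skew

# Hypothetical sale prices with a long right tail, standing in for the
# Ames "SalePrice" column (the real data came from Kaggle).
rng = np.random.default_rng(0)
sale_price = rng.lognormal(mean=12, sigma=0.4, size=1460)

print(skew(sale_price))            # positively skewed
log_price = np.log1p(sale_price)   # log(1 + x) transform
print(skew(log_price))             # much closer to zero
```

With the real data, the same two `skew` calls produce the 1.88-before and 0.12-after figures quoted above.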

Afterwards, we investigated the skewness across all numeric features to identify which variables would also need transforming for a better-fitting linear model. Any variable with a skewness above 0.75 was log transformed.
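The 0.75 skewness cutoff can be applied with a simple loop over the numeric columns. The toy frame below is illustrative; the column names are stand-ins, not the actual Kaggle schema.

```python
import numpy as np
import pandas as pd
from scipy.stats import skew

# Toy frame standing in for the numeric Ames features (column names
# are illustrative, not the full 79-column dataset).
rng = np.random.default_rng(1)
df = pd.DataFrame({
    "GrLivArea": rng.lognormal(7, 0.5, 500),    # right-skewed
    "YearBuilt": rng.integers(1900, 2010, 500)  # roughly uniform
})

# Measure skewness per column and log-transform only the skewed ones.
skews = df.apply(lambda col: skew(col.dropna()))
to_transform = skews[skews > 0.75].index
df[to_transform] = np.log1p(df[to_transform])
print(list(to_transform))  # only the skewed columns get transformed
```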

The final aspect of our data cleanup and processing involved dealing with the missing values and the categorical features (e.g. the neighborhood a house belonged to, or kitchen quality). The only numeric features with significant missingness were the year the garage was built and the lot frontage, and both were imputed with the median of the respective variable. For garage year, the jump from a garage built in 1980 to an imputed year of zero could cause significant issues for a linear model; for lot frontage, a house with measurable square footage should not have its frontage imputed as zero. All other missing values were imputed with zero, since whatever remained, we could gather, was simply not a feature of the house, such as missing pool or fence quality/square footage. The remaining categorical features were 'dummified' (one-hot encoded) into numeric indicators in order to (initially) give the model as much information as possible.
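A sketch of this imputation and encoding step, using a small illustrative subset of the columns discussed above (for the categorical `PoolQC`, a missing value is encoded as a `"None"` level rather than the numeric zero used for numeric columns):

```python
import numpy as np
import pandas as pd

# Illustrative subset of the Ames columns discussed above.
df = pd.DataFrame({
    "GarageYrBlt":  [1980, np.nan, 2005],
    "LotFrontage":  [65.0, np.nan, 80.0],
    "PoolQC":       [np.nan, "Gd", np.nan],  # NaN here means "no pool"
    "Neighborhood": ["NAmes", "OldTown", "NAmes"],
})

# Median imputation for the two numeric columns where zero would
# mislead a linear model (a "year zero" garage, a NaN frontage).
for col in ["GarageYrBlt", "LotFrontage"]:
    df[col] = df[col].fillna(df[col].median())

# Remaining missingness is treated as "feature absent".
df["PoolQC"] = df["PoolQC"].fillna("None")

# One-hot encode ("dummify") the categorical features.
df = pd.get_dummies(df, columns=["PoolQC", "Neighborhood"])
```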

Model Selection

Two families of predictive models were considered. We evaluated each by splitting our training data 80/20, using 80% to train the model and the untouched 20% to assess its predictions; the RMSE was computed between the predicted and actual sale prices on that 20%. The first family, linear models, achieved high performance using Lasso for feature selection along with the minimal feature engineering described above, which addressed the skewness of the predictors and the dependent variable. Ridge regression produced very similar results, but with a slightly higher RMSE than the Lasso model. The tree-based models, Random Forest and XGBoost (extreme gradient boosting), produced inferior R-squared values and were discarded from further analysis. Tree-based models would require subjective feature engineering and extensive hyperparameter tuning to work well: after dummifying, the original 79 features grew to 268, which would make it extremely difficult to tune the number of features considered per split. The number of training observations is also relatively low (fewer than 1,500), which further hinders tree-based models. Finally, advanced tree-based models sacrifice interpretability, which also made the linear solution preferable.
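The 80/20 evaluation loop can be sketched as below. The synthetic matrix from `make_regression` stands in for the real dummified data (which had roughly 1,460 rows and 268 columns), and the `alpha` values are illustrative rather than our tuned settings.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the dummified Ames matrix.
X, y = make_regression(n_samples=1460, n_features=268,
                       n_informative=40, noise=10.0, random_state=0)

# 80/20 split: fit on 80%, score on the untouched 20%.
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2,
                                            random_state=0)

results = {}
for name, model in [("lasso", Lasso(alpha=1.0, max_iter=10000)),
                    ("ridge", Ridge(alpha=1.0))]:
    model.fit(X_tr, y_tr)
    rmse = np.sqrt(mean_squared_error(y_val, model.predict(X_val)))
    results[name] = rmse
    print(f"{name}: RMSE = {rmse:.2f}")
```

In practice the regularization strength would itself be tuned by cross-validation (e.g. with `LassoCV`), but the comparison logic is the same.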

Interpreting the Model

One of the main benefits of selecting a linear regression model is its interpretability. First, let's examine the home features that were most important to the model. Note that the larger the magnitude of a variable's coefficient, the more influence it has on the sale price of the home.

Not surprisingly, total above-ground living area was the most influential value-add variable in the dataset. Another feature that showed up numerous times (both as a value-add and as a detractor to home value) is the neighborhood. Tracing this back to the earlier boxplot (and recalling the mantra "location, location, location"), we see that our model, again not surprisingly, indicates that location is a very important factor in determining sale price. It is worth cautioning against directly extracting a dollar value from each of these coefficients.
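Ranking features by coefficient magnitude looks like the sketch below. The tiny fitted model and its feature names (`GrLivArea`, `Neighborhood_*` indicators) are illustrative stand-ins for the full 268-column fit.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Lasso

# Tiny illustrative fit; names stand in for the dummified Ames columns.
rng = np.random.default_rng(2)
X = pd.DataFrame(rng.normal(size=(200, 4)),
                 columns=["GrLivArea", "OverallQual",
                          "Neighborhood_NridgHt", "Neighborhood_OldTown"])
y = (0.30 * X["GrLivArea"] + 0.15 * X["Neighborhood_NridgHt"]
     - 0.10 * X["Neighborhood_OldTown"] + rng.normal(0, 0.02, 200))

model = Lasso(alpha=0.001, max_iter=10000).fit(X, y)

# Rank features by |coefficient|: larger magnitude = more influence on
# (log) sale price; positive = value-add, negative = detractor.
coefs = pd.Series(model.coef_, index=X.columns)
print(coefs.reindex(coefs.abs().sort_values(ascending=False).index))
```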

Since many of these features are on a log scale (to reduce skew, as described earlier) and standardized (mean zero, unit variance), translating these coefficients into marginal increases in a home's price requires a calculation, but thankfully the model does exactly that.
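The key step is that with a log-transformed target, a coefficient acts multiplicatively on price rather than adding a fixed dollar amount. The coefficient value of 0.05 below is a made-up number chosen for illustration:

```python
import numpy as np

# Suppose a feature (on its transformed, standardized scale) has a
# fitted coefficient of 0.05 on log(SalePrice).
beta = 0.05

# A one-standard-deviation increase in the feature multiplies the
# predicted price by exp(beta) -- roughly a 5% increase, not a fixed
# dollar amount.
multiplier = np.exp(beta)
print(f"price multiplier: {multiplier:.4f}")
```

For small coefficients, exp(beta) is approximately 1 + beta, so a coefficient of 0.05 reads as roughly a 5% price change per standard deviation.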

Next Steps

There are many next steps one could take to achieve better results. A sensible starting point would be to improve the results on the subset of homes the algorithm performed worst on. Below we plot our model's predicted home price vs. actual home price (on the training set, of course, since we never see the sale prices for the test set).

We can see that the model performed consistently worse for more expensive homes. This implies that the effect of these variables on home price is not inherently linear, and we would need to augment the data (e.g. with further feature engineering) or the model in order to deal with these homes. We could also augment the dataset with a general "market" indicator (something like the S&P 500 Index as a rough proxy for the value of assets).
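One way to quantify "worse for more expensive homes" is to bin the validation set by actual price and compare the mean absolute error per bin. The data below is synthetic, constructed so that errors widen with price in order to mimic the pattern in the plot:

```python
import numpy as np
import pandas as pd

# Hypothetical predicted vs. actual prices on a held-out set, with an
# error term that widens with price (mimicking the observed pattern).
rng = np.random.default_rng(3)
actual = rng.lognormal(12, 0.4, 300)
predicted = actual * (1 + rng.normal(0, 0.05, 300) * (actual / actual.mean()))

df = pd.DataFrame({"actual": actual, "error": np.abs(predicted - actual)})
df["band"] = pd.qcut(df["actual"], 4, labels=["Q1", "Q2", "Q3", "Q4"])

# Mean absolute error per price quartile; largest in the top band.
band_err = df.groupby("band", observed=True)["error"].mean()
print(band_err)
```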

About Authors

Josh Vichare

BS in Materials Science & Engineering with a concentration on the study of Nanomaterials at Rutgers University. Josh has worked in the biomedical engineering field for close to 4 years in research and development, analyzing various performance metrics...

Jake Ralston

Jake has a Ph.D. in Mathematics from the University of Maryland, College Park. Originally an algebraic topologist, Jake worked as a trader at Bridgewater Associates for two years after completing his degree. He is now a data...
