Using Machine Learning to Build a Predictive Model for House Prices
Introduction
The purpose of this project was to build a model to predict housing prices in Ames, Iowa from a given data set with features such as total living area above ground, neighborhood, and number of bathrooms. We began by familiarizing ourselves with the dataset, addressing missing values, and performing exploratory data analysis and visualization. We then implemented several machine learning techniques, including simple linear regression, lasso and ridge regularization, and tree-based methods such as random forest and XGBoost. We evaluated our models by calculating the root mean squared error (RMSE) between the predicted and observed sale prices to see which model performed best. A linear regression model with lasso regularization yielded the best result, with a root mean squared logarithmic error (RMSLE) of 0.12116 as calculated by Kaggle upon submission, placing us in the top 18% of submissions at the time.
Data and Feature Engineering
The data for this project was provided by Kaggle and was pre-split into a training set and a testing set. The two sets are nearly identical in size, with 1,460 observations in the training set and 1,459 in the test set, each carrying 79 housing features available to predict the sale price. Our first thought in predicting the sale price was that location would have a heavy impact. The sale price per neighborhood is plotted below. We found that the neighborhoods at both extremes were highly important variables.
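The neighborhood comparison can be sketched as a simple groupby. This is a minimal illustration with synthetic prices (the neighborhood names come from the Ames data, but the numbers here are made up for demonstration):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Ames data: real neighborhood names, synthetic prices
rng = np.random.default_rng(3)
df = pd.DataFrame({
    "Neighborhood": np.repeat(["MeadowV", "NAmes", "StoneBr"], 50),
    "SalePrice": np.concatenate([
        rng.normal(100_000, 15_000, 50),   # low-end neighborhood
        rng.normal(150_000, 20_000, 50),   # mid-range neighborhood
        rng.normal(320_000, 50_000, 50),   # high-end neighborhood
    ]),
})

# Median sale price per neighborhood, sorted ascending; this ordering is
# what the boxplot visualizes
medians = df.groupby("Neighborhood")["SalePrice"].median().sort_values()
print(medians)
```

Sorting the medians this way makes the extremes at both ends of the neighborhood distribution easy to spot.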
In moving forward with a linear model, the first assumption to test is the normality of the target variable. Plotting the distribution of sale price revealed that it was positively skewed, with a skewness of 1.88. After applying a log transformation to allow for higher predictive power, the skewness dropped to 0.12, as shown in the figure below.
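The transformation step can be sketched as follows. Since the Kaggle data isn't reproduced here, this uses a synthetic, positively skewed price series as a stand-in:

```python
import numpy as np
import pandas as pd
from scipy.stats import skew

# Synthetic, positively skewed "sale price" stand-in for illustration
rng = np.random.default_rng(0)
prices = pd.Series(np.exp(rng.normal(12, 0.4, size=1460)))

print(round(skew(prices), 2))      # clearly positive skew before transforming
log_prices = np.log1p(prices)      # log(1 + x), robust if any value is zero
print(round(skew(log_prices), 2))  # much closer to zero afterwards
```

Using log1p rather than a plain log is a common precaution so that zero-valued observations do not blow up to negative infinity.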
Afterwards, we investigated the skewness of all numeric features to identify which variables would also need transforming to allow for a better-fitting linear model. Any variable with a skewness above 0.75 was log-transformed.
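Applying the 0.75 threshold across the numeric columns looks roughly like this. The two-column frame below is a toy stand-in (the column names mirror the Ames data, but the values are synthetic):

```python
import numpy as np
import pandas as pd
from scipy.stats import skew

# Toy numeric frame standing in for the Ames features
rng = np.random.default_rng(1)
df = pd.DataFrame({
    "GrLivArea": np.exp(rng.normal(7, 0.5, 500)),  # heavily right-skewed
    "YearBuilt": rng.integers(1900, 2010, 500),    # roughly symmetric
})

# Log-transform any numeric column whose skewness exceeds 0.75
skewed = df.apply(lambda col: skew(col.dropna()))
to_fix = skewed[skewed > 0.75].index
df[to_fix] = np.log1p(df[to_fix])
```

Only the right-skewed column gets transformed; symmetric features are left on their original scale.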
The final aspect of our data cleanup and processing involved dealing with the missing values and the categorical features (e.g., which neighborhood the house belonged to, or kitchen quality). The only numerical features with significant amounts of missingness were the year the garage was built and the lot frontage; both were imputed with the median value of the respective variable. Imputing zero instead would have been misleading in both cases: the jump from a garage built in 1980 to an imputed "year zero" could cause significant issues for a linear model, and a house with measurable square footage should not have its lot frontage set to zero. All other missing values were imputed with zero, since for those variables we could gather that missingness meant the house lacked the feature altogether, such as missing values for pool or fence quality/square footage. The remaining categorical features were "dummified" (one-hot encoded) into numerical values in order to (initially) give the model we were developing as much information as possible.
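The cleanup pipeline can be sketched on a hypothetical miniature frame (the column names mirror the Ames data; the values are invented for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frame; column names mirror the Ames data
df = pd.DataFrame({
    "GarageYrBlt":  [1980, np.nan, 2005, 1995],
    "LotFrontage":  [60.0, np.nan, 80.0, 70.0],
    "PoolQC":       [np.nan, "Gd", np.nan, np.nan],  # NA means "no pool"
    "Neighborhood": ["NAmes", "StoneBr", "NAmes", "OldTown"],
})

# Median imputation where zero would be misleading
# (a garage "built in year 0" would wreck a linear fit)
for col in ["GarageYrBlt", "LotFrontage"]:
    df[col] = df[col].fillna(df[col].median())

# One-hot encode ("dummify") the categoricals; a missing quality code
# simply produces all-zero dummies, matching the "feature absent" reading
df = pd.get_dummies(df, columns=["PoolQC", "Neighborhood"], dummy_na=False)

# Any remaining missingness means the feature is absent, so zero is appropriate
df = df.fillna(0)
```

After this step every column is numeric, which is what the linear models downstream require.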
Model Selection
Two families of predictive models were considered to solve the problem. To evaluate them, we first split our training dataset 80/20, using 80% of the data to train each model. After training, the untouched 20% was used to evaluate the predictions, directly assessing the model's performance: we computed the RMSE between the predicted and actual sale prices on that hold-out 20%. The first family, linear models, achieved high performance using Lasso as a feature-selection technique together with minimal feature engineering that addressed the skewness of the predictors and the dependent variable. Ridge regression produced nearly identical results, though with a slightly higher RMSE than the Lasso model. The tree-based models, Random Forest and XGBoost (extreme gradient boosting), produced inferior R-squared values and were discarded from further analysis. Tree-based models would require subjective feature engineering and extensive hyperparameter tuning to work well: of the original 79 features, 268 remained after dummification, and this many features would make it extremely difficult to tune the number of features considered per split. The number of training observations is also relatively low (fewer than 1,500), which further hinders tree-based models' performance. Finally, interpretability is lost with advanced tree-based models, which also makes the linear solution preferable.
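The evaluation protocol described above can be sketched with scikit-learn. Since the Ames matrices aren't included here, this uses a synthetic regression problem sized like our post-dummification data (about 1,460 rows and 268 features, many of them uninformative):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in sized like the post-dummification Ames design matrix
X, y = make_regression(n_samples=1460, n_features=268,
                       n_informative=40, noise=10.0, random_state=0)

# 80/20 split: train on 80%, hold out 20% for evaluation
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

lasso = Lasso(alpha=1.0).fit(X_tr, y_tr)
ridge = Ridge(alpha=1.0).fit(X_tr, y_tr)

rmse_lasso = np.sqrt(mean_squared_error(y_te, lasso.predict(X_te)))
rmse_ridge = np.sqrt(mean_squared_error(y_te, ridge.predict(X_te)))

# Lasso doubles as feature selection: many coefficients shrink to exactly zero,
# while Ridge keeps every feature with a nonzero weight
n_dropped = int(np.sum(lasso.coef_ == 0))
```

The `alpha` values here are placeholders; in practice the penalty strength would be chosen by cross-validation (e.g. `LassoCV`/`RidgeCV`) rather than fixed by hand.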
Interpreting the Model
One of the main benefits of selecting a linear regression model is its interpretability. First, let's examine the home features that were most important to the model. Note that the larger the magnitude of a variable's coefficient in the model, the more influence it has on the sale price of the home.
Not surprisingly, total above-ground living area was the most influential value-add variable in the data set. Another feature which showed up numerous times (as both a value-add and as a detractor to home value) is the neighborhood. Tracking this back to the earlier boxplot (and recalling the mantra “location, location, location”) we see that our model, again not surprisingly, has indicated that location is a very important factor in determining sale price. It is worth cautioning against directly extracting a dollar value for each of these potential changes in a home.
Since many of these features are on a log scale (transformed earlier to reduce skew) and normalized (mean zero, unit variance), translating these coefficients into marginal increases in a home's price requires a calculation rather than a direct reading, but the fitted model makes that calculation straightforward.
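As a worked example of that calculation, suppose a feature was log-transformed and standardized and the target is log price. The numbers below are hypothetical, chosen only to show the arithmetic, not values from our fitted model:

```python
import numpy as np

# Hypothetical values for illustration; not the fitted model's actual numbers
beta = 0.10    # coefficient on standardized log(GrLivArea) in the log-price model
sigma = 0.35   # std. dev. of log(GrLivArea) in the training data

# Moving one standard deviation in log-area shifts log(price) by beta,
# i.e. multiplies the predicted price by exp(beta)
price_multiplier = np.exp(beta)   # a bit over 1.10, i.e. roughly a 10% increase

# Equivalently, in elasticity terms: a 1% increase in living area
# changes price by roughly (beta / sigma) percent
elasticity = beta / sigma
```

So a coefficient cannot be read as a dollar amount directly; it first has to be unwound through the standardization and the log scales on both sides.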
Next Steps
There are many next steps one could take to attempt to achieve better results. A sensible starting point would be to improve the results on the subset of homes the algorithm performed worst on. Below we plot our model's predicted home price against the actual home price (on the training set, of course, since we never know the true prices for the test set).
We can see that the model performed consistently worse for more expensive homes. This suggests that the effect of these variables on home price is not inherently linear, and that we would need to augment the data (e.g., via feature engineering) or the model in order to deal with these homes. We could also augment the data set with a general "market" indicator (something like the S&P 500 Index as a rough proxy for the value of assets).
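One way to quantify "worse on expensive homes" is to compare hold-out error across price quartiles. The sketch below uses synthetic predictions whose error grows with price, mimicking the pattern in our diagnostic plot:

```python
import numpy as np
import pandas as pd

# Synthetic illustration: prediction error proportional to price
rng = np.random.default_rng(2)
actual = np.sort(rng.uniform(100_000, 600_000, 400))
predicted = actual + rng.normal(0, actual * 0.05)   # noise scales with price

df = pd.DataFrame({"actual": actual, "predicted": predicted})
df["band"] = pd.qcut(df["actual"], 4, labels=["Q1", "Q2", "Q3", "Q4"])

# RMSE within each price quartile: a rising pattern confirms the
# model struggles most at the expensive end
df["sq_err"] = (df["actual"] - df["predicted"]) ** 2
rmse_by_band = np.sqrt(df.groupby("band", observed=True)["sq_err"].mean())
print(rmse_by_band)
```

If the per-quartile RMSE rises monotonically, targeting the top band with extra features or a nonlinear term is the natural next experiment.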