Using Data to Predict the Housing Market in Ames, Iowa
The skills the authors demonstrated here can be learned by taking the Data Science with Machine Learning bootcamp at NYC Data Science Academy.
Housing is, by the data, the fount of middle-class wealth accumulation. Owning a house is not only a dream for millions of Americans but one of the most expensive purchases they'll ever make. From these aspirations arises one of the most powerful engines of the modern economy: the US housing market. The collective fluctuation of its housing prices augurs good times and bad.
Applying machine learning to housing data is therefore a worthwhile task. We do so with the intent of predicting, as a prototypical simulation, the housing market in Ames, Iowa.
The code for this project can be found here.
Dataset
Our dataset describes the sales of 2,919 residential properties in Ames, Iowa from 2006 to 2010. Our exercise is twofold: to build a model with high predictive accuracy of sales prices; and to determine which, among the 79 explanatory variables given, have the most predictive power. By splitting our dataset in half, we obtained 1,460 observations on which to tune our model and 1,459 observations on which to test its accuracy.
Data Exploration
As we scanned our variables for outlying observations, we identified an egregious violation of the generally linear relationship between sale price and square footage.
Two of the largest houses were sold well below market price, likely reflecting "special relationships" between sellers and buyers (friends and family, perhaps). We removed these outliers to avoid skewing our predictions.
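A minimal sketch of that filter, using the dataset's GrLivArea and SalePrice columns; the cutoffs shown are illustrative rather than the precise values we used:

```python
import pandas as pd

train = pd.read_csv('train.csv')

# Drop the two oversized houses that sold far below the trend line.
# The thresholds are illustrative; any cutoff isolating those two points works.
outliers = train[(train['GrLivArea'] > 4000) & (train['SalePrice'] < 300000)].index
train = train.drop(outliers)
```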
Of especial concern are the null values in our data. There are 35 columns with missing data, with four different types of missingness.
Much of that missingness, in fact, reflects non-existence. That a house lacks a fireplace or a swimming pool is distinct from the usual notion of missingness, in which existing values go unrecorded due to collection error. Twenty-four columns document non-existent features, which we imputed with 'None' or 0 depending on the data type.
We imputed several categorical features via the mode, as the distribution of existing values in each was dominated by a single element. Other categoricals, with more evenly distributed values, were imputed randomly in proportion to the frequencies of the existing categories.
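A sketch of these fills, assuming a combined train-and-test feature frame df; the column lists are abbreviated examples rather than the full set:

```python
import numpy as np

# Non-existence: absence of a feature becomes 'None' (categorical) or 0 (numeric).
for col in ['PoolQC', 'Fence', 'MiscFeature', 'Alley', 'FireplaceQu']:   # abbreviated
    df[col] = df[col].fillna('None')
for col in ['GarageArea', 'GarageCars', 'TotalBsmtSF', 'MasVnrArea']:    # abbreviated
    df[col] = df[col].fillna(0)

# Mode imputation for categoricals dominated by a single value.
for col in ['Electrical', 'KitchenQual', 'SaleType']:                    # abbreviated
    df[col] = df[col].fillna(df[col].mode()[0])

# Proportional random imputation for more evenly distributed categoricals.
def impute_proportionally(s, seed=0):
    """Fill NAs by drawing from observed categories in proportion to their frequency."""
    rng = np.random.default_rng(seed)
    probs = s.value_counts(normalize=True)
    s = s.copy()
    mask = s.isna()
    s.loc[mask] = rng.choice(probs.index.to_numpy(), size=mask.sum(), p=probs.to_numpy())
    return s
```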
Lot frontage, the strip of land adjacent to public roads, is likely to scale with the size of a property in its given neighborhood. We imputed its missing values with the medians of the associated neighborhoods.
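With pandas, that neighborhood-level fill is a one-liner:

```python
# Replace missing lot frontage with the median frontage of the same neighborhood.
df['LotFrontage'] = df.groupby('Neighborhood')['LotFrontage'] \
                      .transform(lambda s: s.fillna(s.median()))
```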
We felt it reasonably safe to assume any existing garages with missing construction dates were built in the same years as their original houses.
Last, upon examining the Utilities column, we found only one training observation that deviated from the modal value, and none at all in the test set. We dropped this uninformative feature from our dataset.
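Both of these final clean-up steps are simple, sketched here on the same assumed frame df:

```python
# Garages with no recorded construction date are assumed built with the house.
df['GarageYrBlt'] = df['GarageYrBlt'].fillna(df['YearBuilt'])

# Utilities is constant in the test set and nearly constant in training; drop it.
df = df.drop(columns=['Utilities'])
```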
Feature Engineering
Although several features exist to list the square footage of various parts of the house, no single column expresses total square footage. This seemed a major oversight, and so we engineered such a feature.
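In code, the new feature might look like the following, assuming it combines basement area with first- and second-floor living area:

```python
# Total square footage: basement plus first- and second-floor living area.
df['TotalSF'] = df['TotalBsmtSF'] + df['1stFlrSF'] + df['2ndFlrSF']
```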
A cursory glance shows improved correlation to sale price over its component features.
In addition to total square footage, we added an annual inflation index to normalise house prices from different years.
A few numerical features are suspected to be ordinal rather than continuous. YrSold is one example: given the housing cycle, some years should make for more attractive times to buy or sell than others. To prepare these features for encoding, we converted them to string values.
We label encoded our ordinal features into numeric rankings of [0, 1, 2, ...]. For categorical features without intrinsic ordering, we chose one-hot encoding. Although this expanded our feature space considerably, to 221 features in total, we needed our variables to be coded numerically in order to proceed with modeling.
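A sketch of the encoding steps, with abbreviated column lists; we show an explicit quality-to-rank mapping in place of a generic label encoder so the intended ordering is visible, and pandas' get_dummies handles the nominal remainder:

```python
import pandas as pd

# Treat selected numeric columns as categories before encoding.
for col in ['YrSold', 'MoSold', 'MSSubClass']:                       # illustrative subset
    df[col] = df[col].astype(str)

# Ordinal categoricals become explicit numeric ranks.
quality_map = {'None': 0, 'Po': 1, 'Fa': 2, 'TA': 3, 'Gd': 4, 'Ex': 5}
for col in ['ExterQual', 'BsmtQual', 'KitchenQual', 'FireplaceQu']:  # illustrative subset
    df[col] = df[col].map(quality_map)

# Remaining nominal categoricals become one-hot dummy columns.
df = pd.get_dummies(df)
```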
We standardised our dataset to improve numeric stability and enable comparisons across multiple explanatory features.
Last, we identified all features with a distributional skewness greater than 1.5, as computed by the Fisher-Pearson coefficient of skewness. To maximise the predictive power of our model, we normalised the distribution of these features with Box-Cox transformations.
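A sketch of that transformation, applied to the numeric features (ideally before the dummy expansion above) using scipy's boxcox1p, the Box-Cox transform of 1 + x, which tolerates zero values; the λ shown is a common default rather than a tuned value:

```python
import numpy as np
from scipy.stats import skew
from scipy.special import boxcox1p

# Compute Fisher-Pearson skewness on the raw, non-negative numeric features.
numeric_cols = df.select_dtypes(include=[np.number]).columns
skewness = df[numeric_cols].apply(lambda s: skew(s.dropna()))
skewed_cols = skewness[skewness.abs() > 1.5].index

# Pull heavily skewed distributions toward normal with a Box-Cox transform.
for col in skewed_cols:
    df[col] = boxcox1p(df[col], 0.15)   # lambda = 0.15 is illustrative
```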
Modelling
We tuned a variety of models on our dataset, selecting for each the hyper-parameters that minimised the cross-validated root mean-squared error (RMSE); on our target scale, these scores fall between 0 and 1.0.
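Throughout the comparisons below, a small helper along these lines scores each candidate model; we assume the target y is the log-scaled sale price, which is what would keep the RMSE in that 0-to-1 range:

```python
import numpy as np
from sklearn.model_selection import KFold, cross_val_score

def rmse_cv(model, X, y, n_folds=5):
    """Cross-validated RMSE; y is assumed to be the log-scaled sale price."""
    kf = KFold(n_splits=n_folds, shuffle=True, random_state=42)
    scores = cross_val_score(model, X, y, scoring='neg_mean_squared_error', cv=kf)
    return np.sqrt(-scores).mean()
```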
1) Elastic-Net
Elastic-net is a form of regularised linear regression that naturally selects for significant features. It relies on an α-parameter, which directly penalises estimator coefficients, and an L1-ratio parameter, which balances the elimination of non-significant features against the mere shrinking of coefficient size. Tuning elastic-net gave us an RMSE of 0.1110.
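A sketch of the tuning, assuming a feature matrix X_train and target y_train from the preprocessing above; the grids are illustrative, not the exact values we searched:

```python
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV

# alpha scales the overall penalty; l1_ratio balances L1 (feature elimination)
# against L2 (coefficient shrinkage).
enet_search = GridSearchCV(
    ElasticNet(max_iter=10000),
    param_grid={'alpha': [0.0005, 0.001, 0.005, 0.01, 0.05],
                'l1_ratio': [0.1, 0.3, 0.5, 0.7, 0.9]},
    scoring='neg_mean_squared_error',
    cv=5,
)
enet_search.fit(X_train, y_train)
enet = enet_search.best_estimator_
```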
This is a relatively good result and evidence of how strongly our predictions depend on linear relationships. The central issue with linear regression, however, is its inability to capture the characteristics of other, nonlinear features.
2) Kernel Ridge Regression
We next turned to kernel ridge regression (KRR). The idea here is to employ a flexible set of nonlinear prediction functions, modulated by a penalty term to avoid overfitting. The kernel implicitly defines a higher-dimensional inner-product space in which relationships that are nonlinear in the original features become easier to identify as linear ones.
A choice of polynomial kernel allows us to generalise out from purely linear space while still retaining the linear option. After substantial tuning, however, we found no improvement over elastic-net, with an RMSE of 0.1115.
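In scikit-learn this amounts to something like the following; the parameter values are representative of the region we searched rather than the final choices:

```python
from sklearn.kernel_ridge import KernelRidge

# A polynomial kernel of modest degree; coef0 keeps a linear component in play.
krr = KernelRidge(alpha=0.6, kernel='polynomial', degree=2, coef0=2.5)
krr.fit(X_train, y_train)
print(rmse_cv(krr, X_train, y_train))
```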
3) Gradient Boosting
We resorted to non-parametric modeling. Gradient boosted decision trees (GBDT) are a powerful set of models that, by iterating over sequential subtrees to minimise a loss function, sort through the hierarchy of feature space to learn linear, nonlinear, and interactive features.
Stochastic gradient boosting adds the element of subsampling the training data, stochastically and without replacement, when fitting each base learner tree. The benefit of the stochastic process over normal gradient boosting is a reduction in variance, as the subtrees are decorrelated from one another. Tuning GBDTs is considerably more involved than tuning our previous models. Nevertheless, we found improvement, with an RMSE of 0.1086.
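A sketch with scikit-learn's GradientBoostingRegressor, where subsample < 1.0 supplies the stochastic element; the hyper-parameter values are illustrative:

```python
from sklearn.ensemble import GradientBoostingRegressor

# Each tree is fit on a random 80% of the training rows, drawn without replacement.
gbdt = GradientBoostingRegressor(
    n_estimators=3000, learning_rate=0.05,
    max_depth=4, subsample=0.8, random_state=42,
)
gbdt.fit(X_train, y_train)
```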
4) XGBoost
Taking our progress with decision trees a step further, we implemented XGBoost, which introduces a regularising γ-parameter to control the complexity of tree partition. The higher the γ, the larger the minimum loss threshold to split the tree at each leaf node.
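A sketch of the model, with gamma alongside the other usual knobs; the values shown are illustrative, not the tuned ones:

```python
import xgboost as xgb

# gamma is the minimum loss reduction required to split a leaf; larger values
# yield more conservative trees.
xgb_model = xgb.XGBRegressor(
    n_estimators=2000, learning_rate=0.05,
    max_depth=3, gamma=0.05,
    subsample=0.8, colsample_bytree=0.8,
)
xgb_model.fit(X_train, y_train)
```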
XGBoost gave an RMSE of 0.0487. This much superior result raised our suspicions, prompting us to apply the model to the test set, where it produced an RMSE of 0.1156. The large gap between the two errors made this a clear case of severe overfitting.
Given the breadth of hyperparameter space to search through and the high computational costs of doing so, we most assuredly did not find the optimal parameters for XGBoost. As we proceeded to our final step, in light of time constraints, we chose to put improvements on XGBoost aside for future consideration.
5) Model Ensembling
Model ensembling is a powerful technique for improving prediction. Much as how GBDTs iteratively improve on an ensemble of base learner trees, so can we stack multiple base models to achieve superior performance.
There are a few schools of thought on ensembling. The general theme is to incorporate the predictions of the base models as additional features into the training set, onto which a stacking model will fit.
This meta-model weighs the predictions in order to highlight the strengths of the respective base models and smooth out their weaknesses. A diverse selection of base models, sufficiently decorrelated as to capture linear, hierarchical, and nonlinear relationships in our data, is therefore desirable.
Which predictions to include as new meta-features is a matter of method. Test predictions generated by the models fit during cross-validation may be included; more thoroughly, so may test predictions from models refit on the entire training dataset.
We tune our meta-model for its optimal hyper-parameters by fitting over the complete set of meta-data, using the same cross-validation folds as those that generated our meta-features. In theory this should introduce some overfitting as a result of target leakage, since our meta-features are themselves derived from target values in the corresponding cross-validation folds; but in practice the effect is so minimal as to be negligible.
At the last, our meta-model makes a final test prediction.
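In outline, the procedure might look like the sketch below, in which out-of-fold predictions become the meta-features and a lasso serves as the meta-model; the fold count and lasso alpha are illustrative, and X_train, y_train, X_test are assumed to be NumPy arrays:

```python
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import Lasso
from sklearn.model_selection import KFold

def stack_meta_features(base_models, X, y, X_test, n_folds=5):
    """Out-of-fold predictions form the training meta-features; the test
    meta-features average each base model's per-fold test predictions."""
    kf = KFold(n_splits=n_folds, shuffle=True, random_state=42)
    meta_train = np.zeros((X.shape[0], len(base_models)))
    meta_test = np.zeros((X_test.shape[0], len(base_models)))
    for j, base in enumerate(base_models):
        fold_preds = np.zeros((X_test.shape[0], n_folds))
        for i, (tr_idx, val_idx) in enumerate(kf.split(X)):
            model = clone(base)
            model.fit(X[tr_idx], y[tr_idx])
            meta_train[val_idx, j] = model.predict(X[val_idx])
            fold_preds[:, i] = model.predict(X_test)
        meta_test[:, j] = fold_preds.mean(axis=1)
    return meta_train, meta_test

base_models = [enet, krr, gbdt]                     # tuned base learners from above
meta_train, meta_test = stack_meta_features(base_models, X_train, y_train, X_test)
meta_model = Lasso(alpha=0.0005).fit(meta_train, y_train)
stacked_pred = meta_model.predict(meta_test)
```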
Results & Feature Importance
Lasso regression was our choice of meta-model; elastic-net, KRR, and stochastic GBDT were its base models. Altogether we achieved an improved train RMSE of 0.0689 and a test RMSE of 0.1086.
In a final gambit, we averaged the results of our stacked meta-model with those from XGBoost, to test whether naively weighted diversification would improve our predictions. The weight we gave each model was proportioned to its test RMSE.
This did, indeed, result in our best official prediction score, which is computed by Kaggle on a slightly different error metric of RMSLE: 0.1202.
The top ten features from our model are as follows:
There is some redundancy in our rankings, as GrLivArea and TotalBsmtSF together roughly sum to TotalSF. On the whole, there are few surprises.
Sale prices are most heavily influenced by overall perceived quality and total square footage. The more attractive and well-kept the exterior facade, the higher the likely sale price. The year in which a house was built may determine its construction style or the likelihood that it remains well maintained. That latter issue of timely maintenance correlates strongly with any recent remodeling effort. And, of course, amenities such as garage space and the number of full bathrooms are important to the typical buyer.
We recommend that any aspiring real estate agent target the properties that maximally exploit these variables, at least to the extent that each neighborhood permits.
Future Work
We exercised a variety of statistical learning techniques, examined their efficacies, and designed a stacked model to boost predictive performance. Still, there are a few areas we can think of for improvement.
The selection, and tuning, of a meta-model and its constituents is more art than science. We could experiment with a greater breadth of algorithms to assemble more diverse base models, chosen because we know each captures decorrelated characteristics of our data.
Similarly, we might experiment with the choice of meta-model.
One non-standard model of interest is CatBoost. CatBoost has a reputation for convenient handling of, and highly accurate predictions on, categorical data. Significantly, it implements ordered boosting, training multiple models on ordered permutations of each stochastic subsample of the data in order to reduce bias in the gradient boosting procedure. This multiplication of models carries much larger computational training costs, but acceptably so for a small dataset such as ours.
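A sketch of what that might look like with the catboost package; X_train_raw and categorical_column_indices are our own placeholders for a version of the data that keeps its categorical columns unencoded:

```python
from catboost import CatBoostRegressor

# boosting_type='Ordered' selects the ordered-boosting scheme described above;
# cat_features lets CatBoost consume the categorical columns directly.
cat_model = CatBoostRegressor(iterations=2000, learning_rate=0.05, depth=6,
                              boosting_type='Ordered', verbose=False)
cat_model.fit(X_train_raw, y_train, cat_features=categorical_column_indices)
```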
Our reliance on decision trees such as stochastic GBDT, XGBoost, and the aforementioned Catboost results in enormous hyper-parameter search spaces. Even with intelligently iterative tuning, we are unlikely to be sure of finding the best set of hyper-parameters using basic grid and randomised search techniques.
We should, therefore, implement Bayesian optimisation (BO). BO starts from a naive prior over the search space, its surrogate function, and updates it with successive, carefully chosen samples in order to more accurately model the statistical reality of that space. Maximising an acquisition function, which directs us to the next-best sampling location given the updated surrogate, weighs exploration of regions where the surrogate is still uncertain against exploitation of regions it already predicts to be promising. In this way, BO tries to find the best hyper-parameters in the fewest possible steps.
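One accessible route is the scikit-optimize library, whose gp_minimize maintains a Gaussian-process surrogate and an acquisition function internally; the search space below is illustrative and reuses the rmse_cv helper sketched earlier:

```python
import xgboost as xgb
from skopt import gp_minimize
from skopt.space import Real, Integer

# Search space for a few XGBoost hyper-parameters (illustrative ranges).
space = [Real(0.01, 0.3, name='learning_rate'),
         Integer(2, 8, name='max_depth'),
         Real(0.0, 1.0, name='gamma')]

def objective(params):
    lr, depth, gamma = params
    model = xgb.XGBRegressor(n_estimators=1000, learning_rate=lr,
                             max_depth=int(depth), gamma=gamma)
    return rmse_cv(model, X_train, y_train)

# Each of the 50 evaluations updates the surrogate and re-maximises the acquisition.
result = gp_minimize(objective, space, n_calls=50, random_state=42)
print(result.x, result.fun)
```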
Our reliance on one-hot encoding severely dilutes the informative value of our encoded categories. Our models would benefit from alternate encoding strategies that limit this expansion of the feature space. With such an encoding, one would expect neighborhood, for instance, to surface among the top determinants of sale price.
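One simple alternative, sketched below on assumed train and test DataFrames, is target-mean encoding of Neighborhood: each neighborhood is replaced by the average sale price observed for it in the training data (the column name Neighborhood_enc is our own placeholder, and a fold-wise variant would guard better against target leakage):

```python
# Map each neighborhood to its mean training-set sale price.
neigh_means = train.groupby('Neighborhood')['SalePrice'].mean()
train['Neighborhood_enc'] = train['Neighborhood'].map(neigh_means)
test['Neighborhood_enc'] = test['Neighborhood'].map(neigh_means)
```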
Last, and perhaps of greatest importance, are feature interactions. We suspect there are significant features, strongly correlated with sale price, still to be engineered, but doing so requires domain knowledge or corresponding data from outside sources. Not being domain experts ourselves, we would need considerable time and effort to discover them.