Machine Learning Project: Ames Housing Dataset
Introduction and Background
The Ames Housing Dataset was introduced by Professor Dean De Cock in 2011 as an alternative to the Boston Housing Dataset (Harrison and Rubinfeld, 1978). It contains 2,919 observations of housing sales in Ames, Iowa between 2006 and 2010. There are 23 nominal, 23 ordinal, 14 discrete, and 20 continuous features describing each house’s size, quality, area, age, and other miscellaneous attributes. For this project, our objective was to apply machine learning techniques to predict the sale price of houses based on their features.
In order to get a better understanding of what we were working with, we started off with some exploratory data analysis. We quickly found that overall material and quality (OverallQual, discrete) and above ground square footage (GrLivArea, continuous) had the strongest relationship with sale price. Using a widget (shown below), we then examined the remaining features to get a sense of which were significant and how we would be able to feed them into linear and tree-based machine learning algorithms. 53 of the original features were kept in some fashion.
For categorical features, the majority were handled using one-hot encoding, with minority classes below a certain number of observations being excluded. Certain discrete features were converted to binary indicators when we found their presence to be more impactful than their frequency (e.g., number of fireplaces → is there a fireplace?). Given the range of some of the continuous features in the data, we found it useful to apply log transformations where appropriate, such as to each house's lot size (LotArea). Lastly, there were some special cases like the self-explanatory YearBuilt feature (figure below). We found no meaningful relationship with sale price for values prior to 1950, so we clipped the feature at that minimum.
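These transformations can be sketched in pandas. The toy values and the rare-category cutoff below are illustrative, not the actual thresholds we used; only the feature names come from the Ames data dictionary.

```python
import numpy as np
import pandas as pd

# Toy frame with Ames-style feature names; values are made up for illustration.
df = pd.DataFrame({
    "Neighborhood": ["NAmes", "NAmes", "NAmes", "Blueste"],
    "Fireplaces":   [0, 1, 2, 0],
    "LotArea":      [8450, 9600, 11250, 4920],
    "YearBuilt":    [1920, 1976, 2001, 1939],
})

# One-hot encode, first lumping rare categories into an "Other" bucket.
MIN_OBS = 2  # illustrative cutoff
counts = df["Neighborhood"].map(df["Neighborhood"].value_counts())
df["Neighborhood"] = df["Neighborhood"].where(counts >= MIN_OBS, "Other")
dummies = pd.get_dummies(df["Neighborhood"], prefix="Neighborhood")

# Presence over frequency: number of fireplaces -> is there a fireplace?
df["HasFireplace"] = (df["Fireplaces"] > 0).astype(int)

# Log-transform the wide-ranging LotArea.
df["LogLotArea"] = np.log1p(df["LotArea"])

# No meaningful price relationship before 1950, so clip there.
df["YearBuilt"] = df["YearBuilt"].clip(lower=1950)
```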
After this initial round of pre-processing and deciding to remove two outliers (above-grade square footage greater than 4,000 sq. ft. but sale price below $200k), we were well on our way to making our first set of models!
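The outlier filter itself is a one-line boolean mask. The rows below are illustrative stand-ins for the two flagged houses:

```python
import pandas as pd

# Illustrative rows; two mimic the huge-but-cheap houses we removed.
df = pd.DataFrame({
    "GrLivArea": [1710, 4676, 2198, 5642],
    "SalePrice": [208_500, 184_750, 250_000, 160_000],
})

# Drop houses over 4,000 sq. ft. above grade that nonetheless sold under $200k.
outliers = (df["GrLivArea"] > 4000) & (df["SalePrice"] < 200_000)
clean = df.loc[~outliers]
```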
This problem lends itself well to linear regression. In fact, we can draw a simple regression line between above grade square feet and sale price that explains 54% of variance in sale price! This model produces a cross-validation error of 0.273 in terms of Root Mean Squared Logarithmic Error (RMSLE).
As a quick side note, we chose to use RMSLE for model evaluation in order to match the scoring metric of the Kaggle competition. RMSLE 'standardizes' prediction errors between cheap and expensive houses so that we are not incentivized to build a model that predicts better (on a percentage basis) for expensive homes than cheaper homes. Practically, it also makes our cross-validation results a more accurate indicator of Kaggle scores.
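The metric is just RMSE computed on log-transformed targets. A quick sketch shows the 'standardizing' property: a 10% overshoot costs nearly the same whether the house is cheap or expensive, whereas plain RMSE would penalize the expensive house five times as hard.

```python
import numpy as np

def rmsle(y_true, y_pred):
    """Root Mean Squared Logarithmic Error: RMSE computed on log(1 + y)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2))

# Same 10% overshoot on a $100k house and a $500k house.
err_cheap  = rmsle([100_000], [110_000])
err_pricey = rmsle([500_000], [550_000])
```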
While we can obviously do better than a one-variable model, the simplistic case highlights an issue that we will need to account for in linear regression. Inspecting the residual plot, we can see a classic case of 'fanning' residuals. This violates one of the key assumptions of linear models: that the error term have constant variance (homoscedasticity).
The underlying issue is non-normal distributions of both sale price and above grade square feet.
Applying the Box-Cox transformation to both variables results in much more normal distributions, and a regression on the transformed variables produces a much better-behaved error plot. Surprisingly, this improvement is not associated with a reduction in model error in the case of simple linear regression. However, the Box-Cox transformation does yield a massive reduction in error when employed for multiple linear regression models.
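SciPy's `boxcox` fits the transformation's lambda parameter by maximum likelihood. The sketch below uses synthetic lognormal data as a stand-in for sale prices (so the fitted lambda should land near 0, i.e. close to a plain log transform), and checks that skewness shrinks:

```python
import numpy as np
from scipy import stats

# Synthetic right-skewed stand-in for sale prices.
rng = np.random.default_rng(42)
prices = rng.lognormal(mean=12, sigma=0.4, size=1000)

# boxcox returns the transformed data and the MLE of lambda.
transformed, lam = stats.boxcox(prices)

skew_before = stats.skew(prices)       # strongly right-skewed
skew_after = stats.skew(transformed)   # close to symmetric
```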
Of course, we can start to improve our linear predictions by incorporating the influence of additional explanatory variables. There are strong (and obvious) relationships between some of the explanatory variables, such as above grade square feet and above grade rooms, and lot area and lot frontage. Regularization techniques will be critical for controlling for this multicollinearity.
Indeed, elastic net regularization reduces cross-validation RMSLE from 0.251 to 0.118. Elastic net, ridge, and lasso all performed equally well in cross-validation.
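Scikit-Learn's `ElasticNetCV` cross-validates both the penalty strength and the L1/L2 mix in one pass. The sketch below uses synthetic correlated features (standing in for pairs like square footage vs. room count); the data and grid are illustrative, not our actual pipeline:

```python
import numpy as np
from sklearn.linear_model import ElasticNetCV
from sklearn.preprocessing import StandardScaler

# Synthetic correlated features to mimic the multicollinearity in the data.
rng = np.random.default_rng(0)
sqft = rng.normal(1500, 400, size=300)
rooms = sqft / 250 + rng.normal(0, 0.5, size=300)  # strongly tied to sqft
noise = rng.normal(size=300)                        # irrelevant feature
X = np.column_stack([sqft, rooms, noise])
y = np.log(50_000 + 100 * sqft + rng.normal(0, 10_000, size=300))  # log price

# Cross-validated search over the L1/L2 mix (l1_ratio) and penalty strength.
model = ElasticNetCV(l1_ratio=[0.1, 0.5, 0.9, 1.0], cv=5)
model.fit(StandardScaler().fit_transform(X), y)
```

Standardizing before fitting matters here: elastic net's penalty is scale-sensitive, so unscaled features would be shrunk unevenly.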
The variable coefficients from elastic net confirmed our insights from exploratory data analysis. House size and quality seem to be the most important variables for determining sale price.
Going beyond linear regression, we next tried fitting our data to tree-based models. The simple decision tree below, with a maximum depth of 3, gives an idea of the features on which our models could split and the breakdown of how our data might be divided:
We initially fed our data to a straightforward decision tree regressor from the Scikit-Learn Python package. The feature importance plot is shown below:
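A minimal sketch of that first step, on synthetic stand-in features (an OverallQual-like score, a GrLivArea-like size, and a pure-noise column). Scikit-Learn normalizes `feature_importances_` to sum to one, which is what the plots above show:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-ins: price driven mostly by "quality" and "sqft".
rng = np.random.default_rng(0)
X = np.column_stack([
    rng.integers(1, 11, size=400),    # OverallQual-like score, 1-10
    rng.normal(1500, 400, size=400),  # GrLivArea-like size
    rng.normal(size=400),             # irrelevant noise feature
])
y = 20_000 * X[:, 0] + 80 * X[:, 1] + rng.normal(0, 10_000, size=400)

tree = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, y)
importances = tree.feature_importances_  # normalized to sum to 1
```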
While these last two are almost trivial examples, they help us get a sense of the features that might be important for this class of models as well as visualize the importance of the hyperparameters we can tune for our tree-based learners. So far, we see that the features we hypothesized to be important, from our exploratory data analysis and feature engineering, are in fact significant. We do not spend much time on the untuned decision tree model, even though it resulted in a 0.141 RMSLE training score and a cross-validation RMSLE score of 0.181; these scores are better than we expected, and they most likely are the result of our extensive feature engineering. Nevertheless, we move on to a random forest model, where we can tune hyperparameters for the number of trees, the maximum tree depth, the maximum number of features considered at a split, the minimum number of samples required to make a split, and the minimum number of samples required at a node. The feature importance plot for our random forest is shown below:
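The tuning loop for those five hyperparameters can be sketched with a grid search. The grid below is a deliberately tiny illustration, not the grid we actually searched, and the data is synthetic:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic positive "prices" so the log-error scorer is well defined.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = np.exp(12 + 0.3 * X[:, 0] + 0.1 * X[:, 1] + 0.05 * rng.normal(size=200))

# The five hyperparameters we tuned, with an illustrative mini-grid.
param_grid = {
    "n_estimators": [30, 60],
    "max_depth": [None, 8],
    "max_features": ["sqrt", 1.0],
    "min_samples_split": [2, 5],
    "min_samples_leaf": [1, 3],
}
search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid,
    scoring="neg_mean_squared_log_error",  # negate and sqrt to get RMSLE
    cv=3,
)
search.fit(X, y)
cv_rmsle = np.sqrt(-search.best_score_)
```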
The tuned random forest resulted in a training RMSLE of 0.122 and a cross-validation RMSLE of 0.131. This score is much better than the single decision tree's, in large part because a random forest reduces the variance (overfitting) one sees when working with only one tree. However, a random forest has higher bias than a single decision tree. Each tree is trained on only a bootstrap sample of the training data, which naturally raises the bias of each tree. Additionally, the random forest algorithm limits the number of features on which the data can be split at each node, which in turn limits the number of variables with which the data can be explained, inducing higher bias. In an attempt to lower the bias, we next look to a gradient boosted tree-based model; the plots below show the difference between the tuned and untuned feature importance for this model.
Even with the untuned model, the cross-validation RMSLE is 0.116 (training RMSLE 0.037), which is already better than the random forest. Boosting reduces the model's bias (underfitting): each iteration focuses on the examples the previous trees predicted poorly and tries to model them better. One can see the importance of tuning hyperparameters, though, when looking at the important features from the untuned model. Some potentially collinear features, such as LotArea vs. LotFrontage and GarageYrBuilt vs. YearBuilt, show up, while some features shown to be important in our previous models, such as OverallQual and OverallCond, are absent. Yet by sequentially tuning the number of trees, the max depth, the min samples for a split, the min samples for a leaf, and then increasing the number of trees while simultaneously decreasing the learning rate, we see fewer "redundant" features and more relevant features (OverallCond, OverallQual, Functional) with high importance for our model. The tuned model is the best individual model we trained, with a training RMSLE of 0.082 and a cross-validation RMSLE of 0.112. Another visualization of the difference between the tuned and untuned models is shown below.
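The final step of that sequential tuning, lowering the learning rate while growing the ensemble, can be sketched as follows. The settings and synthetic data are illustrative; the idea is to keep the learning_rate × n_estimators "budget" roughly constant while letting each tree make a smaller correction:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score

# Synthetic positive "prices"; we fit on log(1 + y) so RMSE becomes RMSLE.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = np.exp(12 + 0.3 * X[:, 0] + 0.2 * X[:, 1] + 0.05 * rng.normal(size=300))

coarse = GradientBoostingRegressor(learning_rate=0.1, n_estimators=100,
                                   random_state=0)
fine = GradientBoostingRegressor(learning_rate=0.02, n_estimators=500,
                                 random_state=0)

def cv_rmsle(est):
    # Mean cross-validated MSE on log prices, square-rooted ~ RMSLE.
    mse = -cross_val_score(est, X, np.log1p(y),
                           scoring="neg_mean_squared_error", cv=3).mean()
    return np.sqrt(mse)

rmsle_coarse = cv_rmsle(coarse)
rmsle_fine = cv_rmsle(fine)
```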
Despite being pleased with the performance of our standalone models, we thought it would be interesting to use this opportunity to explore some ensembling methods, with the hope that by combining several strong standalone models, we could produce a meta-model that is a better overall predictor.
Up to this point, our highest performing models, as judged by the Kaggle Public Leaderboard (PL), were an ElasticNet regression model (PL RMSLE 0.121), a tuned gradient boosting machine (PL RMSLE 0.122), and a tuned random forest model (PL RMSLE 0.145). Our initial approach was to average the predictions from our three top models, giving equal weight to each. Interestingly, our score did not improve. However, we did see an improvement once we dropped the weakest link (the random forest) and gave equal weight to the ElasticNet and the gradient boosting machine. This produced our strongest model up to this point (PL RMSLE 0.118). We attribute the improvement to the increased diversity of our ensemble that results from dropping the second tree-based model.
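The blending itself is just a weighted average of the models' predictions. The arrays below are hypothetical predictions for five houses, purely to show the mechanics of the two blends we tried:

```python
import numpy as np

# Hypothetical predictions from the three standalone models for five houses.
pred_enet = np.array([210_000, 145_000, 320_000, 180_000, 255_000], dtype=float)
pred_gbm  = np.array([205_000, 150_000, 310_000, 175_000, 260_000], dtype=float)
pred_rf   = np.array([190_000, 160_000, 290_000, 185_000, 240_000], dtype=float)

# First attempt: equal-weight average of all three models.
blend_three = (pred_enet + pred_gbm + pred_rf) / 3

# Improved blend: drop the random forest, equal-weight the remaining two.
blend_two = 0.5 * pred_enet + 0.5 * pred_gbm
```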
In our final model, we decided to explore stacking. We included each of our three top standalone models (ElasticNet, Gradient Boosting Machine, and Random Forest) as base learners and chose a linear regression as our meta-model.
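Scikit-Learn's `StackingRegressor` is one way to wire this up: it generates out-of-fold predictions from each base learner and fits the meta-model on those. The hyperparameters and synthetic log-price data below are illustrative, not the tuned settings we used:

```python
import numpy as np
from sklearn.ensemble import (GradientBoostingRegressor, RandomForestRegressor,
                              StackingRegressor)
from sklearn.linear_model import ElasticNet, LinearRegression

# Synthetic log prices as a stand-in target.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 12 + 0.3 * X[:, 0] + 0.2 * X[:, 1] + 0.05 * rng.normal(size=200)

# Base learners feed out-of-fold predictions to a linear meta-model.
stack = StackingRegressor(
    estimators=[
        ("enet", ElasticNet(alpha=0.01)),
        ("gbm", GradientBoostingRegressor(random_state=0)),
        ("rf", RandomForestRegressor(n_estimators=100, random_state=0)),
    ],
    final_estimator=LinearRegression(),
    cv=5,
)
stack.fit(X, y)
preds = stack.predict(X)
```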
As judged by the Public Leaderboard, this resulted in our top model! The RMSLE was 0.117, which as of the time of this publication was in the top 15% of submissions.
Our largest takeaway from working with the Ames Housing Dataset was the value in careful, thoughtful feature engineering. We attribute the strong performance of our model to the time we put into this phase. If we were to continue working with this dataset, we would explore both the applicability and effectiveness of principal component analysis and multiple correspondence analysis in reducing dimensionality. It would be interesting, as well, to explore the effectiveness of different, strategic tuning parameters in our stacked models.