Studying Regression Model Efficacy on the Ames Housing Data Set
The Ames Housing Data Set is popular largely due to its use as an introductory competition on the data science competition website Kaggle. In the competition, users are challenged to minimize Mean Squared Log Error (MSLE) on a test set whose target values are withheld from the publicly available data set. This data set is particularly valuable because it has features that make both linear regression models and tree-based regression models appealing. Here we study the efficacy of different models applied to this problem and discuss their tradeoffs.
The Ames Housing data set concerns housing sales in Ames, Iowa with 80 features used to predict sale price. These features can be further divided into nominal, ordinal, and numerical. We provide a summary of these variables below but further information about what the variable names correspond to can be found here.
Kaggle, for the purposes of the competition, provides a training data set of 1460 house sales described by the aforementioned 80 features as well as their corresponding sale price. A corresponding test data set of 1459 observations across the 80 features is also provided, with sale price withheld so that contestants' predictions can be evaluated by Kaggle for scoring. This division defines the classic train/test split used by data scientists to assess model performance. A model with high variance will perform well on the training data set and under-perform on the provided test set. Theoretically, it is possible to fit a model that perfectly describes the training data with zero error (evaluated by MSLE) as long as no two observations share identical feature values but differ in the target variable. However, this model will perform poorly on the testing data because the underlying relationship between the sale price and the independent features is not captured by the model. Instead, an over-fit model is an encoding that maps the feature space of the training data directly onto the observed sale prices in the training data with no regard for the testing data. Consequently, contestants are required to properly bias their model to reduce over-fitting to the training data set, so that it harnesses the information in the features that describes trends also present in the testing data and yields good predictions, defined by low error in Kaggle's evaluation.
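To make the scoring metric concrete, here is a minimal sketch of MSLE and its square root. It assumes the plain natural logarithm of the prices; the function names and the example prices are ours, for illustration only.

```python
import math

def msle(actual, predicted):
    """Mean squared error between log(actual) and log(predicted)."""
    squared = [(math.log(a) - math.log(p)) ** 2 for a, p in zip(actual, predicted)]
    return sum(squared) / len(squared)

def rmsle(actual, predicted):
    """Square root of MSLE, on the same scale as the log prices."""
    return math.sqrt(msle(actual, predicted))

# Hypothetical prices: both predictions are high by the same 10% ratio.
actual = [100_000.0, 400_000.0]
predicted = [110_000.0, 440_000.0]
print(msle(actual, predicted), rmsle(actual, predicted))
```

Because the error is measured on logarithms, a 10% miss on a cheap house costs the same as a 10% miss on an expensive one.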
Here we will discuss this data set, the modifications applied to it, and the impact of both on the outcomes of the models we fit with a variety of regression techniques.
The classic approach to regression problems is linear regression, a centuries-old technique that uses simple lines to predict values from input variables. In order for linear models to work well, the underlying relationship between the dependent and independent variables must be linear. Interestingly, the relationship between the features and the sale price target variable is strongly linear, which can be inferred from this plot:
The conic pattern is due to the underlying dependence on the neighborhood the house is in. The facet plot below demonstrates that, once neighborhood is accounted for, this relationship is strongly linear. Unfortunately, basic linear models do not account well for this behavior, which is better addressed by tree-based regression models.
Knowing that there is a linear relationship, the next question is which variables most strongly correlate with the response. These variables are our strongest predictors for a linear model. Graphically, this can be captured by a correlation heat map.
Deep colors on the bottom row show strong linear predictors for the target variable. We immediately see that "OverallQual" (overall quality, an integer categorical variable) and "GrLivArea" (above ground living area square footage) have the strongest correlations with "SalePrice". Additionally, we see that some independent variables are highly correlated with one another, violating assumptions of linear regression. The most important examples are the year the garage was built, the year the house was remodeled, and the year the house was built. The strong correlation will make it harder to harness the information contained in these variables. Regularization could also simply eliminate strongly correlated variables through feature selection. Finally, we can locate variables that are likely completely degenerate, particularly the number of cars that fit in the garage and the square footage of the garage.
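The two readings of the heat map described above (strong predictors of the target, and predictors that are collinear with each other) can also be extracted numerically. The following is a rough sketch on synthetic data; the helper names and the 0.8 threshold are our own choices, and the generated values are made up rather than taken from the Ames data.

```python
import numpy as np
import pandas as pd

def rank_linear_predictors(df, target="SalePrice"):
    """Correlation of each numeric feature with the target, strongest first."""
    corr = df.select_dtypes("number").corr()[target].drop(target)
    return corr.reindex(corr.abs().sort_values(ascending=False).index)

def collinear_pairs(df, target="SalePrice", threshold=0.8):
    """Pairs of features whose absolute correlation exceeds the threshold."""
    corr = df.select_dtypes("number").drop(columns=[target]).corr().abs()
    cols = list(corr.columns)
    return [(a, b) for i, a in enumerate(cols) for b in cols[i + 1:]
            if corr.loc[a, b] > threshold]

# Synthetic stand-in for the real data (illustrative values only).
rng = np.random.default_rng(0)
n = 200
year_built = rng.integers(1900, 2010, size=n)
toy = pd.DataFrame({
    "GrLivArea": rng.normal(1500, 400, size=n),
    "YearBuilt": year_built,
    "GarageYrBlt": year_built + rng.integers(0, 5, size=n),
})
toy["SalePrice"] = (100 * toy["GrLivArea"] + 500 * (toy["YearBuilt"] - 1900)
                    + rng.normal(0, 10_000, size=n))

ranking = rank_linear_predictors(toy)
pairs = collinear_pairs(toy)
print(ranking)
print(pairs)
```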
Missingness and Imputation
Large quantities of missing data are accounted for in the documentation associated with the data set. Standard missing values indicate that the feature is not present in the house. In the plot below, it is apparent that the missingness in the basement variables is correlated and corresponds in reality to houses lacking a basement. The same can be said for variables denoting the presence of an alley bordering the house, a pool, and fireplaces. These values can simply be marked missing, or the missingness can be represented by an additional boolean feature.
Numerical variables that are not associated with the previously mentioned categorical variables were rarely missing. The greatest exception to that trend is the "LotFrontage" variable, which measures the length of the property's border with the street. It is unclear whether this missingness is intentional or indicative of a deeper pattern. These values were median imputed per neighborhood along with an additional boolean feature indicating that they were imputed. The boolean feature for missingness will be handled well by tree models, which are sensitive to nonlinear behavior that may be associated with this quantity.
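A minimal sketch of the per-neighborhood median imputation with an indicator column might look like the following. The indicator name "LotFrontage_imputed", the fallback to the overall median, and the toy values are assumptions for illustration.

```python
import numpy as np
import pandas as pd

def impute_lot_frontage(df):
    """Fill missing LotFrontage with the neighborhood median and flag the filled rows."""
    out = df.copy()
    # Hypothetical column name for the boolean missingness feature.
    out["LotFrontage_imputed"] = out["LotFrontage"].isna()
    by_neighborhood = out.groupby("Neighborhood")["LotFrontage"].transform("median")
    # Assumed fallback: overall median if a whole neighborhood is missing.
    overall = out["LotFrontage"].median()
    out["LotFrontage"] = out["LotFrontage"].fillna(by_neighborhood).fillna(overall)
    return out

toy = pd.DataFrame({
    "Neighborhood": ["NAmes", "NAmes", "NAmes", "OldTown", "OldTown"],
    "LotFrontage": [60.0, np.nan, 80.0, 50.0, np.nan],
})
filled = impute_lot_frontage(toy)
print(filled)
```

In a real train/test workflow the medians would normally be computed on the training data and reused for the test data.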
The target variable used for fitting is not the sale price but the logarithm of the sale price. This operation serves to normalize the response variable distribution as well as reduce the effect of large sale prices on the fit value. The scoring metric likewise depends not on the error in the sale price but on the error in its logarithm. In the below plot of the distribution before transformation, the sale price is clearly strongly right-skewed, which is effectively corrected by the log-transform. The log sale price was finally standardized for model training.
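The target treatment can be sketched as a pair of forward and inverse transforms, so that model outputs can be mapped back to dollars for submission. The function names and example prices below are illustrative assumptions.

```python
import numpy as np

def fit_target_scaler(prices):
    """Mean and standard deviation of the log prices."""
    log_prices = np.log(np.asarray(prices, dtype=float))
    return log_prices.mean(), log_prices.std()

def to_model_space(prices, mean, std):
    """Log-transform, then standardize."""
    return (np.log(np.asarray(prices, dtype=float)) - mean) / std

def to_dollars(z, mean, std):
    """Invert the standardization and the logarithm."""
    return np.exp(np.asarray(z, dtype=float) * std + mean)

# Hypothetical right-skewed prices.
prices = np.array([90_000.0, 150_000.0, 210_000.0, 755_000.0])
mean, std = fit_target_scaler(prices)
z = to_model_space(prices, mean, std)
recovered = to_dollars(z, mean, std)
print(z, recovered)
```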
Some features encoded as strings reflect ordinal rankings. These features were mapped to numerical rankings which could be fed into all models used directly without one-hot encoding. This process could negatively impact the linear models if the relationship were not linear, while speeding up training of tree models due to the reduced dimensionality from avoiding one-hot encoding. This mapping is described in the table below.
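As a sketch of how such a string-to-rank mapping could be applied, the example below encodes a quality column that uses the codes from the data documentation (Po, Fa, TA, Gd, Ex). The specific integer values, and the choice of 0 for a missing feature, are assumptions and not necessarily the values we used.

```python
import pandas as pd

# Assumed ranking: higher means better quality.
QUALITY_SCALE = {"Po": 1, "Fa": 2, "TA": 3, "Gd": 4, "Ex": 5}

def encode_quality(series, scale=QUALITY_SCALE, missing=0):
    """Map quality codes to integers; absent features get the `missing` rank."""
    return series.map(scale).fillna(missing).astype(int)

toy = pd.Series(["TA", "Ex", None, "Gd", "Po"], name="BsmtQual")
encoded = encode_quality(toy)
print(encoded.tolist())
```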
Additional boolean variables were created for the presence of a basement, a second floor, and whether the house was being sold new. Additional features were built as linear combinations of other variables, inspired by statistics often used to describe houses for sale, including total square footage and total number of bathrooms (in half-integers). Features describing sale date were transformed so that temporal variation could be modeled sinusoidally to reflect periodicity over months.
As previously mentioned, the year the house was built is highly correlated with both the year the garage was built and the year the house was remodeled. In fact, year built effectively places a lower bound on the latter two variables. In order to capture the information in these variables, two new variables were created as the difference between the year of garage construction or remodel addition and the year of house construction. The derived variables capture a linear correlation with the sale price of the house at the expense of behavior around zero on the domain, corresponding to construction in the same year as the original house. While proper decision boundaries for tree models can model this behavior, standard linear models are incapable of accommodating deviations from linearity.
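A rough sketch of these engineered features follows. The new column names ("MoSold_sin", "MoSold_cos", "TotalBath", "HasBasement") are hypothetical, and the exact combinations we used may differ; the input columns follow the data documentation.

```python
import numpy as np
import pandas as pd

def add_engineered_features(df):
    """Add cyclical month features, a bathroom total, and a basement flag."""
    out = df.copy()
    # Place the 12 months on a circle so that December sits next to January.
    angle = 2 * np.pi * (out["MoSold"] - 1) / 12
    out["MoSold_sin"] = np.sin(angle)
    out["MoSold_cos"] = np.cos(angle)
    # Half baths count as 0.5, giving totals in half-integers.
    out["TotalBath"] = out["FullBath"] + 0.5 * out["HalfBath"]
    out["HasBasement"] = out["TotalBsmtSF"] > 0
    return out

toy = pd.DataFrame({
    "MoSold": [1, 2, 12],
    "FullBath": [2, 1, 1],
    "HalfBath": [1, 0, 1],
    "TotalBsmtSF": [850, 0, 600],
})
features = add_engineered_features(toy)
print(features)
```

With this encoding the distance between December and January equals the distance between January and February, which a raw month number cannot express.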
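The year-difference features can be sketched in a few lines. The output column names are hypothetical, and treating a missing garage year as a zero offset is an assumption made here only to keep the example complete.

```python
import pandas as pd

def add_year_offsets(df):
    """Replace correlated year columns with offsets from the construction year."""
    out = df.copy()
    out["RemodelLag"] = out["YearRemodAdd"] - out["YearBuilt"]
    # Assumption: houses with no garage year get an offset of zero.
    out["GarageLag"] = (out["GarageYrBlt"] - out["YearBuilt"]).fillna(0)
    return out

toy = pd.DataFrame({
    "YearBuilt": [1960, 2005, 1920],
    "YearRemodAdd": [1995, 2005, 1950],
    "GarageYrBlt": [1975.0, 2005.0, None],
})
offsets = add_year_offsets(toy)
print(offsets[["RemodelLag", "GarageLag"]])
```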
Finally, categorical features were one-hot encoded for compatibility with the models used. Numerical variables were power-transformed and robust-scaled for normalization and resistance to the effects of outliers. The code used for the treatment described in this section can be found in this GitHub repository.
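One way to express this final step is a scikit-learn ColumnTransformer. This is a sketch under our own assumptions (default Yeo-Johnson power transform, default robust scaling, synthetic data), not a copy of the repository code.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, PowerTransformer, RobustScaler

def build_preprocessor(numeric_cols, categorical_cols):
    """Power-transform and robust-scale numeric columns; one-hot encode the rest."""
    numeric = make_pipeline(PowerTransformer(), RobustScaler())
    categorical = OneHotEncoder(handle_unknown="ignore")
    return ColumnTransformer(
        [("num", numeric, numeric_cols), ("cat", categorical, categorical_cols)],
        sparse_threshold=0,  # always return a dense array
    )

# Synthetic stand-in data (illustrative values only).
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "GrLivArea": rng.lognormal(7.3, 0.3, size=60),
    "LotArea": rng.lognormal(9.1, 0.5, size=60),
    "MSZoning": np.tile(["RL", "RM", "FV"], 20),
})
pre = build_preprocessor(["GrLivArea", "LotArea"], ["MSZoning"])
X = pre.fit_transform(toy)
print(X.shape)
```

Fitting the transformer on the training data only, and reusing it on the test data, keeps test-set information out of the scaling parameters.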
Different models select feature importance based on how well the model is able to capture the relationship between the feature and the target variable. Below is a feature importance computed from a random forest model on the engineered data set.
Notably, the "OverallQual" variable, square footage variables, and garage size variables that are strong predictors for linear models are absent from this table. This works both ways: tree models struggle to capture simple linear relationships that explain large quantities of response variance, although more complex ensemble models can eventually do so by perturbatively approaching linear boundaries at the expense of potential over-fitting. Linear models, in turn, describe categorical features poorly and fail to capture hierarchical information, because these features do not correlate linearly with the response or because their effect depends on other feature variables.
The models presented here, with the exception of KNN, are presented with the full set of features and allowed to perform feature selection on their own. Anecdotally, the regularization provided by the models following effective hyperparameter searches with cross-validation determines feature inclusion and importance more rapidly than human selection, despite the increase in model training time.
Regularized Linear Regression (ElasticNet)
ElasticNet is a simple form of regularized linear regression that combines L1 and L2 penalties. Two hyperparameters define the model: alpha for regularization magnitude and the L1 ratio that controls the tendency to eliminate variables over just reducing the magnitudes of fitted coefficients. In fitting an ElasticNet model to our data, we found experimentally that both the regularization magnitude and L1 ratio are small in an optimal model. This reflects the wealth of observations available to fit a relatively simple model, compared with the dimensionality of the feature space. Additionally, performance may be gained by engineering more features, to the extent that the model must then be regularized more heavily. This is inherently limited by the breadth of the observed search space, but the strengths of this model are the rapid fitting process and the low dimensionality of the hyperparameter search space. Consequently, it is easy to have confidence in the hyperparameter choice after a relatively short search.
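A minimal sketch of a cross-validated search over both hyperparameters, using scikit-learn's ElasticNetCV on synthetic data, is shown here. The L1-ratio grid, fold count, and data are assumptions for illustration and are not the values from our search.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNetCV

# Synthetic regression problem standing in for the engineered feature matrix.
X, y = make_regression(n_samples=200, n_features=20, n_informative=5,
                       noise=5.0, random_state=0)

l1_grid = [0.1, 0.5, 0.9]  # assumed candidate L1 ratios
model = ElasticNetCV(l1_ratio=l1_grid, cv=5, max_iter=10_000)
model.fit(X, y)  # alpha is searched along an automatic path for each L1 ratio

n_kept = int(np.sum(model.coef_ != 0))
print(model.alpha_, model.l1_ratio_, n_kept)
```

The fitted `alpha_` and `l1_ratio_` are the two quantities discussed above, and the count of nonzero coefficients shows how much feature elimination the L1 part performed.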
For this particular problem, as has been previously indicated, linear models are particularly powerful because of the strong linear correlation between the response and the predictive variables. We were able to obtain a Kaggle score of 0.12846, making it the most performant on this metric of the models that we will present in this posting. The greatest weakness of this model is the high dependence of strong predictors on other features that are not well utilized by a linear model. Chiefly, this is the effect the neighborhood has on the house price, which ideally would manifest itself in the relationship that strong predictors like house square footage have with the sale price, as was presented previously. However, regularized simple linear regression like ElasticNet is not capable of capturing this relationship due to the lack of linear correlation between these features; more advanced models such as hierarchical linear models are necessary to fully capture this behavior. Additional feature engineering to produce interaction features may further reduce the assessment score for this model.
The ElasticNet model was found to not rely strongly on L1 ("lasso") regularization. Consequently, a purely L2-regularized ("ridge") model was studied as a point of comparison. At the cost of a larger value for the single regularization hyperparameter used by this model, ridge regression was nearly as performant on the assessment metric with a Kaggle score of 0.12486, only a slight deviation from the score achieved by the more complex ElasticNet model. This reinforces the interpretation of L1 regularization as a weak effect in producing a well-regularized linear regression model on this data set.
Gradient Boosted Regression Trees
Gradient boosted trees are an ensemble model that benefits from being able to take advantage of the hierarchical structure present in this data set by splitting on these features early in the construction of each tree. Over a large number of trees, this model is able to perturbatively improve the decision boundary used to derive predictions and capture linear effects, nonlinear effects, and interactions of features for prediction of the response variable. The consequence is that, to learn these behaviors, a large number of trees must be used, making the model susceptible to over-fitting. The hyperparameter space for gradient boosted tree models is also significantly larger than that for linear regression models. Combined with the additional time complexity of training individual models, searching for and fitting an optimal model has a much longer time requirement than the previous linear models. We found experimentally that a large number of estimators is needed to train on this data set; however, the trees used were shallow and highly regularized to guard against over-fitting on a simple linear relationship. Ultimately we achieved a score of 0.12597 with this model.
Another package for training gradient boosted trees is Microsoft's LightGBM, which improves upon the basic gradient boosted tree algorithm. Those improvements also come with an even higher-dimensional hyperparameter space to search and multiple different ensembling methods. The improvements to the algorithm provide faster tree training in comparison to XGBoost as well as good out-of-the-box performance. Ultimately we achieved a slightly better score with this model, 0.12577.
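The "many shallow, regularized trees" configuration can be illustrated with scikit-learn's GradientBoostingRegressor, used here as a stand-in so the example has no extra dependencies; it is not the package or the settings from our runs. All hyperparameter values and the synthetic data are assumptions.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

# Synthetic, mostly linear problem standing in for the housing data.
X, y = make_regression(n_samples=300, n_features=10, n_informative=5,
                       noise=10.0, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=0)

model = GradientBoostingRegressor(
    n_estimators=500,    # many boosting rounds
    max_depth=2,         # shallow trees
    learning_rate=0.05,  # small steps, a form of regularization
    subsample=0.8,       # row subsampling for each tree
    random_state=0,
)
model.fit(X_train, y_train)

train_r2 = model.score(X_train, y_train)
val_r2 = model.score(X_val, y_val)
print(train_r2, val_r2)
```

Comparing the training and held-out scores gives a quick read on how much the ensemble over-fits as the number of trees grows.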
For both of these algorithms improvements can likely be had at the price of significantly more time spent on the hyperparameter search.
K-Nearest Neighbors (KNN)
Beyond the tree and linear regression models, we implemented a KNN model to compare how a model that is highly dependent on feature space dimensionality would perform on this data set. Models based on KNN have the advantage of resiliency to noisy data and nonlinear features. Due to the reliance on geometric distance, KNN models are particularly fragile in high-dimensional feature spaces. For this reason, we predetermined feature importance to constrain the number of variables put into the model: numerical features (continuous and discrete) were assessed by lasso regression and the top 15 influential categorical variables (nominal and ordinal) were selected by random forest. A primary challenge with this model was selecting the optimal method for dealing with the distance metric for categorical variables. Ultimately the best performing model was the one in which the nominal categorical variables were one-hot encoded and combined with normalized and scaled numerical variables. Importantly, this model underestimated the importance of the neighborhood feature in the data set. We attempted to weight the distance metric applied to the categorical features by the standardized percent increase in MSE from the random forest (the metric for feature importance), but this degraded the model. The final version of this KNN model had middling performance on prediction of the Kaggle set with a score of 0.1623, likely due to the sacrifice of features for the sake of dimensional reduction.
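The idea of pruning features before a distance-based model can be sketched as a pipeline: scale, keep the features with the largest lasso coefficients, then fit KNN. This simplified example covers only the lasso half of the selection, on synthetic numeric data; the lasso alpha, the number of kept features, and the number of neighbors are assumptions.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=200, n_features=30, n_informative=5,
                       noise=5.0, random_state=0)

# Keep the 5 features with the largest absolute lasso coefficients.
selector = SelectFromModel(Lasso(alpha=1.0), max_features=5, threshold=-np.inf)
knn = make_pipeline(
    StandardScaler(),  # distances are only meaningful on a common scale
    selector,
    KNeighborsRegressor(n_neighbors=5, weights="distance"),
)
knn.fit(X, y)

n_selected = int(knn.named_steps["selectfrommodel"].get_support().sum())
predictions = knn.predict(X[:3])
print(n_selected, predictions)
```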
Support Vector Regression (SVR)
Support vector regressors provide another method of studying linear relationships while also being adaptable to nonlinear relationships using other kernels. Using SVR we were able to confirm that our data was properly linear (or linearized) by observing a performance loss when adopting polynomial kernels in comparison to linear. However, radial basis functions were observed to provide the best performance of the three kernels commonly used with this technique. Intuitively, the strong linear relationships we observed would motivate the use of linear kernels. One-hot encoded variables, however, introduced clusterings around their binary values that are well-handled by the support vectors using RBF kernels. SVR with RBF evaluated well against the assessment metric with a Kaggle score of 0.12686. We conclude that SVR with RBF kernels provides a third method of dealing with the relationship between linear numerical variables and one-hot encoded categorical features. Further gains may be possible through feature elimination, as the regularization used to prevent over-fitting in SVR is done by biasing the model rather than selecting features as in linear regression or decision trees.
The nonlinear interactions between variables that are not captured by linear models potentially favor a neural network, which may learn both nonlinear interactions and the linear dependence for small networks. We constructed a neural network with 2 hidden layers of 10 nodes each, using a logistic activation function. Although the prediction for the majority of the points was relatively accurate, the model made some very poor outlier predictions, which diminished its prediction performance on the test data set for a final Kaggle score of 0.19072. We attribute the low performance of the neural network to the relatively small number of observations in comparison to the numerous variables and internal feature weights: enough to learn the training set but not enough to prevent over-fitting and perform well on the test set. Future work would concern reducing the node count of the network and applying regularization techniques. This is unlikely to yield much benefit due to the relatively small number of observations for a model of this complexity.
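A kernel comparison of this kind can be sketched with cross-validation. The data here is synthetic and purely linear, so the ranking it prints says nothing about the Ames results; the C and epsilon values and the helper name are assumptions.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

X, y = make_regression(n_samples=200, n_features=10, noise=5.0, random_state=0)
y = (y - y.mean()) / y.std()  # standardized target

def compare_kernels(X, y, kernels=("linear", "poly", "rbf")):
    """Cross-validated RMSE for each SVR kernel (lower is better)."""
    scores = {}
    for kernel in kernels:
        model = make_pipeline(StandardScaler(),
                              SVR(kernel=kernel, C=1.0, epsilon=0.1))
        cv = cross_val_score(model, X, y, cv=5,
                             scoring="neg_root_mean_squared_error")
        scores[kernel] = -cv.mean()
    return scores

scores = compare_kernels(X, y)
print(scores)
```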
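The architecture described above (two hidden layers of 10 nodes, logistic activation) can be sketched with scikit-learn's MLPRegressor, which is an assumption about tooling rather than a record of what we ran. The L2 penalty `alpha`, the iteration cap, and the synthetic data are illustrative. Counting the parameters shows how quickly even a small network approaches the number of observations.

```python
from sklearn.datasets import make_regression
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=200, n_features=15, noise=5.0, random_state=0)
y = (y - y.mean()) / y.std()  # standardized target

model = make_pipeline(
    StandardScaler(),
    MLPRegressor(hidden_layer_sizes=(10, 10), activation="logistic",
                 alpha=1e-2,  # assumed L2 penalty, for illustration
                 max_iter=2000, random_state=0),
)
model.fit(X, y)

fitted = model.named_steps["mlpregressor"]
n_parameters = (sum(w.size for w in fitted.coefs_)
                + sum(b.size for b in fitted.intercepts_))
print(n_parameters, "parameters for", X.shape[0], "observations")
```

With only 15 inputs this network already has more parameters than the toy data has rows; the one-hot encoded housing features make the first layer far larger.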
Various models were studied for their predictive power on the Ames Housing data set. Linear models were found to be highly effective and easily tunable but unable to account for some elements of the data set. Tree models account for these elements at the expense of others and at the cost of a much larger hyperparameter search space, limiting their final performance. Further work is necessary to ameliorate over-fitting as well as provide a feature space for optimal performance in evaluation.
Summary of Results
Example code, plotting commands, and the full data work-up can be found at this GitHub repository.
Further study of this data set could improve model performance and explore additional models of interest:
- Addition of more features
- Feature selection for some models
- Hierarchical linear models
- Model stacking