Creating an interpretable model of the Ames Dataset

Zach Stone

Posted on Dec 13, 2022

This work was done with the guidance of the Data Science with Machine Learning bootcamp at NYC Data Science Academy. Sample code used for the research can be found on GitHub.

Background

The Ames Housing Dataset is a feature-rich collection of home listings in Ames, IA along with their sale price. This dataset is commonly used to demonstrate the need for feature selection, model tuning, and other techniques in supervised learning. Linear models and decision tree models are considered the most appropriate and high-performing models for sale price prediction. While out-of-the-box models can account for up to 92% of the variance in sale price on the full dataset, accuracy can be improved by removing outliers. However, this research primarily trained and tested models on the full dataset.

Goals & results

While more powerful supervised methods were tested as well (XGBoost, Random Forest, Lasso), the primary goal here was to use feature selection to create an interpretable, comparably performing, linear model. Compared to random forests or other ensemble models, a linear model is much more interpretable: the coefficients represent specific rates - in this case, dollar values - associated with the features.

All models were trained and tested on a cleaned and engineered version of the dataset. Some observations were adjusted to account for missing values or values that were logically inconsistent (e.g., garages with no square footage and car space, or remodel dates before the build date). Additionally, some categorical features were binned based on domain knowledge and exploratory analysis. To increase confidence in the specific values assigned to the selected features, the techniques for feature selection were specifically chosen to reduce the standard error of the coefficients without severely impacting accuracy. The final models had the following advantages:

Standard linear regression and lasso models on the selected feature set were much less overfitted than other models. The reduced linear model obtained train/test R² scores of 92.6%/91.8% and an average cross-validation score of 91.1% on the full dataset.
The final model used a subset of features which made a compromise between the reliability of the coefficients and AIC/BIC scores. This model improved the MAE by $40k on test data compared to the null model.
Feature selection led to a model with drastically smaller confidence intervals around its coefficients: the average relative standard error per feature dropped from over 530% in the full model to 23% in the reduced model. Reliable coefficients make it possible to read off the contribution of each feature to the final appraisal.

The first point refers to the linear models performing similarly on training and test sets and in cross validation, while more advanced models, like XGBoost and Random Forests, showed a significant difference between their training and test scores. The third bullet refers to the improved reliability of the estimated value contributed to the sale price of each home by its features. These can be used to evaluate the benefits of various potential improvements when preparing a house for sale, compare investments, or evaluate the reliability of various inspections of the house.

Accuracy

[Figure: Predicted vs. actual price on a test set. The listings with the 10 largest residuals are highlighted.]

The plot above shows the predicted vs. actual price on a test set consisting of 30% of the dataset when the model is trained on the remaining 70%, giving an R² of 91.8% and a mean absolute error (MAE) of $14,418.43. For comparison, the null model, which always guesses the average sale price of the training set, has an MAE of $55,061.91, so the linear model improves over the null model by over $40,000 per listing on average. Of the listings with the 10 largest residuals, four are the four most expensive houses in the test set. When these outliers are removed, the MAE drops slightly to $13,678.69 and the R² rises to 92.1%. Additionally, using the same subset of features on the full dataset, a standard linear model had an average R² score of 91.1% in a 5-fold cross-validation test.
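The comparison with the null model can be sketched in a few lines. The helper names (`mae`, `null_predictions`) and the toy prices are made up for illustration; only the two reported MAE figures come from the text above.

```python
from statistics import mean

def mae(actual, predicted):
    """Mean absolute error between two equal-length sequences."""
    return mean(abs(a - p) for a, p in zip(actual, predicted))

def null_predictions(train_prices, n):
    """The null model predicts the training-set average for every listing."""
    avg = mean(train_prices)
    return [avg] * n

# Toy data (hypothetical, not from the Ames dataset).
train_prices = [100_000, 200_000, 300_000]
test_actual = [150_000, 250_000]
model_pred = [160_000, 240_000]

null_mae = mae(test_actual, null_predictions(train_prices, len(test_actual)))
model_mae = mae(test_actual, model_pred)

# Improvement implied by the MAE figures reported in the post.
reported_improvement = 55_061.91 - 14_418.43
```

The same subtraction applied to the reported figures gives the "over $40,000 per listing" improvement quoted above.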

The out-of-the-box linear model on the full set of (cleaned and engineered) features had slightly higher train/test scores of 93.2%/92.0%, though the test score is comparable to that of the reduced model. However, its average cross-validation score on the full dataset was only 87.4%. An F-test does indicate that the full model fits better than the reduced model on the full dataset. Even so, the larger gap between the train and test scores and the significantly lower cross-validation score indicate that the full model is slightly overfitting when compared to the reduced model.

Comparing models

[Figure: accuracy comparison across models]

By comparison, an out-of-the-box Gradient Boosting Regressor on the full set of features had train/test scores of 96.2%/91.9% and an average score of 90.0% on a 5-fold cross-validation test. (It should be noted that ordinals were encoded as numeric for both tree models, while individual values were dummified in the linear models.) While this model has higher training accuracy, the test and cross-validation scores indicate that it is overfitting more than the linear models (reduced and full). Tuning the number of estimators and tree depth exacerbated the problem, raising the training accuracy to nearly 100% while the test accuracy remained the same.

Similarly, a tuned Random Forest Regressor was also overfitted with 98.2%/88.3% train/test scores, and cross-validation confirmed a score close to the test score. However, a properly tuned CatBoost algorithm can reach an average 92.6% cross-validation score on the full dataset with all observations and features. The metrics above show that the interpretable linear model, on an appropriately selected subset of features, takes only a slight hit to R² score when compared to these more powerful models on test and cross-validation sets (at least compared to when they are trained and tested on the full feature set).

The tuned Lasso model is the least overfitted of the models and performed comparably on the cross-validation test. When only the subset of features from the reduced linear model is used, tuning the Lasso model by cross-validation pushes it towards the regular linear model: the alpha value becomes very small, so that almost no penalty is applied and the model becomes nearly identical to an unpenalized one.

Interpretability

The feature selection methods used were intended to strike a compromise between reducing the standard error of the coefficients and maintaining model accuracy. Since each coefficient has units of $/unit (or just $, as with the constant term), they are all measured on a ratio scale, so it makes sense to compute their relative standard error (RSE). This is defined as the standard error divided by the absolute value of the coefficient itself, and it is a unitless ratio. The (roughly 95%) confidence interval of a coefficient is approximately the value of the coefficient ± 2 * the standard error. If this interval contains 0, we cannot be sure that the feature contributes meaningfully to the model, since the coefficient is not statistically distinguishable from 0. Hence, the following are equivalent:

The feature associated with a coefficient is a statistically significant predictor.
The absolute value of the coefficient is larger than 2 * standard error.
The RSE of the coefficient is less than 50%.
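The equivalence between the last two conditions can be checked directly. This is a minimal sketch; the function names and the coefficient/standard-error pairs are hypothetical, and the coefficient is assumed to be nonzero.

```python
def relative_standard_error(coef, std_err):
    """Standard error as a fraction of the coefficient's magnitude (coef != 0)."""
    return std_err / abs(coef)

def is_significant(coef, std_err, z=2.0):
    """True when the interval coef ± z * std_err excludes 0."""
    return abs(coef) > z * std_err

# Hypothetical (coefficient, standard error) pairs in dollars.
examples = [(50.0, 5.0), (-1200.0, 900.0), (300.0, 149.0), (10.0, 5.0)]

for coef, se in examples:
    rse = relative_standard_error(coef, se)
    # An RSE below 50% and |coef| > 2 * SE are the same condition.
    print(f"coef={coef:>8.1f}  RSE={rse:6.1%}  significant={is_significant(coef, se)}")
```

Dividing both sides of |coefficient| > 2 * SE by 2 * |coefficient| gives SE / |coefficient| < 1/2, which is why the 50% threshold appears.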

Features were selected such that the reduced model had all coefficients with RSE less than 50%, so that all of them were significant. Moreover, the average RSE in the reduced model was 23%, compared to the full model, where the average RSE is over 530%, meaning that the average standard error was over five times the value of the coefficient itself in the full model.

[Figure: histogram of relative standard errors (RSE)]

Numeric features

Thirteen numeric features persisted through the selection process. Eight corresponded to area. Since the total finished interior square footage is the sum of the square footage of each room type, one has to choose either the total areas or the individual room areas -- but not both -- in order to avoid linear dependence (see here). Only the total interior square footage was used, here separated into 1st and 2nd floor square footage, both valued at about 50 $/sq.ft. This performed better than using the area of individual types of finished rooms. Other areas, including outdoor, garage, finished basement, and low quality areas, were also appraised by the model, as was the surface area of masonry veneer. These values could be used, e.g., to estimate the ROI on constructing certain extensions of a home or remodeling unfinished areas.

[Figure: estimated values of area features]

Information relating to the age of the house and when it was remodeled accounts for two more significant numeric features. We can see that houses depreciate at a rate of over 300 $/year, while the advantage from remodeling depreciates at a rate of about 100 $/year.

[Figure: estimated values of date features]

Additionally, certain counts were treated as numeric features. These deserve some explanation.

[Figure: estimated values of count features]

Most homes have at most one fireplace, and its valuation possibly acts as a proxy for the 'grandeur' of the home.
The number of full basement bathrooms acts as a proxy for whether the basement is also a livable area, with most listings having 0, 1, or 2 basement full bathrooms.
The last is the most surprising, with increased bedrooms corresponding to a decrease in sale price. However, this makes more sense when sale prices are inspected against bedroom count in light of the dwelling types.

[Figure: sale price vs. bedroom count by dwelling type]

The average number of bedrooms per listing is 2.85, so many homes near this average will incur a comparable penalty. Many of the listings with a higher number of bedrooms, incurring a higher penalty, are one of two types: more modern 2-story homes, or duplexes. Among homes with four or more bedrooms, the lower end of the price distribution is mostly occupied by duplexes - which generally sell for less than other dwelling types - while the higher end is occupied by the modern 2-story homes. The penalty incurred by the bedroom count for those homes may be offset by their other advantages, namely being newer and having larger square footage than other single-family homes.

Categorical features

Many of the remaining features are dummified categorical variables. While we will not cover all of them here, most of them fell into one of the following categories: (1) neighborhood, (2) exterior type, (3) dwelling type, (4) condition and quality inspection ratings, or (5) data about the functionality of the basement. Neighborhood is, as expected, one of the strongest predictors.

[Figure: estimated price contribution by neighborhood]

The neighborhoods above were identified by the selection process as contributing significantly to the house price -- either as an advantage or penalty. These advantages/penalties are in comparison to houses in all neighborhoods not listed. For listings in other areas, the neighborhood does not significantly contribute to the price of the house.

Another strong predictor was the dwelling type. All dwelling types which came out as significant through the selection process were penalties, corresponding to different alternatives to traditional single-family dwellings. In particular, duplexes, 2-family conversions, and planned unit developments all incur penalties. The dwelling types not deemed significant - i.e., incurring no penalty - were exactly the single-family homes, regardless of type (1, 1.5, or 2 story, split-level, etc.). This supports the earlier argument that duplexes sell for less than other types of dwellings, which partially accounts for the negative coefficient for bedroom counts.

[Figure: estimated price contribution by dwelling type]

Finally, the overall quality and condition inspections were also strong predictors, both rated on a scale of 1-10. For condition, 5 was the median value, and all values except this median and the extreme values of 1 and 10 were reliable predictors. Moreover, significant scores below the median (i.e., 2-4) were all penalties, while significant scores above the median (i.e., 6-9), were all advantages. As expected, the contribution to price changes monotonically with the score among the significant scores.

Similarly, 6 was the median quality score, and all scores 4-10 except the median were significant. Again, the significant values below the median (4 and 5) were penalties, while those above the median (7-10) were advantages. The contribution of quality to price is nearly monotonic, with the penalties associated with scores 4 and 5 within each other's confidence interval.

These estimates of the value of various ratings demonstrate that such inspections can be reliable metrics, whose results contribute significantly to the appraisal. However, most other inspection results did not reliably contribute to the sale price.

[Figure: estimated values of quality and condition ratings]

Data preparation

The Ames dataset is incredibly rich with 79 features, covering a range of numeric, ordinal, and nominal categorical information about each listing. The numeric data includes square footage of various types of rooms, the length of the perimeter touching a street, and information about the age of the house and renovations; the ordinal data contains different room counts and inspection results; and the nominal data includes the neighborhood names, materials used in various house features, and information about the nearby environment.

Other than encoding the various features, e.g., converting Likert-like ratings to integers and dummifying categorical features, minimal changes were made to the data. Outliers were kept, and the features used in the linear model were not rescaled. A few logical inconsistencies were adjusted based on available information - for example, one remodeling year was dated before the house build date. Additionally, a few listings were missing a small amount of information, such as the type of electrical system used in the house, which could generally be imputed with the majority value without drastically changing the dataset.

Otherwise, only a few features were engineered: (1) many categorical features had a high number of categories, which were binned based on domain knowledge (e.g., various types of veneer were collapsed into meta-categories, such as 'brick', 'wood', or 'manufactured') in order to not have an explosion of dummy features, and (2) the quantity formed from the product of unfinished basement square footage with the basement quality inspection gave a more reliable metric than either independently. The idea behind this second metric is that there may be a gradation of the types of unfinished basement, though it did introduce the naive assumption that the value of a square foot of unfinished basement varies linearly with the quality inspection.

Feature selection methods

Several feature selection methods were combined to select a set of features whose individual values were reliable, while also maintaining model accuracy and generalizability. A number of factors can contribute to the (relative) standard error of the coefficients. Just to name two:

If too many predictive features are used, it can be difficult for the model to determine the contribution of each.
If there is a linear combination of all observations of a set of features f₁, f₂, ... which is sufficiently close to zero (i.e., the features are nearly linearly dependent), then arbitrary multiples of the coefficients of that combination can be added to the model coefficients without changing the predictions much. Hence, a huge range of coefficients will be possible, each producing models making nearly identical predictions, so we will not be able to rely on the particular values a model decides on.

The first is especially relevant, since dummification of the many categorical features in the dataset explodes the dimension of the data. The second is sometimes referred to as multicollinearity, though it is important to note that this refers to collinearity in the space whose coordinates represent different observations (of the same feature), not in the space whose coordinates represent different feature values (of the same observation), though data is typically visualized in the latter.

Lasso

One technique which can address both is using the coefficients of a lasso model to determine the significance of each feature. Such models penalize large coefficients in a way which coerces some of them very close to zero. A lasso model was tuned using cross-validation on the training set, and those features with small coefficients were removed. The histogram below shows that a large majority of features can be excluded this way, with many having coefficients very close to 0.
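The selection step can be illustrated with a small lasso sketch on synthetic data. This is not the code used in the research: it is a hand-written coordinate-descent lasso with a fixed, arbitrarily chosen `alpha`, whereas the model described above was tuned by cross-validation (in practice a library implementation such as scikit-learn's `LassoCV` would handle that). The data, names, and threshold are all assumptions for illustration.

```python
import numpy as np

def soft_threshold(z, t):
    """Shrink z toward zero by t, returning exactly 0 when |z| <= t."""
    return np.sign(z) * max(abs(z) - t, 0.0)

def lasso_coordinate_descent(X, y, alpha, n_iter=100):
    """Minimize (1/(2n)) * ||y - Xb||^2 + alpha * ||b||_1.

    Assumes the columns of X are standardized and y is centered.
    """
    n, p = X.shape
    beta = np.zeros(p)
    resid = y.copy()                      # residual for the current beta
    col_scale = (X ** 2).sum(axis=0) / n
    for _ in range(n_iter):
        for j in range(p):
            resid += X[:, j] * beta[j]    # remove feature j's contribution
            rho = X[:, j] @ resid / n     # covariance of feature j with the partial residual
            beta[j] = soft_threshold(rho, alpha) / col_scale[j]
            resid -= X[:, j] * beta[j]    # add the updated contribution back
    return beta

# Synthetic data: only the first two of six features affect the target.
rng = np.random.default_rng(0)
n = 500
X = rng.normal(size=(n, 6))
X = (X - X.mean(axis=0)) / X.std(axis=0)
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.3, size=n)
y = y - y.mean()

beta = lasso_coordinate_descent(X, y, alpha=0.1)
# Keep only the features whose coefficients were not shrunk to (near) zero.
kept = [j for j in range(X.shape[1]) if abs(beta[j]) > 1e-8]
```

The soft-thresholding step is what sets weakly predictive coefficients to exactly zero, which is the property used here to discard features.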

Linear dependence

However, the remaining features were still heavily linearly dependent in the sense described above. For example, certain totals (like square footage) are by definition the sum of other features (like the square footage of each room type), and so a linear combination of them equalling exactly zero exists. Removing a feature which is (close to) a linear combination of the remaining features will not reduce the information available to the model. However, as the way in which families of features may depend on each other can be complex, it is not always clear which feature to remove to reduce the linear dependence.

To handle this, an iterative method was used consisting of the following steps:

(1) Measure how linearly dependent each feature is on the others by checking the R² of the linear model predicting that feature from all other features.
(2) Loop through the features, starting with the most predictable, and check whether removing the feature improves (or at least retains) the accuracy of the model under cross-validation.
(3) If removing a feature retains or improves the cross-validation score, drop it and repeat the process from step (1); otherwise, continue down the list until such a feature is found.
(4) If no remaining feature can be removed without reducing the cross-validation score, terminate the process.
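The steps above can be sketched with NumPy as follows. This is an illustrative reconstruction, not the original code: the function names, the unshuffled 5-fold split, the small tolerance `tol` used to treat floating-point ties as "retaining" the score, and the synthetic data are all assumptions.

```python
import numpy as np

def r2(y, pred):
    """Coefficient of determination."""
    return 1.0 - float(np.sum((y - pred) ** 2)) / float(np.sum((y - y.mean()) ** 2))

def fit_predict(X_train, y_train, X_test):
    """Ordinary least squares with an intercept; returns predictions for X_test."""
    A = np.column_stack([np.ones(len(X_train)), X_train])
    coef, *_ = np.linalg.lstsq(A, y_train, rcond=None)
    return np.column_stack([np.ones(len(X_test)), X_test]) @ coef

def cv_score(X, y, k=5):
    """Mean out-of-fold R² over k contiguous folds (rows assumed already shuffled)."""
    idx = np.arange(len(y))
    scores = []
    for test_idx in np.array_split(idx, k):
        train_idx = np.setdiff1d(idx, test_idx)
        pred = fit_predict(X[train_idx], y[train_idx], X[test_idx])
        scores.append(r2(y[test_idx], pred))
    return float(np.mean(scores))

def predictability(X, j):
    """In-sample R² of feature j regressed on all other features (step 1)."""
    others = np.delete(X, j, axis=1)
    return r2(X[:, j], fit_predict(others, X[:, j], others))

def iterative_selection(X, y, names, tol=1e-6):
    keep = list(range(X.shape[1]))
    while len(keep) > 1:
        Xk = X[:, keep]
        base = cv_score(Xk, y)
        # Step 1: rank the remaining features from most to least predictable.
        order = sorted(range(len(keep)), key=lambda j: predictability(Xk, j), reverse=True)
        for j in order:
            # Steps 2-3: drop the first feature whose removal retains the CV score.
            if cv_score(np.delete(Xk, j, axis=1), y) >= base - tol:
                del keep[j]
                break
        else:
            break  # Step 4: every removal hurts the CV score, so stop.
    return [names[i] for i in keep]

# Synthetic example: "total" is exactly the sum of the two floor areas.
rng = np.random.default_rng(0)
n = 300
first = rng.normal(1000, 200, n)
second = rng.normal(600, 150, n)
noise_feature = rng.normal(0, 1, n)
X = np.column_stack([first, second, first + second, noise_feature])
y = 30 * first + 80 * second + rng.normal(0, 500, n)

selected = iterative_selection(X, y, ["first_floor", "second_floor", "total", "noise"])
```

In this toy setup one of the three mutually dependent area columns is redundant, so the procedure should discard exactly one of them; whether the irrelevant `noise` column survives depends on small cross-validation differences.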

This method heavily reduced linear dependence among the features. While the features remaining after lasso selection are 86.7% linearly predictable from the other features on average, this number is reduced to 56.6% after iterative selection. Despite the elimination of many features, the cross-validation score held at 90.4%, as the information in the removed features is largely captured by the remaining ones.

Finally, additional features were considered for elimination based on high predictability from the remaining features and high standard errors. Various combinations were tested based on AIC/BIC scores. Those which reduced AIC/BIC scores the most without reducing cross-validation scores were removed, resulting in the final selection of features.
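For reference, one common form of AIC and BIC for a least-squares model with Gaussian errors (up to an additive constant) can be computed from the residual sum of squares. The source does not say which implementation was used, and the numbers below are hypothetical, chosen only to show a smaller model winning on both criteria.

```python
import math

def aic_bic(rss, n, k):
    """AIC and BIC (up to a shared constant) for a least-squares fit.

    rss: residual sum of squares, n: number of observations,
    k: number of estimated parameters.
    """
    base = n * math.log(rss / n)
    return base + 2 * k, base + k * math.log(n)

# Hypothetical fits: the full model has a slightly lower RSS but many more parameters.
aic_full, bic_full = aic_bic(rss=95.0, n=1000, k=200)
aic_reduced, bic_reduced = aic_bic(rss=100.0, n=1000, k=50)
```

Both criteria reward a lower residual error but charge for each extra parameter, with BIC charging more heavily for large n, which is why they suit a search for a small feature set that still fits well.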

Summary

This research attempted to tackle an alternative goal with the Ames Housing Dataset: constructing, through feature selection, an interpretable model that is competitive with the more advanced models. This allows us to estimate the contribution of each feature to the sale price in addition to automating the appraisal process. Though taking a slight hit to accuracy, the coefficients in the reduced model were significantly more reliable than in the full model, reducing the average RSE from over 530% to 23%. Additionally, the reduced linear model was less overfit than the more advanced models.

While the reliability of the coefficients was significantly improved, it is important to remark that the values are relative to other houses (or the same house upon changing a feature), as they are offset by a constant. However, such estimates could be used to determine which improvements to make on a house, the estimated annual loss due to the age of the home and/or renovations, and other values important to investors. While Lasso is otherwise the most competitive model, it collapses to regular linear regression upon restricting to the selected features, indicating that the choice of features is appropriate.

About Author

Zach Stone

I am a data scientist with a background in linguistics research and math. I love to make it easier to analyze and draw insights from complex patterns using a combination of research, code, and modeling.
