Analyzing the Ames Housing Data Set: The Different Features
The skills I demoed here can be learned through taking Data Science with Machine Learning bootcamp with NYC Data Science Academy.
Housing prices have risen dramatically in the past year, driven by slow housing construction and increased demand driven by the pandemic. Buying or selling a home can be a monumental decision and being informed can save, or net, you money. This can be done by properly weighing and sorting the importance of various features of a home. Is it better for a home to be in a better neighborhood or have a larger lot? What are the effects of zoning? Does painting the walls yellow increase the sale price? This project aims to use data to accurately predict the price of novel homes and offer you a glimpse into the importance, or lack thereof, of the features of your home.
Ames Iowa & the Ames Housing Data Set
Ames Iowa is a mid-sized mid-western town with a population of 70 thousand. Half of that population is students due to Iowa State University which also attracts several manufacturing & engineering companies to the region including Ball and 3M.
Importantly for this project, Ames is the source of the Ames Housing Data Set. Prepared by Dean De Cock of Truman State University as an alternative to the Boston housing Data Set. De Cock obtained 4 years of housing sales records and after some minor data preparation released it to the public for academic use. You can see the process and his original analysis here.
Working through the Kaggle Competition, the data set contains over 80 values for almost 3000 observations, roughly half of which are a test-set with the sale price withheld. Each observation includes information on the features below (Table 1):
|Lot & Location||Building||Meta Data|
|Neighborhood||Basements & Garages||Functionality|
|Streets & Alley Quality||Square Footage||Miscellaneous Features & Prices|
|Lot Area & Shape||Quality & Condition||Sale Type & Condition|
|Zoning||Roof & Siding Materials||Sale Date (month and year)|
|Amenities (baths, fireplaces, pools)|
Our goal is to use this information to predict the sales price of each home. In addition to having an accurate estimate for any given home, the machine learning models we create can help us understand what features are important, which increase the prices, and what features bring the price down.
Data science would be much easier if all data was immediately ready for analysis but that keeps us all employed. Before training a model on the housing data we need to remove outliers, clean the columns, feature engineering, and standardize the information.
Outliers can severely affect how a model estimates coefficients and determines what is important. Some outliers are easy to find such as the one home with nonstandard Utilities. Other outliers are more open to interpretation. Above Ground Living Area includes two values at ~4500 and ~5600 square feet which are outliers when Above Ground Living Area is plotted against Sale Price.
Another two sales appear appropriately priced for their square footage but have a high z-score. Like Above Ground Living Area, Lot Area shows several extreme outliers with several sales having lot areas greater than 100,000 square feet (2+ acres!) (Fig. 1). These eight observations (in the top-left and bottom-right quadrants of Fig. 1) were removed from the training data. In total, 9 sales were removed from the data set, less than 1% of the total training data.
Null Data Values
At first glance, the data appears to be full of missing values. However, as the data key notes, that most of these missing values are intentional. Most PoolQC values are null because most houses lack a pool; the same is true for basements, garages, fences, and alleys.
When this intentional missingness is corrected only three features display missing information: Lot Frontage (n=259), Masonry Veneer Type (n=8), and Electrical (n=1) (Fig. 2).
Electrical was imputed with the mode of the feature. Masonry Veneer Type was imputed to "None". Lot Frontage was imputed by the intersection of Neighborhood and Lot Configuration. This is based on the planned nature of Midwest towns. Most construction follows a top-down approach leaving features like Lot Area & Lot Frontage standardized across Neighborhoods. The Lot Configuration supplies important information as well, a house on a cul-de-sac has much less frontage than a house on a corner lot.
Data Typing and Correction
Some features which appear as numbers are categories, MSSubClass (type of dwelling) appears as a number but represents discrete categories such as “Split Foyer” or “2-Story 1945 & Older”
Additionally, the Exterior2nd feature includes misspelled values: "CmentBd" should be "CemntBd", “Brk Cmn” is actually “BrkComm”, and “Wd Shng” means “WdShing”.
Data on Feature Engineering
Information about the basements and Porches were stored in a way I found un-useful. Basement Finish Type and Square Footage were stored in 4 columns (2 finish types, 2 areas). I pivoted the data so that each finish type (Average Living Quarters, Average Rec room, & Unfinished) received an individual column listing the square footage.
Porch features were organized the opposite way, each porch type had its own column which I pivoted into a Porch Type column (Open Porch, Screen, Enclosed, 3-Season, or Multiple) and a column for Total Porch Square Footage which is the sum of all porch area for each home.
Additionally, I attempted to include features that might impact the price such as calculating the area of the yard, if the home has been remolded recently, the number of floors & the Total Finished SF of the home.
Optionally, Condition1/2 (proximity to roads, railroads, or parks) and Exterior1st/2nd (Material used on the side of the house) were dummified for the linear models. Since data is stored across multiple columns, I manually construct these columns instead of using a built-in method such as One Hot Encoding.
Preparation for Modeling
For Linear Modeling:
Categorical Features were One Hot Encoded, providing a binary column for each category in each categorical column.
Numeric Features were normalized via the PowerTransformer function which centers the data at 0 with a standard deviation of 1. This is important to maintain the assumptions of linear modeling.
For Tree Modeling:
Categorical Features were Label encoded via a custom function. This allows the sorting of the labels so that an OldTown with label 3 has a lower sale price than Timber with a label of 21. Not strictly necessary for tree models but it is consistent and scikit-learn needs numeric values even in tree-based models.
Numeric Features were left unmodified as skewed and non-centered features do not affect tree model function.
For All Models:
For the models to make sense of the data, they need to be in a specific format. For both linear & tree-based models, all ordinal features were encoded None > Poor > Fair > Average > Good> Excellent as 0 through 5. This allows the models to tell which values should be ‘rewarded’ while low values are considered worse.
Finally, the target variable Sale Price was also encoded with the PowerTransformer function (for both linear & tree models.)
Addressing Multi-Collinearity Data
Variance inflation factor (VIF) is one of several ways to assess multi-collinearity. VIF assesses correlations between each feature and all others. Low VIFs imply the columns are independent of each other. A VIF of infinity means a column can be perfectly explained by other columns.
Unfortunately, several of our features are perfectly predicted by other features (Fig. 3), these are all area features which are expected: 1st Floor SF + 2nd Floor SF + Basement SF = Total SF. However, these values can still be important to get an accurate sales price, so another method is needed to remove columns that don’t contribute to the prediction
The method I used to filter non-useful columns is mutual information (MI). Like correlation, mutual information measures how much information is shared between two values. Unlike correlation, MI is entropy-based so can handle categorical variables, non-linear, and high-dimensional relationships. A MI of 0 implies the input is not predictive of the output while a MI of 1 implies perfect information retention.
Based on mutual information (Fig. 4) I assembled 5 feature lists: top 10%, top 20%, top 35%, features with non-zero mutual information, and the full feature set.
Model Creation & Selection
Another step to avoiding multicollinearity was only selecting models with strong penalization functions. I used the following models: Linear: Lasso, Ridge, ElasticNet, SVR; Tree: RandomForest, Gradient Boost, XGBoost.
Before training the data was split 2:1 into training and validation data (I am using validation instead of test because the Kaggle competition provides a test.csv) (Fig. 5). The training data is then passed into a custom function that runs through each model and feature set (7 models, 5 feature sets, 35 combinations). Using GridSearchCV the hyperparameters for each Model + Feature Set are determined, yielding ‘CV test scores’ for each of the k-folds that was held out of the hyperparameter tuning. This model + Features + hyperparameter can then be used with the validation data to create a “validation score”.
The Kaggle competition uses the RMSE of the log of the sales price, so RMSE was the output of this model (remember the sales price is adjusted but we did not take the log) (Fig. 6).
We can see that RMSE shrinks as the number of features increases, and Non-Zero MI and All Features have equal RMSE for most models. Overall, the linear models seem to have lower RMSE than the tree models. We can also see overfitting, especially with the RandomForest and SVR models. Ultimately, I wanted to select a linear & tree model for further analysis and that was Lasso & Gradient Boost.
Data Results & Analysis
We can learn much by assessing where our model made good predictions and where it began to falter. Below (Fig. 7) we can see the actual vs predicted sales prices (top) and the actual vs residuals (bottom) for the Lasso (blue) and Gradient Boost (orange) models.
Both models perform better with lower-priced homes, however, Gradient Boost appears especially prone to outliers in this region. Both models poorly predict the most expensive homes, Lasso even more so. While RMSE is useful for diagnostics, other metrics can be more intuitive such as Mean Average Error (MAE) and Mean Average Percentage Error (MAPE). The validation data, Lasso and Gradient Boost achieve an MAE of $14,000 which corresponds to a MAPE of 8% of the sales price. So, for any given home, on average we can predict within 8% of the sales price. For average-priced homes, the error is likely even lower.
Comparing Lass and Gradient Boost
Using the Test data provided by Kaggle we can further assess the models. Lasso is most definitely overfit, having a Kaggle Test score of 0.21 (RMSLE) while Gradient Boost does better at 0.135 (RMSLE), in the top 1/3 of submitted scores (Table 2).
|Model||Mean Absolute Error||Mean Absolute Percentage Error||Kaggle Score|
We can further explore the importance and directionality of each feature using Lasso coefficients and Gradient Boost feature importance (Fig. 8).
Figure 8: Feature Importance for Lasso and Gradient Boost Models
Since Lasso coefficients represent the change in adjusted sales price for 1 unit change in the adjusted features between feature assessments are difficult but within features, we can see specific Neighborhoods (Crawford, Stone Brook) and Zoning types (Commercial (all)) have high importance. From Gradient Boost feature importance, we can see that Neighborhood is again important along with overall quality, total finished square footage, and kitchen quality. To get a better sense of what these importance’s mean I did some further EDA.
Above (Fig. 9) I plotted three of the most important features between Lasso and Gradient Boost models, Total Finished Square Footage, Neighborhood, & Overall Quality. We can see that these features are anything but independent. North Ames and Sawyer West have similar-sized homes, but Sawyer West’s have slightly higher Quality Scores. We can further contextualize this by taking a test case: Sawyer & Northridge Heights.
While the lot sizes do not dramatically differ between Sawyer and Northridge Heights (Fig. 10). The Northridge Heights homes tend to be 30 years newer, have more finished square feet, and a three-point increase in assessed quality. All-in-all this leads to Northridge homes selling for more than twice the price of Sawyer homes.
Though it is interesting to compare homes and learn from the models, it matters little without answering the question: How do I Buy and Sell a home? From our models, we can measure what is and isn’t important.
|When Purchasing||When Selling||Unimportant to Price|
|Total Square Feet (Especially Finished)||Overall Quality & Condition||Roofing Materials|
|Neighborhood||Kitchen Quality||Lot Shape|
|Lot Area||Central Air||Paved Streets|
|Zoning||Recent Remodeling||Building Type|
Housing transactions can be some of the most complicated and expensive choices of our lives and it pays to be fully informed when making them. The models I created can predict the price of any house with an error of only 8%.
While some features can be updated (such as quality and kitchen), others cannot, neighborhood and total square footage are very difficult or expensive to change. Our analysis suggests that the worst home in the best neighborhood is likely the best purchase.