What to Buy and How to Sell: Learnings from the Ames Housing Data Set

Posted on Sep 1, 2021

LinkedIn | Other Work | Github Repository

Housing prices have risen dramatically in the past year, driven by slow housing construction and pandemic-fueled demand. Buying or selling a home can be a monumental decision, and being informed can save, or net, you money. Doing so requires properly weighing and sorting the importance of a home's various features. Is it better for a home to be in a better neighborhood or to have a larger lot? What are the effects of zoning? Does painting the walls yellow increase the sale price? This project aims to accurately predict the price of novel homes and offer you a glimpse into the importance, or lack thereof, of the features of your home.

Ames Iowa & the Ames Housing Data Set

Ames, Iowa is a mid-sized Midwestern town with a population of 70 thousand. Half of that population is students at Iowa State University, which also attracts several manufacturing and engineering companies to the region, including Ball and 3M.

Importantly for this project, Ames is the source of the Ames Housing Data Set, prepared by Dean De Cock of Truman State University as an alternative to the Boston Housing Data Set. De Cock obtained four years of housing sales records and, after some minor data preparation, released them to the public for academic use. You can see the process and his original analysis here.

Working through the Kaggle competition, the data set contains over 80 features for almost 3,000 observations, roughly half of which form a test set with the sale price withheld. Each observation includes information on the features below (Table 1):

| Lot & Location | Building | Meta Data |
| --- | --- | --- |
| Neighborhood | Basements & Garages | Functionality |
| Streets & Alley Quality | Square Footage | Miscellaneous Features & Prices |
| Lot Area & Shape | Quality & Condition | Sale Type & Condition |
| Zoning | Roof & Siding Materials | Sale Date (month and year) |
| | Amenities (baths, fireplaces, pools) | |
Table 1: Description of Ames Dataset Features

Our goal is to use this information to predict the sales price of each home. In addition to having an accurate estimate for any given home, the machine learning models we create can help us understand what features are important, which increase the prices, and what features bring the price down.

Data Preparation

Data science would be much easier if all data were immediately ready for analysis, but that keeps us all employed. Before training a model on the housing data, we need to remove outliers, clean the columns, engineer features, and standardize the information.

Outlier Removal

Outliers can severely affect how a model estimates coefficients and determines what is important. Some outliers are easy to find such as the one home with nonstandard Utilities. Other outliers are more open to interpretation. Above Ground Living Area includes two values at ~4500 and ~5600 square feet which are outliers when Above Ground Living Area is plotted against Sale Price.

Another two sales appear appropriately priced for their square footage but have a high z-score. Like Above Ground Living Area, Lot Area shows several extreme outliers with several sales having lot areas greater than 100,000 square feet (2+ acres!) (Fig. 1). These eight observations (in the top-left and bottom-right quadrants of Fig. 1) were removed from the training data. In total, 9 sales were removed from the data set, less than 1% of the total training data.
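The removal step might look something like the sketch below. The column names (`GrLivArea`, `LotArea`) come from the Kaggle data key, but the exact cutoffs here are illustrative thresholds, not the precise rules used above:

```python
import pandas as pd

def drop_outliers(df: pd.DataFrame) -> pd.DataFrame:
    """Drop extreme sales. Thresholds are illustrative, not the exact cutoffs."""
    mask = (
        (df["GrLivArea"] > 4000)      # oversized above-ground living area
        | (df["LotArea"] > 100_000)   # lots larger than ~2.3 acres
    )
    return df.loc[~mask].reset_index(drop=True)

# Toy data: the second and third rows trip the area thresholds.
train = pd.DataFrame({
    "GrLivArea": [1500, 4676, 2000],
    "LotArea":   [9000, 10000, 160000],
    "SalePrice": [180000, 184750, 250000],
})
clean = drop_outliers(train)
```

The same pattern extends to z-score filters by adding another boolean condition to `mask`.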

Figure 1: Above Ground Living Area and Lot Area Outliers.

Null Values

At first glance, the data appears to be full of missing values. However, as the data key notes, most of these missing values are intentional. Most PoolQC values are null because most houses lack a pool; the same is true for basements, garages, fences, and alleys.

When this intentional missingness is corrected only three features display missing information: Lot Frontage (n=259), Masonry Veneer Type (n=8), and Electrical (n=1) (Fig. 2).

Figure 2: Count of Missing Values after adjusting for Intentional Missingness.

Electrical was imputed with the mode of the feature. Masonry Veneer Type was imputed to "None". Lot Frontage was imputed by the intersection of Neighborhood and Lot Configuration. This is based on the planned nature of Midwest towns: most construction follows a top-down approach, leaving features like Lot Area and Lot Frontage standardized across neighborhoods. Lot Configuration supplies important information as well; a house on a cul-de-sac has much less frontage than a house on a corner lot.
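A minimal sketch of the Lot Frontage imputation, assuming group medians as the fill statistic (the post does not state whether median or mean was used), with fallbacks for groups that are entirely missing:

```python
import pandas as pd

def impute_lot_frontage(df: pd.DataFrame) -> pd.DataFrame:
    """Fill LotFrontage with the median of its Neighborhood x LotConfig group,
    falling back to the neighborhood median, then the overall median."""
    df = df.copy()
    group_med = df.groupby(["Neighborhood", "LotConfig"])["LotFrontage"].transform("median")
    hood_med = df.groupby("Neighborhood")["LotFrontage"].transform("median")
    df["LotFrontage"] = (
        df["LotFrontage"].fillna(group_med).fillna(hood_med).fillna(df["LotFrontage"].median())
    )
    return df

homes = pd.DataFrame({
    "Neighborhood": ["NAmes", "NAmes", "NAmes", "CollgCr"],
    "LotConfig":    ["Inside", "Inside", "Inside", "CulDSac"],
    "LotFrontage":  [60.0, 70.0, None, None],
})
filled = impute_lot_frontage(homes)
```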

Data Typing and Correction

Some features that appear as numbers are actually categories. MSSubClass (type of dwelling) appears as a number but represents discrete categories such as "Split Foyer" or "2-Story 1945 & Older".

Additionally, the Exterior2nd feature includes misspelled values: "CmentBd" should be "CemntBd", “Brk Cmn” is actually “BrkComm”, and “Wd Shng” means “WdShing”.
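Both corrections are one-liners in pandas; the spelling map below contains exactly the three fixes listed above:

```python
import pandas as pd

# Align Exterior2nd spellings with the Exterior1st vocabulary.
SPELLING_FIXES = {"CmentBd": "CemntBd", "Brk Cmn": "BrkComm", "Wd Shng": "WdShing"}

df = pd.DataFrame({"MSSubClass": [20, 85], "Exterior2nd": ["CmentBd", "Wd Shng"]})

# Treat MSSubClass as a category, not a number.
df["MSSubClass"] = df["MSSubClass"].astype(str).astype("category")
df["Exterior2nd"] = df["Exterior2nd"].replace(SPELLING_FIXES)
```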

Feature Engineering

Information about basements and porches was stored in a way I found unhelpful. Basement Finish Type and Square Footage were stored in four columns (two finish types, two areas). I pivoted the data so that each finish type (Average Living Quarters, Average Rec Room, and Unfinished) received an individual column listing its square footage.

Porch features were organized the opposite way, each porch type had its own column which I pivoted into a Porch Type column (Open Porch, Screen, Enclosed, 3-Season, or Multiple) and a column for Total Porch Square Footage which is the sum of all porch area for each home.
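The porch pivot might be sketched as follows. The four source columns are the data set's actual porch features; the derived column names (`PorchType`, `TotalPorchSF`) and the tie-breaking rule for homes with multiple porches are my assumptions:

```python
import pandas as pd

PORCH_COLS = {"OpenPorchSF": "Open", "EnclosedPorch": "Enclosed",
              "3SsnPorch": "3-Season", "ScreenPorch": "Screen"}

def pivot_porches(df: pd.DataFrame) -> pd.DataFrame:
    """Collapse the per-type porch areas into a type label and a total area."""
    df = df.copy()
    df["TotalPorchSF"] = df[list(PORCH_COLS)].sum(axis=1)

    def porch_type(row):
        present = [name for col, name in PORCH_COLS.items() if row[col] > 0]
        if not present:
            return "None"
        return present[0] if len(present) == 1 else "Multiple"

    df["PorchType"] = df.apply(porch_type, axis=1)
    return df.drop(columns=list(PORCH_COLS))

homes = pd.DataFrame({
    "OpenPorchSF":   [50, 0, 40],
    "EnclosedPorch": [0, 0, 30],
    "3SsnPorch":     [0, 0, 0],
    "ScreenPorch":   [0, 0, 0],
})
homes = pivot_porches(homes)
```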

Additionally, I attempted to include features that might impact the price, such as the area of the yard, whether the home has been remodeled recently, the number of floors, and the Total Finished SF of the home.

Optionally, Condition1/2 (proximity to roads, railroads, or parks) and Exterior1st/2nd (material used on the side of the house) were dummified for the linear models. Since their values are spread across multiple columns, I manually constructed these dummy columns instead of using a built-in method such as One Hot Encoding.

Preparation for Modeling

For Linear Modeling:

Categorical Features were One Hot Encoded, providing a binary column for each category in each categorical column.

Numeric Features were normalized via the PowerTransformer function which centers the data at 0 with a standard deviation of 1. This is important to maintain the assumptions of linear modeling.
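The two linear-model steps above can be combined into one scikit-learn preprocessor. The column names here are an illustrative subset, not the full feature lists:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, PowerTransformer

numeric_cols = ["GrLivArea", "LotArea"]   # illustrative subset
categorical_cols = ["Neighborhood"]

# One-hot encode categoricals; Yeo-Johnson transform numerics to
# zero mean / unit variance.
preprocess = ColumnTransformer([
    ("num", PowerTransformer(method="yeo-johnson", standardize=True), numeric_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
])

X = pd.DataFrame({
    "GrLivArea": [1500, 2100, 900, 1750],
    "LotArea": [9000, 12000, 6000, 10500],
    "Neighborhood": ["NAmes", "CollgCr", "NAmes", "Timber"],
})
Xt = preprocess.fit_transform(X)  # 2 transformed numerics + 3 one-hot columns
```

`handle_unknown="ignore"` keeps the encoder from failing on neighborhoods that appear only in the validation split.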

For Tree Modeling:

Categorical Features were label encoded via a custom function. This allows sorting of the labels by sale price, so that OldTown, with label 3, has a lower average sale price than Timber, with label 21. This is not strictly necessary for tree models, but it is consistent, and scikit-learn needs numeric values even for tree-based models.

Numeric Features were left unmodified, as skewed and non-centered features do not affect tree model function.
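A minimal sketch of that custom encoder, assuming categories are ordered by their mean sale price (the function name and the exact ordering statistic are my assumptions):

```python
import pandas as pd

def price_ordered_labels(df: pd.DataFrame, col: str, target: str) -> pd.Series:
    """Encode `col` as integers ordered by the mean of `target`,
    so a higher label corresponds to a higher average sale price."""
    order = df.groupby(col)[target].mean().sort_values().index
    mapping = {cat: i for i, cat in enumerate(order)}
    return df[col].map(mapping)

sales = pd.DataFrame({
    "Neighborhood": ["OldTown", "Timber", "OldTown", "Timber", "NAmes"],
    "SalePrice": [110000, 320000, 120000, 300000, 150000],
})
sales["NeighborhoodCode"] = price_ordered_labels(sales, "Neighborhood", "SalePrice")
```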

For All Models:

For the models to make sense of the data, it needs to be in a specific format. For both linear and tree-based models, all ordinal features were encoded None > Poor > Fair > Average > Good > Excellent as 0 through 5. This lets the models learn that high values should be 'rewarded' while low values are considered worse.
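The ordinal encoding is a simple mapping over the data set's quality abbreviations (Po, Fa, TA, Gd, Ex):

```python
import pandas as pd

# Shared ordinal scale for quality/condition features; higher is better.
QUALITY_SCALE = {"None": 0, "Po": 1, "Fa": 2, "TA": 3, "Gd": 4, "Ex": 5}

kitchens = pd.Series(["TA", "Gd", "Ex", "Fa"])
encoded = kitchens.map(QUALITY_SCALE)
```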

Finally, the target variable Sale Price was also transformed with the PowerTransformer function (for both linear and tree models).

Addressing Multi-Collinearity

Variance inflation factor (VIF) is one of several ways to assess multi-collinearity. VIF assesses correlations between each feature and all others. Low VIFs imply the columns are independent of each other. A VIF of infinity means a column can be perfectly explained by other columns.
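Concretely, the VIF of feature j is 1 / (1 - R²) from regressing feature j on all the other features. A small self-contained implementation (rather than a library call) makes the infinity case visible:

```python
import numpy as np

def vif(X: np.ndarray) -> np.ndarray:
    """VIF for each column: regress it on the remaining columns and
    return 1 / (1 - R^2). np.inf signals a perfect linear dependency."""
    n, p = X.shape
    out = np.empty(p)
    for j in range(p):
        y = X[:, j]
        Z = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
        resid = y - Z @ beta
        r2 = 1 - (resid @ resid) / ((y - y.mean()) ** 2).sum()
        out[j] = np.inf if np.isclose(r2, 1.0) else 1.0 / (1.0 - r2)
    return out

rng = np.random.default_rng(0)
first = rng.normal(size=100)
second = rng.normal(size=100)
total = first + second  # perfectly determined by the other two columns
vifs = vif(np.column_stack([first, second, total]))  # all infinite
```

With only the two independent columns, the same function returns values near 1.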

Figure 3: Variance Inflation Factor for all Features.

Unfortunately, several of our features are perfectly predicted by other features (Fig. 3). These are all area features, which is expected: 1st Floor SF + 2nd Floor SF + Basement SF = Total SF. However, these values can still be important for an accurate sale price, so another method is needed to remove columns that don't contribute to the prediction.

The method I used to filter non-useful columns is mutual information (MI). Like correlation, mutual information measures how much information is shared between two variables. Unlike correlation, MI is entropy-based, so it can handle categorical variables and non-linear, high-dimensional relationships. An MI of 0 implies the input is not predictive of the output, while larger values imply more information retention.

Figure 4: Cumulative Distribution of Mutual Information between Features and Sales Price. Top 10% of Features by Mutual Information.

Based on mutual information (Fig. 4) I assembled 5 feature lists: top 10%, top 20%, top 35%, features with non-zero mutual information, and the full feature set.
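Building those lists might look like the sketch below, using scikit-learn's `mutual_info_regression` on synthetic data (the helper name and toy features are illustrative, not the project's actual feature matrix):

```python
import numpy as np
from sklearn.feature_selection import mutual_info_regression

rng = np.random.default_rng(42)
X = rng.normal(size=(300, 6))
# Only the first two features carry signal.
y = 3 * X[:, 0] + np.sin(X[:, 1]) + rng.normal(scale=0.1, size=300)

mi = mutual_info_regression(X, y, random_state=42)
ranked = np.argsort(mi)[::-1]  # feature indices, most informative first

def top_fraction(frac: float) -> list:
    """Indices of the top `frac` of features by mutual information."""
    k = max(1, int(round(frac * len(mi))))
    return ranked[:k].tolist()

feature_sets = {
    "top_10%": top_fraction(0.10),
    "top_35%": top_fraction(0.35),
    "nonzero_mi": [int(j) for j in ranked if mi[j] > 0],
    "all": ranked.tolist(),
}
```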

Model Creation & Selection

Another step toward avoiding multicollinearity was selecting only models with strong penalization functions. I used the following models. Linear: Lasso, Ridge, ElasticNet, and SVR. Tree: Random Forest, Gradient Boost, and XGBoost.

Figure 5: Analysis workflow for Hyperparameter Tuning and Model Selection

Before training, the data was split 2:1 into training and validation data (I say validation instead of test because the Kaggle competition provides a test.csv) (Fig. 5). The training data is then passed into a custom function that runs through each model and feature set (7 models, 5 feature sets, 35 combinations). Using GridSearchCV, the hyperparameters for each model + feature set are determined, yielding a 'CV test score' for each of the k-folds held out during hyperparameter tuning. Each model + feature set + hyperparameter combination is then used with the validation data to create a 'validation score'.
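The workflow above can be sketched as a nested loop. The models, parameter grids, and feature sets here are placeholders, not the seven models and five lists used in the post:

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge
from sklearn.model_selection import GridSearchCV, train_test_split

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 8))
y = 2 * X[:, 0] + rng.normal(scale=0.5, size=200)

# 2:1 split into training and validation data.
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=1 / 3, random_state=1)

models = {
    "lasso": (Lasso(max_iter=10_000), {"alpha": [0.01, 0.1, 1.0]}),
    "ridge": (Ridge(), {"alpha": [0.1, 1.0, 10.0]}),
}
feature_sets = {"top_half": [0, 1, 2, 3], "all": list(range(8))}

results = {}
for mname, (model, grid) in models.items():
    for fname, cols in feature_sets.items():
        # Tune on the training split; k-fold CV supplies the 'CV test scores'.
        search = GridSearchCV(model, grid, cv=5, scoring="neg_root_mean_squared_error")
        search.fit(X_train[:, cols], y_train)
        # Score the tuned model on the held-out validation split.
        val_rmse = -search.score(X_val[:, cols], y_val)
        results[(mname, fname)] = (search.best_params_, val_rmse)
```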

The Kaggle competition uses the RMSE of the log of the sale price, so RMSE was the scoring metric for these models (remember the sale price is already transformed, so we did not take the log) (Fig. 6).

Figure 6: Model Error for CV-test and Validation data across Feature Sets and Models.

We can see that RMSE shrinks as the number of features increases, and that Non-Zero MI and All Features have equal RMSE for most models. Overall, the linear models have lower RMSE than the tree models. We can also see overfitting, especially with the Random Forest and SVR models. Ultimately, I wanted to select one linear and one tree model for further analysis, and those were Lasso and Gradient Boost.

Results & Analysis

We can learn much by assessing where our model made good predictions and where it began to falter. Below (Fig. 7) we can see the actual vs predicted sales prices (top) and the actual vs residuals (bottom) for the Lasso (blue) and Gradient Boost (orange) models.

Figure 7: Actual vs Predicted & Residuals Sales Price for Lasso and Gradient Boost models.

Both models perform better with lower-priced homes; however, Gradient Boost appears especially prone to outliers in this region. Both models poorly predict the most expensive homes, Lasso even more so. While RMSE is useful for diagnostics, other metrics can be more intuitive, such as Mean Absolute Error (MAE) and Mean Absolute Percentage Error (MAPE). On the validation data, Lasso and Gradient Boost achieve an MAE of roughly $14,000, which corresponds to a MAPE of 8% of the sale price. So, for any given home, on average we can predict within 8% of the sale price. For average-priced homes, the error is likely even lower.
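Both metrics are available directly in scikit-learn; a toy example on made-up predictions shows the dollar and percentage interpretations:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_absolute_percentage_error

# Hypothetical actual and predicted sale prices.
actual = np.array([150_000, 200_000, 320_000])
predicted = np.array([160_000, 190_000, 300_000])

mae = mean_absolute_error(actual, predicted)              # in dollars
mape = mean_absolute_percentage_error(actual, predicted)  # fraction of price
```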

Using the test data provided by Kaggle, we can further assess the models. Lasso is most definitely overfit, having a Kaggle test score of 0.21 (RMSLE), while Gradient Boost does better at 0.135 (RMSLE), in the top third of submitted scores (Table 2).

| Model | Mean Absolute Error | Mean Absolute Percentage Error | Kaggle Score |
| --- | --- | --- | --- |
| Lasso | $13,900 | 8% | 0.21 |
| Gradient Boost | $14,000 | 8% | 0.13 |
Table 2: Model Diagnostics

We can further explore the importance and directionality of each feature using Lasso coefficients and Gradient Boost feature importance (Fig. 8).

Figure 8: Feature Importance for Lasso and Gradient Boost Models

Since Lasso coefficients represent the change in the adjusted sale price for a one-unit change in an adjusted feature, comparisons between features are difficult. Within features, however, we can see that specific neighborhoods (Crawford, Stone Brook) and zoning types (Commercial (all)) have high importance. From the Gradient Boost feature importance, we can see that Neighborhood is again important, along with overall quality, total finished square footage, and kitchen quality. To get a better sense of what these importances mean, I did some further EDA.

Figure 9: Total Finished Square Footage and Overall Quality across the twelve most popular Neighborhoods.

Above (Fig. 9) I plotted three of the most important features between Lasso and Gradient Boost models, Total Finished Square Footage, Neighborhood, & Overall Quality. We can see that these features are anything but independent. North Ames and Sawyer West have similar-sized homes, but Sawyer West’s have slightly higher Quality Scores. We can further contextualize this by taking a test case: Sawyer & Northridge Heights.

Figure 10: Feature Comparison Between Sawyer and Northridge Heights

While the lot sizes do not dramatically differ between Sawyer and Northridge Heights (Fig. 10), the Northridge Heights homes tend to be 30 years newer, have more finished square feet, and carry a three-point increase in assessed quality. All in all, this leads to Northridge Heights homes selling for more than twice the price of Sawyer homes.

Conclusions

Though it is interesting to compare homes and learn from the models, it matters little without answering the question: How do I Buy and Sell a home? From our models, we can measure what is and isn’t important.

| When Purchasing | When Selling | Unimportant to Price |
| --- | --- | --- |
| Total Square Feet (Especially Finished) | Overall Quality & Condition | Roofing Materials |
| Neighborhood | Kitchen Quality | Lot Shape |
| Lot Area | Central Air | Paved Streets |
| Zoning | Recent Remodeling | Building Type |
| | Garage | Fence |
Table 3: Important Factors

Housing transactions can be some of the most complicated and expensive choices of our lives, and it pays to be fully informed when making them. The models I created can predict the price of any house with an average error of only 8%.

While some features can be updated (such as overall quality and the kitchen), others cannot; neighborhood and total square footage are very difficult or expensive to change. Our analysis suggests that the worst home in the best neighborhood is likely the best purchase.

About Author

James Welch

I was trained as a synthetic biologist and am working to become a data scientist too. I have expertise in the genetic engineering of a variety of single-celled organisms, DNA and protein design, and industrial process scaling...
View all posts by James Welch >
