The Best Bang for Your Buck in Ames, Iowa
This post summarizes the project and highlights several visualizations. View all of the code, the model comparisons, and the final presentation slides on GitHub.
Why use machine learning in real estate?
The real estate market is a high-value market in which investors can incur substantial losses or attain significant gains. Both buyers and sellers rely on real estate experts to understand the factors that boost or diminish a home’s value and accurately estimate a home's sale price. Strong content knowledge of these factors can help buyers strategize how to get the best deal and sellers strategize how to maximize their profit.
The Ames, Iowa Housing dataset provides an opportunity to explore the value of machine learning models in contributing to housing market insights. The dataset comprises 2,580 homes sold in Ames, Iowa. The 81 columns include a range of features — from square footage to number of bathrooms to the home’s exterior material — and the target variable, the sale price. Building and comparing a range of machine learning models allows us not only to find the strongest predictive model, but also to see which features are associated with the greatest value.
For this analysis, there are two overarching goals. First, we aim to find home features that can be leveraged to increase house prices. While the analysis is strictly limited to data from Ames, Iowa, the value of many house features is applicable across regions in similar markets. Second, we aim to use a strong predictive model to identify homes that were undervalued in their sale price. Homes that have a higher predicted value than their actual sale price could offer insights into how to find a good deal in the housing market.
Data Preprocessing
To prepare the data for modeling, we had to determine how to impute missing values in several columns. For example, for Lot Frontage, the length of the lot that touches the street, we used the relationship between Lot Frontage and Lot Area among existing values to impute the missing ones. For Garage Area and Garage Cars, we imputed missing values with the dataset mean. For Masonry Veneer Area, Basement Bathrooms, and Basement Square Feet, we imputed missing values with 0, as these properties lack those features. For the categorical variables, the data dictionary specifies that an input of NA represents “none” of that feature, so all missing categorical values were imputed with “None” as the final category.
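A minimal sketch of this imputation step is below. It assumes Kaggle-style Ames column names (LotFrontage, GarageCars, MasVnrArea, and so on), and a simple linear fit stands in for the Lot Frontage/Lot Area relationship; the exact method we used may differ.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

def impute_ames(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    # Lot Frontage: fit LotFrontage ~ LotArea on known rows, then predict the missing ones
    known = df["LotFrontage"].notna()
    reg = LinearRegression().fit(df.loc[known, ["LotArea"]], df.loc[known, "LotFrontage"])
    df.loc[~known, "LotFrontage"] = reg.predict(df.loc[~known, ["LotArea"]])
    # Garage fields: fill with the dataset mean
    for col in ["GarageArea", "GarageCars"]:
        df[col] = df[col].fillna(df[col].mean())
    # Features these properties simply lack: fill with 0
    for col in ["MasVnrArea", "BsmtFullBath", "BsmtHalfBath", "TotalBsmtSF"]:
        df[col] = df[col].fillna(0)
    # Per the data dictionary, a missing categorical value means "none" of that feature
    cat_cols = df.select_dtypes(include="object").columns
    df[cat_cols] = df[cat_cols].fillna("None")
    return df
```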
Exploratory Data Analysis
Before modeling, we examined the data for underlying relationships between the variables. A correlation analysis revealed several variables that are highly correlated with the response variable, the sale price. The scatterplots below illustrate the relationships between sale price and the eight numerical variables most strongly correlated with it. Several of them represent the size of the house: Ground Living Area (above-ground square feet), Total Basement Square Feet, 1st Floor Square Feet, and Garage Area. The remaining top-eight features are the home’s overall quality (a ranking from 1-10), the number of cars the garage holds, the number of bathrooms, and the year the home was built.
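For readers who want to reproduce the ranking, a quick sketch, assuming df is the imputed dataframe from the preprocessing step with SalePrice as the target column:

```python
# Rank numerical features by the strength of their correlation with SalePrice
numeric = df.select_dtypes(include="number")
correlations = numeric.corr()["SalePrice"].drop("SalePrice")
print(correlations.abs().sort_values(ascending=False).head(8))
```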
Although we can’t directly calculate a correlation between a categorical variable and a continuous response, we can test whether there is a statistically significant difference in the response across a variable’s categories. Here we show the distribution of sale price across groups for two categorical variables, Neighborhood and Kitchen Quality. For each of these variables, there is a visible shift in the distribution of sale price among the groups.
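The post relies on the distribution plots; to formalize the group comparison, a one-way ANOVA is one standard choice. A sketch, using Kitchen Quality (assumed column name KitchenQual) as an example:

```python
from scipy import stats

# Does mean sale price differ significantly across kitchen-quality groups?
groups = [g["SalePrice"].to_numpy() for _, g in df.groupby("KitchenQual")]
f_stat, p_value = stats.f_oneway(*groups)
print(f"F = {f_stat:.1f}, p = {p_value:.2e}")
```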
Feature Engineering and Outliers
To increase the model’s accuracy, we created new features that we believed could be strong predictors of sale price. Based on the correlation testing and some initial modeling, we knew that square footage and neighborhood would be two important features. We created binary variables for garage size (small/large) and for whether a house was remodeled (Y/N). We also created a squared term of the Ground Living Area feature to capture a potentially nonlinear relationship. To capture the value of different neighborhoods, we created a numerical feature ranking each neighborhood by its median home size. Lastly, we created several interaction terms between the remodel indicator and other household features; the coefficients on these interaction terms can be interpreted as the change in a feature’s value with versus without a remodel.
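A sketch of these engineered features follows. The specific cutoffs (a large garage as three or more cars, a remodel defined as YearRemodAdd later than YearBuilt) are illustrative assumptions, not necessarily the exact definitions we used.

```python
# Rank neighborhoods by median above-ground living area (1 = smallest)
neighborhood_rank = df.groupby("Neighborhood")["GrLivArea"].median().rank(method="dense")

df["LargeGarage"] = (df["GarageCars"] >= 3).astype(int)               # illustrative cutoff
df["Remodeled"] = (df["YearRemodAdd"] > df["YearBuilt"]).astype(int)  # illustrative definition
df["GrLivArea_Sq"] = df["GrLivArea"] ** 2                             # nonlinear size effect
df["NeighborhoodNumeric"] = df["Neighborhood"].map(neighborhood_rank)
# Interaction: how the value of living area differs with vs. without a remodel
df["Remodeled_x_GrLivArea"] = df["Remodeled"] * df["GrLivArea"]
```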
The final preprocessing step was to address outliers in the dataset. We tested several approaches in a series of models to determine which produced the strongest fit. The response variable, Sale Price, included some outliers based on an interquartile-range analysis. We tested removing those outliers entirely and applying a log transformation to reduce their number. We also tested removing outliers based on the Ground Living Area distribution, which dropped only 2% of the observations. This approach produced the strongest results, so it is the data subset we used in the subsequent modeling.
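Both strategies are sketched below. The exact Ground Living Area cutoffs are not recorded in this post; the 1st/99th percentiles shown here are stand-ins that remove roughly 2% of observations.

```python
# IQR fence on the target (tested, but not the approach we kept)
q1, q3 = df["SalePrice"].quantile([0.25, 0.75])
iqr = q3 - q1
is_price_outlier = ~df["SalePrice"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)

# Trimming on Ground Living Area instead (the approach we kept)
low, high = df["GrLivArea"].quantile([0.01, 0.99])  # stand-in cutoffs, ~2% removed
trimmed = df[df["GrLivArea"].between(low, high)]
```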
Modeling and Tuning
To determine the best model, we used a series of Pipelines to test different feature sets and targets with different model types. Beginning with linear models, we employed StandardScaler and OneHotEncoder transformations as preprocessing steps for Ridge and Lasso regressions. Within the Pipeline, we used 10-fold cross-validation on the training set. We also tested principal component analysis and polynomial features as additional components. Among the linear models, the strongest results came from a Ridge regression on the sale price, using the dataset trimmed of house-size outliers. With an R^2 of 92.1%, this model accounts for about 92% of the variation in the target.
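A simplified version of the Ridge pipeline is below; the actual feature sets and hyperparameters varied across our experiments.

```python
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

X = trimmed.drop(columns="SalePrice")
y = trimmed["SalePrice"]
num_cols = X.select_dtypes(include="number").columns
cat_cols = X.select_dtypes(include="object").columns

# Scale numeric features, one-hot encode categoricals, then fit Ridge
preprocess = ColumnTransformer([
    ("num", StandardScaler(), num_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols),
])
ridge_pipe = Pipeline([("prep", preprocess), ("ridge", Ridge())])

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
print(cross_val_score(ridge_pipe, X_train, y_train, cv=10, scoring="r2").mean())
```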
We then used GridSearchCV to determine the optimal value of lambda, the regularization strength (scikit-learn calls it alpha). As lambda increases, the training R^2 decreases, since stronger regularization keeps the model from fitting the training data too closely. The optimal lambda sits where the cross-validation score peaks, typically near where the training and validation curves converge.
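A sketch of the grid search, continuing from the pipeline above (the alpha grid is illustrative):

```python
import numpy as np
from sklearn.model_selection import GridSearchCV

# scikit-learn names the regularization strength alpha rather than lambda
param_grid = {"ridge__alpha": np.logspace(-2, 3, 30)}
search = GridSearchCV(ridge_pipe, param_grid, cv=10, scoring="r2",
                      return_train_score=True)
search.fit(X_train, y_train)
print(search.best_params_, round(search.best_score_, 3))
```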
We repeated this process with tree-based models, using a different set of preprocessing steps. For the tree-based models, we label-encoded the categorical variables into integer codes rather than one-hot dummies, since trees split on thresholds rather than relying on a linear scale. Although some models achieved R^2 scores similar to the Ridge regression, Ridge remained the strongest fit.
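A sketch of the tree-based setup, using scikit-learn’s OrdinalEncoder as the column-wise equivalent of LabelEncoder (the Random Forest hyperparameters here are illustrative):

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import OrdinalEncoder

# Integer-code the categoricals; trees need no dummy variables
X_trees = X.copy()
X_trees[cat_cols] = OrdinalEncoder().fit_transform(X_trees[cat_cols].astype(str))

rf = RandomForestRegressor(n_estimators=500, random_state=42)
print(cross_val_score(rf, X_trees, y, cv=10, scoring="r2").mean())
```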
Evaluation and Interpretation
The best tuned model has an R^2 of 93.3%, indicating strong predictive accuracy. The model’s errors, or residuals, are roughly symmetrically distributed around zero, with some larger residuals for higher-priced homes.
Across the variety of models, several features consistently rank as highly important, although different feature-selection methods yield slightly different rankings. In this analysis, we looked at six measures of feature importance: Lasso and Ridge regression coefficient magnitudes, SelectKBest with the f_regression and mutual_info_regression scoring functions, and the feature importance rankings from Decision Tree and Random Forest models.
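Three of the six measures, sketched on the integer-encoded matrix from the tree-based setup above:

```python
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_regression, mutual_info_regression
from sklearn.tree import DecisionTreeRegressor

# Univariate selection under two different scoring functions
top5_f = SelectKBest(f_regression, k=5).fit(X_trees, y)
top5_mi = SelectKBest(mutual_info_regression, k=5).fit(X_trees, y)
print(list(X_trees.columns[top5_f.get_support()]))
print(list(X_trees.columns[top5_mi.get_support()]))

# Tree-based importance ranking for comparison
tree = DecisionTreeRegressor(random_state=42).fit(X_trees, y)
imp = pd.Series(tree.feature_importances_, index=X_trees.columns)
print(imp.sort_values(ascending=False).head(5))
```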
Among the six different measures of feature importance, several features appeared consistently in the top 5 features. Ground Living Area appeared 6/6 times, Overall Quality appeared 4/6 times, and Total Basement Square Feet appeared 4/6 times. Neighborhood Numeric (the numeric ranking of neighborhoods based on median home size) appeared 3/6 times; other models included specific neighborhood features amongst the top 5 most important. In the models’ top 10 most important features, 1st Floor Square Feet appeared 4/6 times, and Year Built 5/6 times.
Using the tuned Ridge regression model to examine coefficients, we can attach dollar values to certain features. Excellent exterior quality is associated with a $16k increase in sale price, while an excellent kitchen is associated with a $12k increase. For above-ground square footage, the coefficient corresponds to roughly a $13k increase in price, though because the features were standardized before fitting, this reflects a one-standard-deviation increase in area rather than a single additional square foot.
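Recovering named coefficients from the fitted pipeline looks roughly like this:

```python
import pandas as pd

# Map Ridge coefficients back to feature names after fitting
ridge_pipe.fit(X_train, y_train)
names = ridge_pipe.named_steps["prep"].get_feature_names_out()
coefs = pd.Series(ridge_pipe.named_steps["ridge"].coef_, index=names)
# Largest-magnitude coefficients, in dollars per standardized unit
print(coefs.reindex(coefs.abs().sort_values(ascending=False).index).head(10))
```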
Actionable Insights
With a large feature set and a sufficiently large number of observations, it is not surprising that a tuned machine learning model can be highly accurate. Real estate consultants could apply this model to future market data to make highly accurate predictions of sale prices for clients. These accurate predictions would allow both home buyers and sellers to make more strategic and data-driven decisions.
However, we are especially interested in the homes for which the model failed to accurately predict the sale price. In particular, observations with a negative residual (actual sale price minus predicted price) had a predicted value greater than the true sale price. For these homes, the model suggests that the actual sale price was lower than the potential sale price, compared with other homes with similar features. While about half of the homes were undervalued by this definition, as we would expect from a model with roughly symmetric residuals, nine homes were undervalued by at least $50k, and one by over $100k.
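Flagging these homes from the fitted pipeline is straightforward. A sketch on the held-out test split, with the $50k threshold mirroring the discussion above:

```python
# Residual = actual minus predicted; negative means the home sold below the model's estimate
residuals = y_test - ridge_pipe.predict(X_test)
undervalued = residuals.sort_values().head(10)  # the most undervalued homes
print((residuals <= -50_000).sum(), "homes undervalued by at least $50k")
```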
These homes could offer insight for real estate consultants, investors, or home buyers into how to identify houses listed below their true value. Through our analysis, we aimed to identify patterns within this subset of undervalued homes that could apply to future listings. We compared the subset of undervalued homes with the full set of homes to look for differences. Though we examined the distributions of highly predictive variables and the correlations between features, we saw no substantive differences. Further market research, including an in-depth analysis and in-person visits to these particular homes, might reveal what drives the gap between the model’s prediction and the sale price.
Further Modeling and Conclusions
With greater time investment, there are further strategies we could try to strengthen the model’s predictive capacity. Through continued feature engineering, we could transform some of the categorical variables into binary ones to reduce the dimensionality of the dataset after dummification. We could also use these binary variables to create more interaction effects that explain relationships between features. We would also consider a different ranking scheme to capture neighborhood quality, potentially using maps to visualize home prices across geographies.
The findings of our model could help home buyers and sellers maximize their returns in a high-stakes transaction. Additionally, building off the insight and content knowledge of real estate experts, the model’s error could be a useful tool for identifying undervalued and overpriced homes in the future. Given the model’s highly accurate predictions, homes that sold at prices far above or below the predicted price could serve as case studies for further research.