Studying Data to Model Real Estate Market Values in Ames
House prices reflect a dizzying blend of factors. In this dataset they range from subjective measures of quality, like home functionality, to objective measures such as finished basement square footage. Though many features play into real estate pricing, our goal was to use the 80 provided features of homes in Ames, Iowa, to build a model that accurately predicts sale price. By forecasting this value, we sought to understand the relationships between key features and house prices and so help buyers and sellers make informed decisions.
To maximize the accuracy of our model, we produced sale price estimates using a stacked approach. Stacking combines the outputs of several different methods, such as linear regression and random forest, to create a metamodel that can perform better than any of its component models, because each individual model can compensate for the weaknesses of the others. We discuss this approach in more detail below.
Here is an illustration of our project workflow:
The raw data for this analysis was sourced from a competition dataset on Kaggle. The training data contained 81 features for over 1,450 houses sold in Ames between 2006 and 2010, including the target feature for prediction: sale price. Before modeling, we transformed the data into a form usable by the machine learning algorithms. Described below are some illustrative examples of the transformations applied. Note that this list is not comprehensive; the entire pre-processing script is available on GitHub.
Handling Missing Data
Some features in the dataset are sparsely populated. In some cases, these features are too rare to be useful. When imputation wasn’t feasible due to lack of information, we dropped the feature. An example of this is “Pool Quality and Condition” (PoolQC), shown below. It has missing values for all but 7 houses. In this case there was very little to be learned about houses that had pools, so the entire feature was ignored.
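A minimal sketch of this kind of filter, assuming pandas. The toy values and the 80% cutoff are made up for illustration and are not taken from the project's pre-processing script.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Ames training data (values are made up).
houses = pd.DataFrame({
    "PoolQC": [np.nan, np.nan, "Gd", np.nan, np.nan,
               np.nan, np.nan, np.nan, np.nan, np.nan],
    "LotArea": [8450, 9600, 11250, 9550, 14260,
                14115, 10084, 10382, 6120, 7420],
})

# Hypothetical cutoff: drop any column missing in more than 80% of rows.
MAX_MISSING_FRACTION = 0.8

# Fraction of missing values per column.
missing_fraction = houses.isna().mean()
sparse_columns = missing_fraction[missing_fraction > MAX_MISSING_FRACTION].index.tolist()

# Keep only the columns that are populated often enough to be useful.
houses_reduced = houses.drop(columns=sparse_columns)
```

Inspecting `missing_fraction` before choosing a cutoff makes it easier to decide, feature by feature, whether imputation is still feasible.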
Binning Numerical Features
For some categorical features, certain values within the category are sparsely populated. Where reasonable, we binned rare categorical values with others to create a more robust feature. For example, “Garage Car Capacity” (GarageCars) has five levels (0, 1, 2, 3, and 4), but level 4 is very sparsely populated.
Since both 3- and 4-car garages are above average, the difference in house price between those values is likely to be minimal. So we combined categories 3 and 4 to get a final list of categories: 0, 1, 2, and 3+.
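The merge of the 3 and 4 levels can be sketched in pandas as follows. The sample values are invented, and the string labels are just one possible way to represent the final categories.

```python
import pandas as pd

# Toy GarageCars column (values are made up).
garage_cars = pd.Series([0, 1, 2, 2, 3, 4, 2, 1], name="GarageCars")

# Cap the capacity at 3 so that 3- and 4-car garages share one level,
# then label the levels as 0, 1, 2, and 3+.
garage_binned = garage_cars.clip(upper=3).map({0: "0", 1: "1", 2: "2", 3: "3+"})
```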
We also combined non-categorical features when appropriate. For example, plotting “First Floor Square Footage” (1stFlrSF) against log(SalePrice) shows a reasonably linear relationship (see below). However, the same is not true of “Second Floor Square Footage” (2ndFlrSF), because some houses in the dataset don’t have a second floor, which adds many zeros to the 2ndFlrSF column. These zeros make linear regression difficult because they conflict with the y-intercept of the regression line implied by the nonzero data, resulting in a poor estimator for the whole feature.
To address this, we combined 1stFlrSF and 2ndFlrSF to create a TotalSF feature, while also introducing a boolean feature, has2ndFlr, that indicates whether a second floor exists at all. Together, these two new features capture the effect of 2ndFlrSF on house price in a form that linear regression can model more faithfully.
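A short pandas sketch of the two derived features, using invented square footages. The column names follow the dataset; everything else is illustrative.

```python
import pandas as pd

# Toy floor areas (values are made up); the second house has no second floor.
floors = pd.DataFrame({
    "1stFlrSF": [856, 1262, 920],
    "2ndFlrSF": [854, 0, 866],
})

# Total living area across both floors.
floors["TotalSF"] = floors["1stFlrSF"] + floors["2ndFlrSF"]

# Boolean flag: does the house have a second floor at all?
floors["has2ndFlr"] = floors["2ndFlrSF"] > 0
```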
Encoding Ordinal Data
Certain categorical features represent values that have an ordered relationship. For example, “Quality of the Exterior Material” (ExterQual) has values:
- Ex: Excellent
- Gd: Good
- TA: Average/Typical
- Fa: Fair
- Po: Poor
We converted these values to integers to incorporate their natural order into the model. The order is clearly visible when the transformed values are plotted against log(SalePrice).
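One way to perform this conversion is an explicit mapping. The sketch below assumes pandas and a 1-to-5 scale; the specific integers are an assumption, since only the ordering is given above.

```python
import pandas as pd

# Toy ExterQual column (values are made up).
exter_qual = pd.Series(["Gd", "TA", "Ex", "Fa", "TA", "Po"], name="ExterQual")

# Assumed integer scale; only the relative order matters to the model.
QUALITY_SCALE = {"Po": 1, "Fa": 2, "TA": 3, "Gd": 4, "Ex": 5}

exter_qual_encoded = exter_qual.map(QUALITY_SCALE)
```

An explicit dictionary keeps the ordering under our control. Automatic label encoders typically sort categories alphabetically, which would not respect the Poor-to-Excellent order.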
Data on Numerical Transformations
Some features, such as “Lot Area” (LotArea), are strongly skewed rather than approximately normal, which tends to hurt the fit of linear models.
In these cases, we applied Box-Cox transformations to bring the distribution closer to normal.
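As an illustration, the following sketch applies a Box-Cox transformation with SciPy to synthetic, right-skewed data standing in for LotArea. The data and its parameters are invented; `scipy.stats.boxcox` requires strictly positive inputs.

```python
import numpy as np
from scipy import stats

# Synthetic right-skewed stand-in for LotArea (not the real data).
rng = np.random.default_rng(0)
lot_area = rng.lognormal(mean=9.1, sigma=0.5, size=500)

# With no lambda supplied, boxcox fits lambda by maximum likelihood
# and returns both the transformed values and the fitted lambda.
lot_area_bc, fitted_lambda = stats.boxcox(lot_area)

# Skewness close to zero indicates a more symmetric distribution.
skew_before = stats.skew(lot_area)
skew_after = stats.skew(lot_area_bc)
```

When transforming held-out data, the lambda fitted on the training data would normally be reused, for example via `stats.boxcox(new_values, lmbda=fitted_lambda)`.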
We considered a variety of models to feed into our stacked metamodel, using packages available in both R and Python. These are summarized in the table below.
| Family | Model | R | Python |
|---|---|---|---|
| Linear Regression | Simple MLR | (caret) leapSeq | (scikit-learn) LinearRegression |
| | SVM w/ Linear Kernel | (caret) svmLinear | |
| | Elastic Net | (caret) glmnet | |
| Decision Trees | Gradient Boosting | (caret) gbm | (scikit-learn) GradientBoostingRegressor |
| | XGBoost | (caret) xgbTree, xgbDART | (XGBoost) XGBRegressor |
| | Random Forest | (caret) rf | (scikit-learn) RandomForestRegressor |
| K Nearest Neighbors | K Nearest Neighbors | (caret) kknn | |
Ultimately, we chose the following four models:
- Linear Regression (sklearn.linear_model)
- Random Forest (sklearn.ensemble)
- Gradient Boosting (sklearn.ensemble)
- XGBoost (xgboost)
We believe these four models introduce a sufficiently wide range of methodology into our metamodel, capturing the many nuances of the data and producing accurate predictions. We also trained a stacked model in R using the caretEnsemble package, but decided not to incorporate it into our final metamodel, since it didn’t improve our predictions.
Once the data was processed and the models selected, the next step was to tune the hyperparameters for each kind of model. Different tuning parameters that control the bias vs. variance tradeoff are required for each model. Tuning is important, as poor choice of hyperparameters can cause a model to over-fit or under-fit the data. Proper hyperparameter tuning improves model accuracy for both training and test data. Additionally, hyperparameters can significantly impact model runtime. Consider, for example, the effect of n_estimators, the number of trees used in RandomForestRegressor, on root-mean-square error, RMSE, and training time:
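A comparison of this kind can be reproduced with a small loop. The sketch below uses synthetic regression data from scikit-learn rather than the Ames data, and the tree counts are arbitrary choices for illustration.

```python
import time

from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Synthetic data standing in for the processed housing features.
X, y = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=0)

results = []
for n_estimators in (10, 50, 100):
    model = RandomForestRegressor(n_estimators=n_estimators, random_state=0)
    start = time.perf_counter()
    # scikit-learn reports negated RMSE so that larger is always better.
    scores = cross_val_score(model, X, y, cv=3,
                             scoring="neg_root_mean_squared_error")
    elapsed = time.perf_counter() - start
    results.append({
        "n_estimators": n_estimators,
        "rmse": -scores.mean(),
        "seconds": elapsed,
    })
```

Each entry in `results` pairs a tree count with its cross-validated RMSE and the wall-clock time spent, which is the information needed to plot error and runtime against n_estimators.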
We used RandomizedSearchCV from the scikit-learn package in Python to aid in hyperparameter tuning. This approach trains many models with cross-validation, using a limited number of random combinations from supplied ranges of hyperparameters. The error is stored for each trained model, and the hyperparameters that produce the least error are returned. Because the search in RandomizedSearchCV is not exhaustive, there is no guarantee that these are the “best possible” hyperparameters. However, the output from RandomizedSearchCV can then be used as a jumping-off point for GridSearchCV to find even more finely tuned hyperparameters. See the full tuning script on GitHub for more detail.
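The two-stage search can be sketched as follows. This is an illustration on synthetic data, not the project's tuning script: the hyperparameter ranges, the number of iterations, and the way the grid is built around the random-search result are all assumptions.

```python
from scipy.stats import randint
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

# Synthetic data standing in for the processed housing features.
X, y = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=0)

# Stage 1: sample a few random combinations from broad (assumed) ranges.
param_distributions = {
    "n_estimators": randint(20, 100),
    "max_depth": randint(2, 10),
}
random_search = RandomizedSearchCV(
    RandomForestRegressor(random_state=0),
    param_distributions=param_distributions,
    n_iter=5,
    cv=3,
    scoring="neg_root_mean_squared_error",
    random_state=0,
)
random_search.fit(X, y)
best_trees = int(random_search.best_params_["n_estimators"])
best_depth = int(random_search.best_params_["max_depth"])

# Stage 2: exhaustive grid in a small neighborhood of the stage-1 result.
param_grid = {
    "n_estimators": [best_trees],
    "max_depth": [best_depth - 1, best_depth, best_depth + 1],
}
grid_search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid=param_grid,
    cv=3,
    scoring="neg_root_mean_squared_error",
)
grid_search.fit(X, y)
```

Because the grid contains the combination found by the random search, the refined result should be at least as good under the same cross-validation splits.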
Model Stacking and Performance
Model stacking is a method that combines predictions of several different models. Using the predictions from these different methods, a metamodel can be trained to produce an even more accurate prediction of the target variable. This is a powerful machine learning approach because it can incorporate models of many different types (trees, linear regression, etc.). This way, the weaknesses of one model can be counterbalanced by the strengths of another. This will capture different kinds of relationships in the data that an individual model may miss.
Model stacking has some disadvantages, though. It increases computation time, although this was not a major concern in our case because the dataset is relatively small. Using a stacked model also reduces interpretability, since the impact of individual features on predicted sale price is obscured by the stacking algorithm. This challenge is addressed below in the conclusions.
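For readers who want to experiment with the idea, here is a minimal stacking sketch using scikit-learn's StackingRegressor on synthetic data. It is not necessarily how our metamodel was assembled: the choice of a linear metamodel and the hyperparameters are assumptions, and XGBoost is left out only to keep the example free of extra dependencies.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import (
    GradientBoostingRegressor,
    RandomForestRegressor,
    StackingRegressor,
)
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data standing in for the processed housing features.
X, y = make_regression(n_samples=300, n_features=10, noise=15.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Base models of different types, so their errors are less correlated.
base_models = [
    ("linear", LinearRegression()),
    ("forest", RandomForestRegressor(n_estimators=50, random_state=0)),
    ("gbm", GradientBoostingRegressor(random_state=0)),
]

# The metamodel is trained on out-of-fold predictions of the base models
# (cv=5), which limits leakage from the training targets.
stack = StackingRegressor(
    estimators=base_models,
    final_estimator=LinearRegression(),
    cv=5,
)
stack.fit(X_train, y_train)

# R^2 on held-out data.
stack_r2 = stack.score(X_test, y_test)
```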
Data on Model Performance
Accuracy of our models was evaluated by comparing the predictions of each model with known sale prices in the training data. Below are graphs of the predicted price vs. the true price. The cross-validation score indicates the root-mean-square logarithmic error, RMSLE, of our model. Smaller error is better! Predictions lying on the line are equal to the true sale prices. Of the models used, gradient boosting and linear regression performed best. As expected, the stacked model outperformed all individual models.
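For reference, RMSLE can be computed with a few lines of NumPy. This sketch uses the common log(1 + x) form of the metric; for values as large as house prices, the difference from using a plain logarithm is negligible. The function name is ours, chosen for illustration.

```python
import numpy as np


def rmsle(y_true, y_pred):
    """Root-mean-square logarithmic error, using log(1 + x)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    log_diff = np.log1p(y_pred) - np.log1p(y_true)
    return float(np.sqrt(np.mean(log_diff ** 2)))


# Made-up prices: identical predictions give zero error.
perfect_score = rmsle([200000, 150000], [200000, 150000])
```

Because the error is measured on a log scale, it penalizes relative rather than absolute mistakes: a 10% miss on an inexpensive house counts about as much as a 10% miss on an expensive one.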
As mentioned earlier, stacked models make it difficult to interpret the impact of individual features on predicted values. Therefore, it is necessary to take a step back and analyze the constituent models to consider which features are most significant. Here are some insights we gained from the feature importance of each model.
- Gradient Boosting
  - Overall Quality and Total Square Footage are most important
  - Kitchen Quality, Garage Features, and Central Air are also significant
- Random Forest
  - Total Square Footage, Overall Quality, and Lot Area are most important
  - Garage Features and Fireplaces have a significant influence on price
- Linear Models
  - Lot Area, Zoning, and Neighborhood are most important
  - Central Air has a significant influence on price
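Rankings like these can be read from a fitted tree ensemble. The sketch below fits a random forest to synthetic data in which two named features drive the target and a third is pure noise; the data, coefficients, and feature set are invented for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic features: two informative columns and one noise column.
rng = np.random.default_rng(0)
n = 300
features = pd.DataFrame({
    "TotalSF": rng.normal(1500, 400, n),
    "OverallQual": rng.integers(1, 11, n),
    "Noise": rng.normal(0, 1, n),
})

# Made-up log-price that depends only on TotalSF and OverallQual.
log_price = (
    11.0
    + 0.0004 * features["TotalSF"]
    + 0.1 * features["OverallQual"]
    + rng.normal(0, 0.05, n)
)

forest = RandomForestRegressor(n_estimators=100, random_state=0)
forest.fit(features, log_price)

# Impurity-based importances sum to 1; sort from most to least important.
importances = pd.Series(
    forest.feature_importances_, index=features.columns
).sort_values(ascending=False)
```

Impurity-based importances indicate how much a feature contributes to the model's splits, not the direction or size of its effect on price, so they are best read alongside the coefficients of the linear models.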
Considering these, we can make some general recommendations to buyers and sellers:
- Kitchen and Garage Quality greatly influence the price of a house
  - Buyers: consider buying a house with a kitchen that needs improvement at a low price and doing it yourself for a good return
  - Sellers: consider upgrading your kitchen to fetch a higher price at market
- Central Air contributes substantially to house price
  - Buyers: consider getting a bargain on a nice house without central air
  - Sellers: consider retrofitting central air
- Basement Quality is more important than its size
  - Buyers: save money by buying houses with unfinished basements
  - Sellers: it may be worth it to invest in your basement even if it’s small
Note that these recommendations are broad and should be assessed on a case-by-case basis. For example, the cost-effectiveness of retrofitting a house with central air depends greatly on the structure and size of the house.