Power of a Predictive Model for Ames, Iowa Housing
Image by freepik
Rationale
With constant changes in the economy, finding an appropriately priced house is becoming more and more challenging. The task is already challenging enough with each house claiming a justification for its price that often appears at odds with the pricing of comparable listings. Why is this house more expensive than one that was built more recently and that has more livable square footage? Does an older house typically go for a higher price than a newer one? What model is used to price housing listings?
These questions are often answered by real estate agents, insurance companies, property appraisers, and assessors. Because some of these parties have a vested interest in setting the price higher or lower, buyers and sellers need a source they can trust to verify a property's valuation. A data-driven model can answer these questions while mitigating that bias.
The Ames Housing Dataset showcases how a model can be used to check whether the price of a house is appropriate for the listing. The dataset includes 2580 housing listings in the town of Ames, Iowa, from 2011 to 2016, each described by 81 features such as the MSZoning of the property and the year built. The dataset and data dictionary are provided here.
Data Cleanup
A first look at this dataset shows many missing ('NA' / 'NaN') values. Where a sensible group average existed, I filled the missing value with the average of the matching category. For example, when a record shows a listing with a detached garage whose square footage is missing, I used the average square footage of detached garages in the same neighborhood as a filler. Otherwise, 0 or 'None' was used for missing values, according to the data dictionary provided. The figures below show all of the missing values in the numerical (fig. 1) and categorical (fig. 2) columns of this dataset.
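The fill strategy described above can be sketched with pandas. This is a minimal illustration under stated assumptions, not the original cleanup code: the tiny data frame is made up, and the column names (Neighborhood, GarageType, GarageArea) are assumed to match the data dictionary.

```python
import numpy as np
import pandas as pd

# Tiny synthetic stand-in for the Ames data; all values are made up.
df = pd.DataFrame({
    "Neighborhood": ["NAmes", "NAmes", "NAmes", "Gilbert", "Gilbert", "Gilbert"],
    "GarageType": ["Detchd", "Detchd", np.nan, "Attchd", "Attchd", np.nan],
    "GarageArea": [400.0, np.nan, np.nan, 500.0, 600.0, np.nan],
})

# Assumption: a missing GarageType means "no garage", so the area becomes 0
# and the category becomes the string 'None'.
no_garage = df["GarageType"].isna()
df.loc[no_garage, "GarageArea"] = 0.0
df["GarageType"] = df["GarageType"].fillna("None")

# Remaining gaps: mean area of the same garage type in the same neighborhood.
group_mean = (
    df.groupby(["Neighborhood", "GarageType"])["GarageArea"].transform("mean")
)
df["GarageArea"] = df["GarageArea"].fillna(group_mean)

print(df)
```

The second NAmes listing has a detached garage with no recorded area, so it receives the mean area of the other detached garages in NAmes.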


Exploratory Data Analysis (EDA)
Histograms show how the listings are distributed across each feature. Here we can see that most of the listings were sold in June. Most of the listings also either do not have a pool or the pool data was not included in the listing. The same is true for every type of porch and for the miscellaneous value of the house.

Numerical Features
Next we look at a scatter plot of the numerical features plotted against the price. This may make some relationships more apparent.

The scatter plots indicate that the features that deal with square footage and area of the property have a stronger relationship to the price than other features. That a larger property sells for a higher price makes intuitive sense.
Categorical Features
Box plots let us compare the sale price across the groups of each categorical variable.

One feature that stands out is the categorical variable OverallQual. As the OverallQual increases, we find that the sale price of the property also rises. There seems to be a slight parabolic relationship.

From these graphs we can see that the time of year of the listing purchase does not seem to have an effect on the sale price. Many of the categorical features shown in the graphs above do not seem to have a substantial influence on the sale price of a listing.
To quantify and compare the correlation between features, we can also create a heat map. Below is a Pearson correlation heat map that shows the correlation between each pair of features.

Again we can see that the features pertaining to the size of the property have the highest correlation coefficients with the price of the listing. It is also apparent that many of the features are not independent of each other. This makes intuitive sense because many of the features are just a combination of other features. For example, GrLivArea is essentially the sum of 1stFlrSF and 2ndFlrSF.
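A short sketch of how such a correlation matrix can be computed with pandas. The data is synthetic, and the column names are assumed from the data dictionary; only the mechanics carry over to the real dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200

# Synthetic stand-in: living area is built as the sum of the two floors,
# which mimics the dependence between features discussed above.
first = rng.uniform(600, 1500, n)
second = rng.uniform(0, 900, n)
df = pd.DataFrame({
    "1stFlrSF": first,
    "2ndFlrSF": second,
    "GrLivArea": first + second,
    "MoSold": rng.integers(1, 13, n).astype(float),
})
df["SalePrice"] = 100 * df["GrLivArea"] + rng.normal(0, 20000, n)

# Pearson correlation between every pair of numerical columns.
corr = df.corr(method="pearson")

# Features ranked by their correlation with the target.
price_corr = corr["SalePrice"].drop("SalePrice").sort_values(ascending=False)
print(price_corr)
```

The resulting matrix is what a heat map visualizes. Large off-diagonal entries between predictors, such as GrLivArea and 1stFlrSF here, are the sign of multicollinearity.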
We also have to check the distribution of the dependent variable, SalePrice. To check this, I created a Q-Q plot comparing the quantiles of the sale price with those of a normal distribution.

This shows that the distribution of the sale price is close to normal, though not perfectly so.
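The numbers behind a Q-Q plot can be obtained with scipy.stats.probplot, as in the sketch below. The prices are synthetic and right-skewed, and the log-transformed comparison is only a reference point added for illustration, not a step taken in the analysis above.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic right-skewed "prices"; the real SalePrice column would go here.
sale_price = rng.lognormal(mean=12.0, sigma=0.4, size=1000)

# probplot returns the Q-Q points plus a least-squares line through them.
# The correlation r of that fit is close to 1 when the data is near normal.
(theo_q, ordered), (slope, intercept, r_raw) = stats.probplot(
    sale_price, dist="norm"
)

# Same check on the log of the prices, for comparison only.
(_, _), (_, _, r_log) = stats.probplot(np.log(sale_price), dist="norm")

print(f"r (raw prices): {r_raw:.4f}")
print(f"r (log prices): {r_log:.4f}")
```

Plotting `theo_q` against `ordered` gives the Q-Q plot itself. Points that bend away from the fitted line in the upper tail indicate right skew.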
Models
Multiple Linear Regression (MLR)
The first model is a multiple linear regression. Because of the multicollinearity among the features, a model that includes all of them is unreliable, since its coefficients become unstable and hard to interpret. To combat the multicollinearity, I used a Lasso regression to eliminate unimportant features that confounded others, increasing alpha to 100 so that enough features were dropped. The final model had 50 features: 7 categorical and 43 numerical variables.
The preprocessing for this model included filling in the missing values as described above. I then dummified the categorical variables and fit the model. The results are as follows:
5-Fold CV: [0.87637975, 0.87170431, 0.89832157, 0.88903719, 0.8767197]
Avg Score: 0.8824325029749648
Standard Deviation: 0.009802555355095176
Mean Absolute Error (MAE): 14993.9308422297
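One way to combine Lasso-based feature elimination with an ordinary linear regression is scikit-learn's SelectFromModel, sketched below on synthetic data. This is an assumption about how the two steps could be wired together, not the original notebook code, and the alpha value is chosen for the synthetic target rather than for dollar-scale prices.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso, LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Synthetic data: 40 features, only 8 of which actually drive the target.
X, y = make_regression(n_samples=300, n_features=40, n_informative=8,
                       noise=10.0, random_state=0)

# Step 1: Lasso shrinks unhelpful coefficients to zero; SelectFromModel
# keeps only the columns whose coefficient survived.
selector = SelectFromModel(Lasso(alpha=5.0, max_iter=10000))

# Step 2: plain linear regression on the surviving columns.
mlr = make_pipeline(selector, LinearRegression())

scores = cross_val_score(mlr, X, y, cv=5, scoring="r2")
print("5-Fold CV:", np.round(scores, 4))
print("Avg Score:", scores.mean())

# Count the features kept when fitting on all the data.
mlr.fit(X, y)
n_kept = int(mlr.named_steps["selectfrommodel"].get_support().sum())
print("Features kept:", n_kept)
```

Putting the selector inside the pipeline means the feature selection is redone on each training fold, so the cross-validation score is not inflated by selecting features on the held-out data.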
Lasso Regression
The preprocessing for this model is very similar to that of the multiple linear regression. However, this model required scaling of the numerical columns. I also kept all of the features, which yielded 263 feature columns after dummifying the categorical variables.
Basic Model
This model is very similar to the first, the key difference being the added L1 penalty term. The penalty trades a small amount of bias for lower variance. I cross-validated the model with a five-fold cross-validation. To verify the cross-validation provided by the scikit-learn library, I also performed a manual cross-validation. See this link to GitHub, where the manual process is shown to give the same results as the library function. The notebook also documents how the grid search was performed. I initially ran a Lasso model using the default alpha value of 1 to produce the results below.
5-Fold CV: [0.9136383, 0.9094807, 0.9219369, 0.9113320, 0.8994931]
Average Score: 0.911176214002537
Score Variance: 5.224943623285341e-05
Mean Absolute Error (MAE): 15014.7596649066
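The check of a manual cross-validation against the library function can be sketched as follows. The data is synthetic and the fold settings (shuffling, random_state=42) are assumptions; the point is only that looping over the same KFold splits reproduces what cross_val_score returns.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.metrics import r2_score
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=200, n_features=20, noise=15.0,
                       random_state=0)

# Scaling lives inside the pipeline so it is refit on every training fold.
model = make_pipeline(StandardScaler(), Lasso(alpha=1.0))
kfold = KFold(n_splits=5, shuffle=True, random_state=42)

# Library version.
library_scores = cross_val_score(model, X, y, cv=kfold, scoring="r2")

# Manual version: fit a fresh copy on each training fold, score on the rest.
manual_scores = []
for train_idx, test_idx in kfold.split(X):
    fold_model = clone(model).fit(X[train_idx], y[train_idx])
    predictions = fold_model.predict(X[test_idx])
    manual_scores.append(r2_score(y[test_idx], predictions))
manual_scores = np.array(manual_scores)

print("library:", np.round(library_scores, 6))
print("manual: ", np.round(manual_scores, 6))
print("Score Variance:", manual_scores.var())
```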
Hyper-parameter Tuned Model
I then ran a grid search over 200 alpha values ranging from 0.001 to 400 to find the alpha that yields the best score. Using the optimum from that search, I narrowed the range to 5 through 105 to get a more precise value. Once this process was complete, the optimal alpha came out to be 18.09140703517588. The model with alpha = 18.091407 produced these results:
5-Fold CV: [0.915465, 0.914625, 0.926296, 0.914839, 0.906486]
Avg Score: 0.915542
Variance: 3.980348e-5
Mean Absolute Error (MAE): 14487.5988681087
The five best-scoring alpha values from the grid search, ranked by mean test score:
| | mean_fit_time | std_fit_time | mean_score_time | std_score_time | param_alpha | params | split0_test_score | split1_test_score | split2_test_score | split3_test_score | split4_test_score | mean_test_score | std_test_score | rank_test_score |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
9 | 0.316764 | 0.066726 | 0.001932 | 0.001041 | 18.091407 | {'alpha': 18.09140703517588} | 0.915465 | 0.914625 | 0.926296 | 0.914839 | 0.906486 | 0.915542 | 0.006309 | 1 |
8 | 0.357375 | 0.074949 | 0.001409 | 0.000273 | 16.081362 | {'alpha': 16.081361809045227} | 0.915981 | 0.914712 | 0.926181 | 0.914746 | 0.906060 | 0.915536 | 0.006397 | 2 |
13 | 0.219259 | 0.051798 | 0.002175 | 0.001150 | 26.131588 | {'alpha': 26.13158793969849} | 0.913709 | 0.914235 | 0.926558 | 0.915287 | 0.907765 | 0.915511 | 0.006115 | 3 |
10 | 0.288941 | 0.064642 | 0.003860 | 0.003044 | 20.101452 | {'alpha': 20.101452261306534} | 0.914835 | 0.914518 | 0.926415 | 0.914934 | 0.906819 | 0.915504 | 0.006265 | 4 |
12 | 0.247641 | 0.045751 | 0.001957 | 0.000959 | 24.121543 | {'alpha': 24.121542713567838} | 0.913978 | 0.914320 | 0.926529 | 0.915201 | 0.907480 | 0.915502 | 0.006160 | 5 |
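A two-stage alpha search like the one described can be sketched with GridSearchCV. The coarse grid below matches the 200 values between 0.001 and 400 mentioned above. The data is synthetic, and the width and resolution of the second, finer grid are assumptions made for this example.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=200, n_features=30, n_informative=10,
                       noise=50.0, random_state=0)

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("lasso", Lasso(max_iter=10000)),
])

# Stage 1: coarse grid of 200 evenly spaced alphas between 0.001 and 400.
coarse_grid = {"lasso__alpha": np.linspace(0.001, 400, 200)}
coarse = GridSearchCV(pipe, coarse_grid, cv=5, scoring="r2")
coarse.fit(X, y)
coarse_alpha = coarse.best_params_["lasso__alpha"]

# Stage 2: finer grid in a window around the coarse optimum
# (the window of +/- 2 and the 41 points are assumptions).
fine_values = np.linspace(max(coarse_alpha - 2.0, 0.001), coarse_alpha + 2.0, 41)
fine = GridSearchCV(pipe, {"lasso__alpha": fine_values}, cv=5, scoring="r2")
fine.fit(X, y)

# cv_results_ converts directly into a ranked table of candidates.
results = pd.DataFrame(fine.cv_results_).sort_values("rank_test_score")
columns = ["param_lasso__alpha", "mean_test_score", "std_test_score",
           "rank_test_score"]
print(results[columns].head())
```

Because the alpha parameter sits inside a pipeline step named "lasso", the results column is called param_lasso__alpha here rather than param_alpha.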
Random Forest
This is a tree-based model, and its preprocessing is a little different from that of the other two models. The categorical variables were label encoded, the null values were filled in the same manner as for the other two models, and the numerical columns were not scaled.
Because of the complexity of the random forest model, training on too many features can be computationally expensive. To combat this issue, I ran a basic model and kept its top 25 features. This gave a score similar to the model with all 81 features while drastically reducing training time. This reduced model is the one I went on to tune. With default parameters, the top-25-feature model yielded the results below:
Basic Model
5-Fold CV: [0.89937707 0.88896888 0.91208996 0.90068164 0.89815625]
Avg Score: 0.8998547613063407
Variance: 5.439986003774692e-05
Mean Absolute Error (MAE): 14073.6235987055
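Keeping the top 25 features of a first random forest can be sketched with the feature_importances_ attribute. The data is synthetic and the forest size is reduced to keep the example quick, so the scores are only illustrative.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=300, n_features=60, n_informative=10,
                       noise=10.0, random_state=0)


def make_forest():
    """Small forest with a fixed seed so the runs are reproducible."""
    return RandomForestRegressor(n_estimators=30, random_state=42)


# Fit once on every feature and rank the features by importance.
ranker = make_forest().fit(X, y)
top_idx = np.argsort(ranker.feature_importances_)[::-1][:25]
X_top = X[:, top_idx]

# Compare cross-validated R2 with all features and with the top 25 only.
# Note: ranking on the full data before cross-validating is a shortcut
# that can make the reduced score slightly optimistic.
full_score = cross_val_score(make_forest(), X, y, cv=5, scoring="r2").mean()
top_score = cross_val_score(make_forest(), X_top, y, cv=5, scoring="r2").mean()

print(f"all 60 features: {full_score:.4f}")
print(f"top 25 features: {top_score:.4f}")
```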

Hyper-parameter Tuned Model
I then ran a GridSearchCV over the following parameters:
- 'n_estimators': [300,400,500,700,800,900, 1000],
- 'max_features': [6,8,10,12,14,16],
- 'n_jobs': [-1],
- 'random_state': [42]
I found that the best parameters were as follows and yielded the following results:
- max_features=8
- n_estimators=1000
5-Fold CV: [0.90603073 0.90814054 0.92284517 0.91104507 0.90473758]
Avg Score: 0.910559819855974
Variance: 4.228588616393399e-05
Mean Absolute Error (MAE): 14038.422081507166
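The random forest grid search can be sketched as below. The grid is deliberately shrunk so the example runs in a few seconds; the grid listed above, with up to 1000 trees, follows the same pattern. The data is synthetic, and random_state is set on the estimator instead of inside the grid, which has the same effect as a single-value grid entry.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# 25 columns, standing in for the top 25 features kept earlier.
X, y = make_regression(n_samples=200, n_features=25, n_informative=10,
                       noise=10.0, random_state=0)

# Shrunk grid for illustration only.
param_grid = {
    "n_estimators": [30, 60],
    "max_features": [6, 8, 10],
}

search = GridSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid,
    cv=5,
    scoring="r2",
)
search.fit(X, y)

print("best parameters:", search.best_params_)
print("best mean R2:", search.best_score_)
```

After fitting, `search.best_estimator_` holds a forest refit on all the data with the winning parameters, ready for prediction.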

Model Evaluation and Selection
This is a summary of the performance of the models. The table below shows the R2 scores of the basic and tuned models. Because R2 is on the same scale for every model, it makes model comparisons straightforward. As seen in the table, the Lasso regression has the highest R2 score, followed by the random forest and, finally, the multiple linear regression (MLR).
Model | Basic Model Score | Tuned Model Score |
---|---|---|
MLR | 0.8824325029749648 | N/A |
Lasso | 0.911176214002537 | 0.915542 |
Random Forest | 0.899854761306341 | 0.910559819855974 |
Another way to measure the performance of a model is the mean absolute error (MAE), the average absolute difference between the predicted and the actual sale price. Here the MAE is measured in dollars ($), which makes the model performance easier to interpret. The table shows that the random forest has the lowest MAE, followed by the Lasso and, finally, the multiple linear regression.
Model | Basic Model MAE | Tuned Model MAE |
---|---|---|
MLR | 14993.9308422297 | N/A |
Lasso | 15014.7596649066 | 14487.5988681087 |
Random Forest | 14073.6235987055 | 14038.4220815072 |
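Both metrics can be collected in one pass with cross_validate, as in this sketch. The data and model settings are synthetic stand-ins, and how the MAE figures above were actually computed is not stated, so treat this as one possible approach.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso, LinearRegression
from sklearn.model_selection import cross_validate

X, y = make_regression(n_samples=200, n_features=15, noise=20.0,
                       random_state=0)

models = {
    "MLR": LinearRegression(),
    "Lasso": Lasso(alpha=1.0),
    "Random Forest": RandomForestRegressor(n_estimators=50, random_state=42),
}

summary = {}
for name, model in models.items():
    cv = cross_validate(model, X, y, cv=5,
                        scoring=("r2", "neg_mean_absolute_error"))
    summary[name] = {
        "r2": cv["test_r2"].mean(),
        # scikit-learn reports the negated MAE so that higher is better;
        # flip the sign to get the error in the target's own units.
        "mae": -cv["test_neg_mean_absolute_error"].mean(),
    }

for name, row in summary.items():
    print(f"{name:14s} R2={row['r2']:.3f}  MAE={row['mae']:.1f}")
```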
Here is a box plot of the models and their respective model variances.

Conclusion and Possible Future Work
After evaluating all the models, we can see that the model with the lowest R2 score and the highest MAE is the multiple linear regression model. The variance of the scores across the five cross-validation folds was below 0.01 for every model, which suggests that each model is stable across different subsets of the data. The model with the highest R2 was the Lasso model, while the model with the lowest MAE was the Random Forest model.
The Random Forest model brought to light that OverallQual was the most important feature in pricing. This is consistent with what was seen in the EDA, where the box plot of OverallQual vs. SalePrice showed a clear relationship. OverallQual may have stood out more in the Random Forest model because, as the box plot suggested, its relationship with SalePrice is not linear, and tree-based models can capture such relationships without extra transformations.
These models can be further expanded upon with feature engineering. No new features were created from the given details of the listings. For example, a feature could be engineered from a combination of features, with OverallQual serving as the scale for the new feature. We could also include data from before 2011 or data from other fields. For example, we could cross-reference crime statistics in Ames, Iowa, and try to incorporate them into the model. We could also add geospatial data for more detail within a neighborhood.
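As one hypothetical example of such an engineered feature, living area could be scaled by the quality rating. The QualArea name and the choice of GrLivArea as the second ingredient are assumptions made purely for illustration, and the values are made up.

```python
import pandas as pd

# Made-up listings; column names assumed from the data dictionary.
df = pd.DataFrame({
    "OverallQual": [5, 7, 9],
    "GrLivArea": [1200, 1500, 2400],
})

# Hypothetical engineered feature: living area weighted by overall quality.
df["QualArea"] = df["OverallQual"] * df["GrLivArea"]

print(df)
```

Whether a feature like this helps would have to be checked with the same cross-validation used for the models above.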