Boosting Real Estate Decisions
How can we advise better real estate purchasing decisions in Ames, Iowa?
Introduction
Buying a house can be a life-changing decision for a first-time home buyer, or it can be just another business purchase for a house flipper. Even though these purchasing decisions were being made long before the emergence of widespread computing and machine learning, we can use these technologies to enhance the choices we make when it comes to substantial investments.
Dean De Cock compiled the Ames housing data as a replacement for the traditional Boston Housing Dataset in regression classes. His original study forms the basis of the dataset used here, which I augmented with additional records (2580 rows x 82 columns) to help with training the ML models. I used Python in Jupyter Notebooks for most of the technical work in this study and present a short introduction to replicating it in Dataiku at the end.
Data Cleansing & Pre-Processing
The data wasn’t ready to use as-is because of missing and duplicated values. Several features included a considerable number of nulls, and some records were duplicated. I cleaned the data with a few methods, such as dropping features with a high percentage of nulls, imputing medians, and removing duplicates. For a full overview of how I cleaned the data, view the presentation slides or my GitHub.
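As a minimal sketch (the file name and the null-percentage threshold here are assumptions for illustration), the cleaning steps look roughly like this in pandas:

```python
import pandas as pd

df = pd.read_csv("ames_housing.csv")  # hypothetical file name

# Drop features whose null percentage is too high (50% is an assumed threshold)
null_pct = df.isna().mean()
df = df.drop(columns=null_pct[null_pct > 0.5].index)

# Impute remaining numeric nulls with each column's median
num_cols = df.select_dtypes(include="number").columns
df[num_cols] = df[num_cols].fillna(df[num_cols].median())

# Remove duplicated rows
df = df.drop_duplicates()
```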

For certain linear models, categorical data must be converted into dummy variables so the model can interpret it correctly, with one column dropped to avoid multicollinearity. I also wanted to test how the different algorithms respond to dummification with and without a dropped column, as well as to ordinal label encoding, so I separated my testing into three data transformations: ordinal, dummified with a column dropped, and dummified with no column dropped.
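A rough sketch of the three transformations, assuming `df` is the cleaned dataframe from the sketch above:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

cat_cols = df.select_dtypes(include="object").columns  # df: cleaned dataframe

# Dummified with a column dropped (avoids multicollinearity for linear models)
dummies_dropped = pd.get_dummies(df, columns=cat_cols, drop_first=True)

# Dummified with no column dropped
dummies_full = pd.get_dummies(df, columns=cat_cols, drop_first=False)

# Ordinal / label-encoded version
ordinal = df.copy()
ordinal[cat_cols] = OrdinalEncoder().fit_transform(ordinal[cat_cols].astype(str))
```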

In addition to each data transformation, I wanted to see the influence of outliers on the future models. To that end, I separated each data transformation into three ways of filtering outliers.
- “3 * IQR”: removes any outliers beyond 3 times the IQR. A typical multiplier is 1.5, but out of personal judgment I went with 3.
- “All Outliers”: the base case to compare against “3 * IQR” and “Only Normal.” Neither filter is applied; we simply use the original data, with no outliers removed after the cleansing and data transformations.
- “Only Normal”: at the recommendation of Dean De Cock in his research, we keep only the houses sold under a normal sale condition. This filters out sales such as foreclosures, family deals, etc.
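As a sketch, the “3 * IQR” and “Only Normal” filters could look like this (applied here to SalePrice; SaleCondition is the standard Ames column name):

```python
# df: cleaned/transformed dataframe from the sketches above

# "3 * IQR" filter, shown on SalePrice (it could also be applied per feature)
q1, q3 = df["SalePrice"].quantile([0.25, 0.75])
iqr = q3 - q1
iqr_filtered = df[df["SalePrice"].between(q1 - 3 * iqr, q3 + 3 * iqr)]

# "Only Normal": keep houses sold under a normal sale condition
only_normal = df[df["SaleCondition"] == "Normal"]
```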
Below is the visualization of the data transformations and outlier filtering process described above:

Feature Engineering
Feature engineering is an important step for capturing more information from the data that could contribute to the accuracy of the models. Some of the features I added to the project (a code sketch follows below) are:
- ‘TotalHouseSF’, which is a combination of 1st, 2nd and basement square footage.
- ‘QualityOutdoorSF’, which is a combination of deck and porch square footage.
- ‘TotalBathroomCount’, which counts half bathrooms as 0.5 and full bathrooms as 1.0.
Additional features applied can be found in my presentation slides and GitHub.
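A sketch of these engineered features using the standard Ames column names (exactly which porch and deck columns count toward ‘QualityOutdoorSF’ is assumed here for illustration):

```python
# df: cleaned dataframe from the earlier sketches
df["TotalHouseSF"] = df["1stFlrSF"] + df["2ndFlrSF"] + df["TotalBsmtSF"]

df["QualityOutdoorSF"] = (
    df["WoodDeckSF"] + df["OpenPorchSF"] + df["EnclosedPorch"]
    + df["3SsnPorch"] + df["ScreenPorch"]
)

df["TotalBathroomCount"] = (
    df["FullBath"] + df["BsmtFullBath"]
    + 0.5 * (df["HalfBath"] + df["BsmtHalfBath"])
)
```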
Model Building
The algorithms and regression techniques used in the study include MLR, Lasso, Ridge, ElasticNet, GradientBoosting, RandomForest, XGBoost, and CatBoost. Models were built for each combination of the data transformations and outlier filters described above, plus tuned variants, which came out to more than 84 models. Scoring was done with the 5 K-Fold means of the R-squared and RMSE values; a quick snapshot of the scores dataframe is shown below.


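For reference, here is a minimal sketch of the scoring loop (default hyperparameters are shown, and `X`, `y` come from one of the transformed datasets; the actual runs covered every transformation and outlier-filter combination):

```python
import pandas as pd
from sklearn.linear_model import LinearRegression, Lasso, Ridge, ElasticNet
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.model_selection import cross_validate
from xgboost import XGBRegressor
from catboost import CatBoostRegressor

X = ordinal.drop(columns="SalePrice")  # one of the transformed datasets
y = ordinal["SalePrice"]

models = {
    "MLR": LinearRegression(),
    "Lasso": Lasso(),
    "Ridge": Ridge(),
    "ElasticNet": ElasticNet(),
    "GradientBoosting": GradientBoostingRegressor(),
    "RandomForest": RandomForestRegressor(),
    "XGBoost": XGBRegressor(),
    "CatBoost": CatBoostRegressor(verbose=0),
}

scores = []
for name, model in models.items():
    cv = cross_validate(model, X, y, cv=5,
                        scoring=("r2", "neg_root_mean_squared_error"))
    scores.append({
        "model": name,
        "r2_5fold_mean": cv["test_r2"].mean(),
        "rmse_5fold_mean": -cv["test_neg_root_mean_squared_error"].mean(),
    })

scores_df = pd.DataFrame(scores).sort_values("r2_5fold_mean", ascending=False)
```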
Hyperparameter Tuning
As you may have noticed in the previous figures, some of the model names include ‘_tuned’. To achieve higher-scoring models, I utilized hyperparameter tuning. AWS explains its function this way:
Hyperparameters directly control model structure, function, and performance. Hyperparameter tuning allows data scientists to tweak model performance for optimal results. This process is an essential part of machine learning, and choosing appropriate hyperparameter values is crucial for success.
I explored two packages for this purpose: scikit-learn’s GridSearchCV and Optuna.

I ultimately found that, for XGBoost on my data, Optuna performed better than GridSearchCV, with a higher 5 K-Fold mean R-squared, a lower RMSE, and a quicker run time. Optuna utilizes an adaptive Bayesian search to find its optimal parameters and has built-in visualizations. If you intend to use Optuna for your models, just bear in mind that it is a standalone Python library and is not part of scikit-learn.
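As a rough sketch (the search space shown here is illustrative, not the exact ranges I used), tuning XGBoost with Optuna looks like this:

```python
import optuna
from sklearn.model_selection import cross_val_score
from xgboost import XGBRegressor

def objective(trial):
    # Illustrative search space; tune the ranges to your data
    params = {
        "n_estimators": trial.suggest_int("n_estimators", 200, 2000),
        "max_depth": trial.suggest_int("max_depth", 2, 10),
        "learning_rate": trial.suggest_float("learning_rate", 0.01, 0.3, log=True),
        "subsample": trial.suggest_float("subsample", 0.5, 1.0),
        "colsample_bytree": trial.suggest_float("colsample_bytree", 0.5, 1.0),
    }
    model = XGBRegressor(**params, random_state=42)
    # Maximize the 5 K-Fold mean R-squared
    return cross_val_score(model, X, y, cv=5, scoring="r2").mean()

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=100)
best_params = study.best_params

# Built-in visualization of hyperparameter importances (requires plotly)
optuna.visualization.plot_param_importances(study)
```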
Below is a quick view of a few visualizations Optuna provides on its hyperparameter importances:


Stacked Model
“The whole is greater than the sum of its parts.” -Aristotle

I explored further methods of enhancing the predictions. A stacked ensemble model is a combination of models that can utilize the various predictive strengths in a single model. For this study, I took my best two performers alongside a Ridge model; the final estimator was LinearRegression.
Not only did this achieve better R-squared and RMSE scores, it also allowed me to combine the strengths of my three different models in a single model. The pros of this approach include flexibility, predictive strength, and a robust solution. But it’s not without its drawbacks: the cons of using an ensemble include low interpretability and added complexity.
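A minimal sketch of the stacked ensemble with scikit-learn’s StackingRegressor (which two boosted models were my best performers is assumed here for illustration; the Ridge base learner and LinearRegression final estimator match the setup described above):

```python
from sklearn.ensemble import StackingRegressor
from sklearn.linear_model import LinearRegression, Ridge
from xgboost import XGBRegressor
from catboost import CatBoostRegressor

stacked = StackingRegressor(
    estimators=[
        ("xgb", XGBRegressor(**best_params)),   # tuned XGBoost from the Optuna sketch
        ("cat", CatBoostRegressor(verbose=0)),  # assumed second-best performer
        ("ridge", Ridge()),
    ],
    final_estimator=LinearRegression(),
    cv=5,
)
stacked.fit(X, y)
```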
Best vs. Explainable Model

Above are the differences in scoring between the two models. To achieve the highest possible predictive power, I would opt for a variation of the stacked model. I figure this would be best in a situation where I don’t specifically need to speak to the importance of each individual feature and am just looking for a quick housing price, such as in a user-facing application.
For the purposes of assisting individual home buyers or house flippers, I would opt for the more explainable model. In other words, I would be willing to sacrifice some predictability for better interpretability. I believe in that situation it would be better and easier to explain the importance of specific features to those who aren’t well-versed in the technical aspects of data modeling. For these reasons, I chose the more explainable model for this project.
That model is XGBoost tuned with Optuna, trained on the ordinally transformed data with normal sale conditions only. Although the ordinal model performed a bit worse than others, it is easier to explain the features when they are kept whole rather than split into dummies. I chose the normal-sale-condition data both because it performed well and because it is probably in line with what a typical buyer would encounter; a family sale, foreclosure, or anything similar would be handled as a special case.

In the graph above on the left, you can see the predicted values vs. actual values. In a home-buying advisory, I would initially lean towards the points furthest under the red dotted line, as their predicted values are lower than the actual values of the properties. Similarly, the residual plot on the right shows the residual errors relative to the red dotted line.
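A sketch of how plots like these can be produced with cross-validated predictions (`final_model` is a placeholder for the chosen XGBoost model):

```python
import matplotlib.pyplot as plt
from sklearn.model_selection import cross_val_predict

preds = cross_val_predict(final_model, X, y, cv=5)  # final_model: chosen XGBoost

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))

# Predicted vs. actual values with a red dotted reference line
ax1.scatter(y, preds, alpha=0.4)
ax1.plot([y.min(), y.max()], [y.min(), y.max()], "r--")
ax1.set(xlabel="Actual SalePrice", ylabel="Predicted SalePrice")

# Residuals relative to the zero line
ax2.scatter(preds, y - preds, alpha=0.4)
ax2.axhline(0, color="r", linestyle="--")
ax2.set(xlabel="Predicted SalePrice", ylabel="Residual")

plt.show()
```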
Feature Evaluation
To evaluate the XGBoost model, I installed and used SHapley Additive exPlanations (SHAP), a game-theoretic approach to explaining the output of any machine learning model. You can read more about the specifics of SHAP here.
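A minimal sketch of producing the SHAP summaries for the tuned XGBoost model (the hyperparameters are assumed to come from the Optuna study above):

```python
import shap
from xgboost import XGBRegressor

model = XGBRegressor(**best_params).fit(X, y)  # the chosen ordinal, normal-only model

explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)

# Global feature importance (bar) and the beeswarm summary
shap.summary_plot(shap_values, X, plot_type="bar")
shap.summary_plot(shap_values, X)
```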

In the plot above, you can see that OverallQual had the most impact on SalePrice. Overall quality is our most important feature here, and the goal would be to increase it. This leads us to the next step of the analysis: identifying how specific feature values push the model’s predictions. In the visualizations below, the further toward the red, the greater the positive contribution to value, and vice versa for the blue values.

Here is an example of using SHAP value dependence to view a plausible business case derived from the model. On the left, we see that around 2,750 square feet of unfinished basement lands at roughly -$14,000 of value for the house, whereas if the basement were finished at around the same square footage, the contribution increases to around $21,000. That is a potential $35,000 swing in SalePrice from fixing up the basement!
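Continuing the SHAP sketch above (reusing `model`, `shap_values`, and `X`), the basement dependence plots can be generated like this; BsmtUnfSF and BsmtFinSF1 are the standard Ames column names:

```python
import shap  # model, shap_values, and X are assumed from the earlier SHAP sketch

# SHAP contribution vs. unfinished basement square footage
shap.dependence_plot("BsmtUnfSF", shap_values, X)

# SHAP contribution vs. finished basement square footage
shap.dependence_plot("BsmtFinSF1", shap_values, X)
```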
For a deeper dive on specific feature relationships like this, take a look at the presentation. More are displayed in my GitHub.
Recommendations
The primary focus is on increasing Overall Quality for the highest change in price. Additional improvements that correlate with higher sale prices include the following:
- Increasing total house square footage (through 1st, 2nd, and basement floor renovations)
- Increasing the greater living area
- Finishing an unfinished basement can create a considerable change in price
Prices can also increase as a result of adding sought-after features. For example:
- Every additional 0.5 bathroom after 2 adds value, up to 5 bathrooms
- A fireplace is better than no fireplace, and 3 are better than 1
- Adding a garage, porches, or decks (over certain thresholds) also increases sale prices.
It’s important to always assess the cost-benefit of the improvements you undertake with the goal of selling the home. If the improvement costs less than the model’s predicted gain ($907 per 100 sq ft added to the greater living area), great!
If it costs more, you have to answer these questions:
- Is it part of a larger project that will add total sqft?
- Will this increase my overall quality?
- Am I including a bathroom that brings me over 2 to 2.5?
Limitations
- Not knowing how overall quality (or some of the other features) is measured.
- Amount of data may not be optimal for some algorithms.
- Housing data is just in Ames, Iowa for the given time period.
Future Work
- Using the PID to pull in pictures or other information to assist in prediction; De Cock mentions that this may be possible with certain sources.
- Replicating the work on other housing markets to see how the findings differ elsewhere.
- Application or interface for a housing agent or home buyer to view these findings interactively.
Replicating the work with Dataiku
(I am not affiliated with Dataiku, and I used a free trial to replicate my work in this project on my own)
To show the importance of using tools to streamline tasks, I demonstrate a couple of Dataiku features that enabled me to reproduce the same project within a few hours. These images may be clearer in the presentation, but Dataiku enables users to visually identify null values in the data and replace them easily while maintaining an audit trail. I did about half of the cleaning with Dataiku’s no-code features, and I copy/pasted the code from my notebook to demonstrate the coding potential inside Dataiku as well.


After the data was cleaned and filtered for outliers, I used the AutoML capabilities of Dataiku to reinforce my results. I was pleasantly surprised to find that the individual models performed near the level of the models from my manual hyperparameter tuning. Although they had slightly lower scores, I didn’t spend much time tuning or testing different ways to create newer models. This reassured me that gradient boosting algorithms were the best performers for this data.

Links / Accreditations
- Feature Image Created with Image Generator GPT by NAIF J ALOTAIBI on ChatGPT4
- Google Slides Deck (Last slide has sources used)
- GitHub