A Dashboard Application for Data-driven House Flipping

Try the app here | GitHub repo | LinkedIn: Moritz | Hao-Wei | Matthew | Oren

Background and Business Objectives


U.S. house prices skyrocketed in 2021, with April-July 2021 marking four consecutive months of record-high year-on-year home value appreciation. In such a competitive real estate market, commercial real estate developers need to quickly identify which opportunities to pursue.

This includes understanding:

  • How features contribute to value and what features should be included in a home to achieve a target sale price
  • Which houses offer high-return renovation or flipping potential (e.g. by adding square footage, improving house quality, finishing the basement, etc.)

With this project, we built a dashboard application to help real estate developers evaluate properties. The app allows virtual modification of a given house to assess its potential as a flipping candidate, utilizing machine learning algorithms to predict the house’s sale price after renovation. 

 

Data Sources


For this project, we investigated house sales in the city of Ames, Iowa, a college town home to Iowa State University. The dataset used in this project was collected by the City of Ames Assessor and contains house features and sale prices from 2006 to 2010. It contains ~2600 houses with 54 features, including:

  • Square footages of many parts of the house
  • Condition and quality ratings
  • Types of materials/finish
  • Miscellaneous features such as pools, fireplaces, etc

Though the dataset includes each house’s neighborhood, we integrated more granular geospatial features from OpenStreetMap. Addresses for all properties were converted to latitude-longitude coordinates using geopy and the Google Maps API. We also obtained coordinates for 131 classes of OpenStreetMap points of interest (POIs) in Ames, spanning a wide variety of natural, cultural, and infrastructural features, and then counted the POIs of each class within a 1-mile radius of each property.

In the image above, the house (blue pin) has five church POIs (green icons) within a 1-mile radius; this house is assigned a “Churches” feature of 5.
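The radius-count features can be sketched with a plain haversine distance check. This is a minimal illustration; the function and variable names are ours, not the project's code:

```python
import math

def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two lat/lon points."""
    r = 3958.8  # mean Earth radius in miles
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def count_pois_within_radius(house, pois, radius_miles=1.0):
    """Count POIs (list of (lat, lon) tuples) within the radius of a house (lat, lon)."""
    return sum(
        1 for lat, lon in pois
        if haversine_miles(house[0], house[1], lat, lon) <= radius_miles
    )
```

Running this once per POI class per property produces the 131 candidate count features described above.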

 

Data Cleaning


Houses sold under abnormal conditions (foreclosure, auction, etc.) or sold between family members were excluded due to heavily skewed sale prices. These sales were not representative of the sale prices of properties with comparable features.

Houses with 0 bedrooms above ground were also excluded. Many of these houses had [total square footage = basement square footage = first floor square footage], leading us to believe they were basements being sold as separate units.

Finally, properties with non-residential zoning classifications were excluded, leaving a final dataset of ~2500 houses. Briefly, other data cleaning included handling missing values, string processing, and combining similar values of some categorical features.
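The exclusion rules above reduce to a few pandas masks. A minimal sketch, assuming the Ames dataset's standard column names (`SaleCondition`, `BedroomAbvGr`, `MSZoning`):

```python
import pandas as pd

def clean_sales(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the exclusion rules described above (Ames column names assumed)."""
    # Keep only normal sale conditions: drops foreclosures, auctions,
    # and between-family sales with skewed prices.
    df = df[df["SaleCondition"] == "Normal"]
    # Drop 0-bedroom records that appear to be basements sold separately.
    df = df[df["BedroomAbvGr"] > 0]
    # Keep residential zoning classifications only (assumed code set).
    residential = {"RL", "RM", "RH", "FV", "RP"}
    df = df[df["MSZoning"].isin(residential)]
    return df.reset_index(drop=True)
```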

 

Feature Selection and Engineering


Our initial approach was to begin with the full set of 130+ features and to use Lasso regression and CatBoost, a gradient boosting model, to remove unimportant features. We found this method unreliable due to the strong multicollinearity between many features in the dataset.

For example, Lasso regression might assign a high coefficient to “garage area”, suggesting that garage area has an important contribution to home value. However, removal of garage area from the model would cause negligible change in accuracy and the model would simply assign a high coefficient to “garage # of cars” instead. There were similar cases to this in many categories of the house features.

With this in mind, we changed our approach and built a feature set from the ground up with 3 primary considerations:

  • Inclusion of core features traditionally used in real estate and property development (e.g. total square footage; number of bedrooms, baths, and half baths; house age). Some of these could be dropped with minimal accuracy loss, but we anticipate that dashboard users will expect to see and use this information.
  • A target accuracy of 94%, corresponding to a mean error of ~$10k. We felt this was an acceptable error for data with an interquartile range of $130k to $211k.
  • Minimization of multicollinearity as measured by variance inflation factor (VIF).
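VIF can be computed directly from its definition: regress each feature on all the others and take 1 / (1 - R²). Values near 1 indicate an independent feature, while large values flag multicollinearity. A self-contained NumPy sketch:

```python
import numpy as np

def vif(X: np.ndarray) -> np.ndarray:
    """Variance inflation factor per column: 1 / (1 - R^2) when regressing
    that column on all the others (intercept included)."""
    n, k = X.shape
    out = np.empty(k)
    for j in range(k):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(n), others])  # design matrix with intercept
        beta, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ beta
        r2 = 1 - (resid @ resid) / ((y - y.mean()) ** 2).sum()
        out[j] = 1.0 / (1.0 - r2)
    return out
```

The "garage area" vs. "garage # of cars" pair discussed above is exactly the kind of redundancy this metric surfaces.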

Of particular note, the quality and condition ratings were generally strongly correlated with "Overall Quality". We scaled all rating features from 0 to 1 and expressed them as differences from Overall Quality, reflecting more granular information on how various parts of the home (the condition and quality of the kitchen or exterior, for example) independently influence sale price.
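A sketch of that rating transformation, assuming the Ames ordinal codes (Po/Fa/TA/Gd/Ex) and the 1-10 `OverallQual` column; the exact 0-1 mapping here is illustrative:

```python
import pandas as pd

# Ordinal quality codes in the Ames data, mapped to a 0-1 scale (assumed mapping).
QUALITY_SCALE = {"Po": 0.0, "Fa": 0.25, "TA": 0.5, "Gd": 0.75, "Ex": 1.0}

def rating_deltas(df, rating_cols, overall_col="OverallQual"):
    """Express each rating as its difference from scaled Overall Quality."""
    out = df.copy()
    overall = (out[overall_col] - 1) / 9.0  # rescale 1-10 to 0-1
    for col in rating_cols:
        scaled = out[col].map(QUALITY_SCALE)
        out[col + "_delta"] = scaled - overall
    return out
```

A delta of 0 then means "this part of the house matches its overall quality", which is what decorrelates the rating features from Overall Quality.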

From the original dataset, 21 features were included or engineered. OpenStreetMap POIs were then selected via stepwise addition to maximize accuracy in a CatBoost model, with the top 15 POIs included as features in the final dataset. The final dataset had ~2500 properties with 36 features.
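The stepwise POI selection can be sketched as greedy forward selection; here scikit-learn's `GradientBoostingRegressor` stands in for CatBoost so the example is self-contained:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score

def forward_select(X_base, poi_features, y, n_keep=15, cv=5):
    """Greedy stepwise addition: at each step, add the POI column that most
    improves cross-validated R^2. poi_features: {name: 1-D column array}."""
    selected, remaining = [], dict(poi_features)

    def score(extra_cols):
        X = np.column_stack([X_base] + extra_cols) if extra_cols else X_base
        model = GradientBoostingRegressor(random_state=0)
        return cross_val_score(model, X, y, cv=cv, scoring="r2").mean()

    best = score([])
    while remaining and len(selected) < n_keep:
        trials = {name: score([poi_features[m] for m in selected] + [col])
                  for name, col in remaining.items()}
        name = max(trials, key=trials.get)
        if trials[name] <= best:
            break  # no remaining POI improves accuracy
        best = trials[name]
        selected.append(name)
        del remaining[name]
    return selected
```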

 

Model Tuning and Selection


We trained five models to compare performance and behavior:

  • Lasso regression
  • CatBoost
  • Support vector regression (SVR) with linear kernel
  • SVR with gaussian kernel
  • Stacking ensemble model

Hyperparameters were tuned by grid search with shuffled 5-fold cross-validation.
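A representative tuning setup, shown here for the Lasso model (the alpha grid is illustrative, not the grid we actually searched):

```python
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Shuffled 5-fold CV, as used for all models.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
search = GridSearchCV(
    make_pipeline(StandardScaler(), Lasso(max_iter=10000)),
    param_grid={"lasso__alpha": [0.001, 0.01, 0.1, 1.0]},
    cv=cv,
    scoring="r2",
)
# search.fit(X, y); then inspect search.best_params_ and search.best_score_
```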

Other tree-based methods were explored; CatBoost was used because it outperformed random forest and XGBoost on this dataset.

Gaussian SVR was approximated using scikit-learn’s KernelRidge model, which replaces SVR’s epsilon-insensitive loss with a smooth squared (ridge) loss. This admits a closed-form solution, allowing significantly faster fitting than SVR with very similar predictions. Gaussian kernel ridge was chosen over true Gaussian SVR because the project’s time constraints necessitated shorter training times.
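The two models can be set up side by side in scikit-learn (hyperparameter values here are illustrative):

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge
from sklearn.svm import SVR

# Both use a Gaussian (RBF) kernel; KernelRidge swaps SVR's
# epsilon-insensitive loss for a squared loss, which has a closed-form
# solution and fits much faster while predicting very similarly.
krr = KernelRidge(kernel="rbf", alpha=0.01, gamma=1.0)
svr = SVR(kernel="rbf", C=100.0, gamma=1.0, epsilon=0.01)
```

Fitting both on the same data and comparing predictions confirms the near-identical behavior that justified the substitution.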

Finally, a stacking ensemble model was trained on the outputs of the other 4 models. Ridge regression (with tuning by gridsearch and 5-fold cross validation) was used to find the optimal weight for each sub-model’s output within the ensemble.
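The ensemble can be sketched with scikit-learn's `StackingRegressor`, which fits the sub-models on cross-validated folds and lets a ridge meta-model weight their out-of-fold predictions (CatBoost again swapped for sklearn's gradient boosting to keep the sketch self-contained):

```python
from sklearn.ensemble import GradientBoostingRegressor, StackingRegressor
from sklearn.kernel_ridge import KernelRidge
from sklearn.linear_model import Lasso, RidgeCV
from sklearn.svm import SVR

# Stand-ins for the four sub-models; RidgeCV learns each one's weight
# from their out-of-fold predictions.
stack = StackingRegressor(
    estimators=[
        ("lasso", Lasso(alpha=0.01, max_iter=10000)),
        ("gbr", GradientBoostingRegressor(random_state=0)),
        ("linear_svr", SVR(kernel="linear")),
        ("gaussian_krr", KernelRidge(kernel="rbf", alpha=0.1)),
    ],
    final_estimator=RidgeCV(),
    cv=5,
)
```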

Cross-validation R-squared scores for the 5 models are shown below.

Model               | Out-of-sample R² (%)
--------------------|---------------------
Lasso               | 93.4
CatBoost            | 94.1
Linear SVR          | 93.5
Gaussian SVR        | 94.5
Stacking ensemble   | 95.1

 

CatBoost and Gaussian SVR gave slightly higher accuracy than Lasso and linear SVR, suggesting that there may be nonlinear relationships between some of the features and house sale price that the linear models could not capture as well. Though its accuracy was comparable to CatBoost’s, Gaussian SVR was chosen as the final model for the house-flipping tool, as its predictions were more readily interpretable.

As depicted above, Gaussian SVR (blue) predicted smooth changes in sale price given changes in features, while CatBoost (green) predicted large instantaneous jumps in sale price at certain thresholds of many features. The ensemble model outperformed all four of the original models, but its predictions also displayed CatBoost’s nonsmooth behavior, though to a lesser degree.

 

Dashboard Application


The dashboard application is written in Python using Plotly Dash and GeoPandas. It has 3 main pages:

  • Filtering and Mapping
  • House Flipping Tool
  • Feature Exploration

The pages of the app follow a general 3-step approach to house flipping:

  • Find a suitable candidate house
  • Consider changes to make
  • Understand how those changes will influence return

The application aims to reduce bias from the individual in predicting the potential value of the updated home and facilitate better-informed decisions about whether to invest. 

 

Page 1: Filtering and Mapping


This page gives an overview of the housing market in Ames, Iowa. It includes a map of the entire area of Ames, with the option to display the homes and various OpenStreetMap POIs concurrently. 

Using the POI overlays, the user is able to visualize the makeup and characteristics of different neighborhoods and parts of the city. Below, we can see that parks and recreational areas (green tree icons) in Ames are distributed mostly in the northern and western areas of the city and seem to be concentrated around the residential areas.

The user can filter the dataset using the sliders. All the houses that meet the filter criteria are shown on the map and listed in the dataframe below the map with more specific information. The map and dataframe update dynamically. The current filters are limited to price, bedrooms, bathrooms, and total square footage, but could be easily modified to include any of the other features. 
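Behind the sliders, the filtering logic reduces to range masks over the dataframe. A sketch with assumed column names:

```python
import pandas as pd

def filter_listings(df, price=(0, 1e9), beds=(0, 99), baths=(0, 99), sqft=(0, 1e9)):
    """Return the houses matching the page-1 slider ranges (column names assumed)."""
    mask = (
        df["SalePrice"].between(*price)
        & df["Bedrooms"].between(*beds)
        & df["Bathrooms"].between(*baths)
        & df["TotalSF"].between(*sqft)
    )
    return df[mask]
```

Extending the filters to any other feature is a one-line addition to the mask, which is why the current price/bed/bath/square-footage set could be easily expanded.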

Users can select a specific home to export to the house-flipping tool by clicking the bubble on the left of its row in the dataframe.

 

Page 2: House-flipping Tool


This page allows users to consider various changes they might make to the home and evaluate the predicted returns for different changes.

The table on the left displays attributes of the unmodified “current” home, with its actual sale price at the top.

In the center of the page are sliders and dropdowns for various features that the user can virtually modify. The modifiable features are limited to changes that the user can reasonably make to the home, such as quality/finish levels, square footage, the number of beds/baths/half-baths, or adding a pool. It is important to note that some aspects of the home, such as lot size or building type, are not easily changeable. While such features can be relevant for some forms of residential flipping (e.g. merging multiple parcels of land), this project focused on flipping single-family homes.

The table on the right represents the modified “future” home, with its predicted price at the top. The features listed in the table update dynamically as the user moves the sliders/dropdowns in the middle section. The predicted price also updates dynamically as the modified house is passed to the Gaussian SVR model described previously. This allows the user to better understand how much value might be added to the house by their selected changes.
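The core of this step is simple: copy the selected house's feature row, overwrite the features the user changed, and rerun the trained model. A minimal sketch (the function name and column handling are ours, not the app's actual code):

```python
import pandas as pd

def predict_flip(model, house: pd.Series, changes: dict) -> float:
    """Predict the post-renovation price: copy the house's feature row,
    apply the user's virtual modifications, and rerun the trained model."""
    modified = house.copy()
    for feature, value in changes.items():
        modified[feature] = value
    return float(model.predict(modified.to_frame().T)[0])
```

In the app, this function would be wrapped in a Dash callback so the predicted price refreshes on every slider movement.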

In the example image above, the home's predicted value has increased by ~$25k as a result of upgrading 403 square feet of low-quality basement to a high-quality finish and improving the overall condition of the house from 0.33 to 0.5.

This tool does not account for the cost of making the modifications selected by the user; it only predicts the value of the home after modification. Users must employ their domain knowledge on the costs of renovation to evaluate profitability.

 

Page 3: Feature Exploration


To ensure that the machine learning algorithms used by the application are not simply a black box for users, we included a page for exploration of how different features impact value in the model’s predictions. This page can be used to:

  • Understand the resultant changes in predicted value given a change in features
  • Decide which features should be changed to maximize value
  • Evaluate how much to change those features
  • Determine whether improving a given feature has diminishing returns

All houses meeting the filter criteria from page 1 are shown in a scatterplot and in the dataframe at the bottom of the page.

The scatterplot shows GrLivArea (above-grade living area in square feet) vs. sale price, colored by neighborhood, with more information for each house shown on mouse-over.

Clicking on a house in the scatterplot initializes 2 bivariate EDA graphs for that house. These EDA graphs allow visualization of any continuous feature vs sale price, overlaid with any categorical feature. The enlarged circles show the house’s current attributes. Users can also view predictions from the 5 different algorithms we trained and/or compare the predictions from multiple algorithms. X denotes the ensemble model.

The graph above shows Gaussian SVR predictions of sale price vs. high-quality basement area, overlaid with masonry veneer type. The enlarged circle shows that the selected house currently has a stone masonry veneer and about 2160 square feet of high-quality basement. For this house, the model predicts that stone is the most valuable masonry veneer type, followed by brick and finally by no masonry veneer. Further, it predicts a linear relationship between sale price and high-quality basement area, without diminishing returns.
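Each such curve is generated by sweeping a single feature across a range of values while holding the house's other features fixed, a partial-dependence-style computation. A sketch (names are illustrative):

```python
import pandas as pd

def prediction_curve(model, house: pd.Series, feature: str, values) -> pd.DataFrame:
    """Sweep one feature across a range of values, holding the rest of the
    house fixed, and record the model's predicted sale price at each point."""
    rows = []
    for v in values:
        h = house.copy()
        h[feature] = v
        rows.append(h)
    grid = pd.DataFrame(rows)
    return pd.DataFrame({feature: list(values),
                         "predicted_price": model.predict(grid)})
```

Because the engineered feature set has minimal multicollinearity, holding "everything else constant" in this sweep is a reasonable approximation rather than an extrapolation into implausible feature combinations.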

These graphs allow users to explore the feature set interactively and generate their own insights on what modifications might be best to make to a candidate house.

 

Conclusions and Future Directions


Major takeaways of this project include:

  • Many features in this housing dataset (and likely others) are highly interdependent and multicollinear. For example, larger homes often also have better finishes, more amenities, etc.
  • House sale prices can be modeled accurately using a relatively small set of features.
  • The dashboard application utilizes a feature set engineered to have minimal multicollinearity, allowing visualization of a given feature’s effect on sale price with everything else held constant.
  • Understanding the overall distribution of features across the dataset is useful for identifying trends in the real estate market, but allowing users to investigate the impact of feature changes on specific individual homes provides more actionable insight for house-flipping.

Future directions include:

  • Improving the interface and user instructions in the application.
  • Obtaining a larger dataset that spans a wider geographical area.
  • Finding data and incorporating simple models/estimates of renovation cost for the modifiable features so that the application can directly generate insights on profitability.

Thank you for reading about our work! If you are interested in our other projects, please check out our author pages.

About Authors

Matthew Fay

Matthew is a Data Science Fellow with a BS in chemistry from UNC-Chapel Hill. After 3 years studying towards a dual MD-PhD and researching antibody engineering, he pivoted to pursue data science and analytics. He has a passion...

Moritz Becker

Strategy Consultant, with a passion for creating impact from data-driven business insights. Originally from Germany, I have been working in the US as an Engagement Manager in Strategy Consulting for over 3 years. My projects at work focus...

Hao-Wei

Hao-Wei is an NYC Data Science Academy Fellow with master's degrees in Communication Engineering and Mathematics from National Taiwan University, and a Ph.D. in Mathematics from the Pennsylvania State University. With a broad experience ranging from...
