A Data-driven Dashboard Application for House Flipping
The skills the author demonstrated here can be learned by taking the Data Science with Machine Learning bootcamp at NYC Data Science Academy.
Try the app here | GitHub repo | LinkedIn: Moritz Hao-Wei Matthew Oren
Background and Business Objectives
U.S. house prices have skyrocketed in 2021, with data from April-July 2021 marking four consecutive months of record high year-on-year home value appreciation. In any competitive real estate market, commercial real estate developers need to be able to quickly identify which opportunities to pursue.
This includes understanding:
- How features contribute to value and what features should be included in a home to achieve a target sale price
- Which houses offer high-return renovation or flipping potential (e.g. by adding more square footage, improving house quality, finishing the basement, etc.)
With this project, we built a dashboard application to help real estate developers evaluate properties. The app allows virtual modification of a given house to assess its potential as a flipping candidate, utilizing machine learning algorithms to predict the house’s sale price after renovation.
Data Sources and Geospatial Features
For this project, we investigated house sales in the city of Ames, Iowa, a college town home to Iowa State University. The dataset, collected by the City of Ames Assessor, covers house features and sale prices from 2006 to 2010 and contains ~2600 houses with 54 features, including:
- Square footages of many parts of the house
- Condition and quality ratings
- Types of materials/finish
- Miscellaneous features such as pools, fireplaces, etc.
Though the dataset includes the neighborhood each house is in, we integrated more granular geospatial features using OpenStreetMap. Addresses for all properties were converted to latitude-longitude coordinates using geopy and the Google Maps API. Coordinates for 131 different classes of OpenStreetMap points of interest (POIs) in Ames were also obtained, spanning a wide variety of natural, cultural, and infrastructural features. The number of POIs of each class within a 1-mile radius of each property was then calculated.
In the image above, the house (blue pin) has five church POIs (green icons) within a 1-mile radius; this house is assigned a “Churches” feature of 5.
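As a rough illustration of how these POI count features can be computed, the sketch below counts POIs of each class within a 1-mile haversine radius of every house. The `houses` and `pois` DataFrames and their column names are hypothetical stand-ins for our geocoded data rather than the project's actual code.

```python
import numpy as np
import pandas as pd

def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between points given in degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, [lat1, lon1, lat2, lon2])
    a = np.sin((lat2 - lat1) / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2
    return 3958.8 * 2 * np.arcsin(np.sqrt(a))  # Earth radius ~3958.8 miles

def poi_counts_within_1_mile(houses: pd.DataFrame, pois: pd.DataFrame) -> pd.DataFrame:
    """For each house, count the POIs of each class within a 1-mile radius."""
    counts = []
    for _, h in houses.iterrows():
        d = haversine_miles(h["lat"], h["lon"], pois["lat"].values, pois["lon"].values)
        counts.append(pois.loc[d <= 1.0, "poi_class"].value_counts())
    return pd.DataFrame(counts).fillna(0).set_index(houses.index)
```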
Data Cleaning
Houses sold under abnormal conditions (foreclosure, auction, sold between family members) were excluded due to heavily skewed sale prices. These sales were not representative of the sale prices of properties with comparable features.
Houses with 0 bedrooms above ground were also excluded. Many of these listings had total square footage equal to both the basement and first-floor square footage, leading us to believe they were basements being sold as separate units.
Finally, properties with non-residential zoning classifications were excluded, leaving a final dataset of ~2500 houses. Briefly, other data cleaning included handling missing values, string processing, and combining values of some categorical features due to similarity or to small sample size.
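For reference, these exclusions amount to a few pandas filters along the lines of the sketch below; the Ames column names and category codes (SaleCondition, BedroomAbvGr, MSZoning and their values) are our assumptions about the raw data rather than a verbatim excerpt of our cleaning code.

```python
import pandas as pd

def apply_exclusions(df: pd.DataFrame) -> pd.DataFrame:
    # Drop sales under abnormal conditions (foreclosures, auctions, family sales).
    df = df[~df["SaleCondition"].isin(["Abnorml", "Family"])]
    # Drop listings with no above-ground bedrooms (likely basements sold as separate units).
    df = df[df["BedroomAbvGr"] > 0]
    # Keep only residential zoning classifications (codes assumed here).
    df = df[df["MSZoning"].isin(["RL", "RM", "RH", "FV"])]
    return df.reset_index(drop=True)
```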
Feature Selection and Engineering
A major goal of feature engineering in this project was mitigation of multicollinearity within the dataset, for both modeling and business reasons. For example, we found that more expensive homes tended to have:
- Larger square footages
- Better amenities
- Better materials/finishes
- Higher quality/condition ratings
Many of these features increased collinearly. But what adds more value to a home, adding more space or improving the amenities?
To best inform dashboard users on which renovations maximize return, our models needed to capture the independent contribution to sale price from each feature.
Our initial approach was to begin with the full set of 180+ features and to remove unimportant features using Lasso regression and CatBoost, a gradient boosting model. We found this method unreliable due to duplicate information between many features in the dataset.
For example, Lasso regression might assign a high coefficient to “garage area”, suggesting that garage area contributes substantially to home value. However, removing garage area from the model caused a negligible change in performance; the model simply assigned a high coefficient to “garage # of cars” instead. Similar cases appeared across many categories of house features.
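The abandoned first pass looked roughly like the sketch below: fit a cross-validated Lasso on the standardized full feature matrix and inspect which coefficients shrink to zero. Here `X_full` and `y` are assumed to hold the 180+ features and the sale prices.

```python
import pandas as pd
from sklearn.linear_model import LassoCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Standardize features so the Lasso coefficients are comparable across scales.
pipe = make_pipeline(StandardScaler(), LassoCV(cv=5))
pipe.fit(X_full, y)

coefs = pd.Series(pipe[-1].coef_, index=X_full.columns).sort_values(key=abs, ascending=False)
drop_candidates = coefs[coefs == 0].index  # features Lasso zeroed out
```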
Building the Feature Set from the Ground Up
With this in mind, we changed our approach and built a feature set from the ground up with 3 primary considerations:
- Inclusion of core features traditionally used in real estate and property development (e.g. total square footage, number of bedrooms, baths, half-baths, house age). Some of these could be dropped with minimal accuracy loss, but we anticipate that dashboard users will expect to see and use this information.
- A target accuracy of 94%, corresponding to a mean error of about $10k. We felt this was an acceptable error for sale prices with an interquartile range of $130k to $211k.
- Minimization of multicollinearity as measured by variance inflation factor (VIF).
Of particular note, the quality and condition ratings were all strongly correlated with "Overall Quality". We scaled every rating feature to a 0-1 range and expressed it as a difference from Overall Quality, capturing more granular information on how various parts of the home (the condition and quality of the kitchen or exterior, for example) independently influence sale price.
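A simplified version of that transformation is sketched below, assuming the ordinal ratings have already been encoded as integers; the column names and rating scales are illustrative.

```python
import pandas as pd

# Maximum value of each ordinal rating scale (illustrative).
RATING_SCALES = {"KitchenQual": 5, "ExterQual": 5, "BsmtQual": 5, "HeatingQC": 5}

def relative_ratings(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    overall = (out["OverallQual"] - 1) / 9           # 1-10 scale rescaled to 0-1
    out["OverallQual01"] = overall
    for col, max_rating in RATING_SCALES.items():
        scaled = (out[col] - 1) / (max_rating - 1)   # rescale each rating to 0-1
        out[col + "_rel"] = scaled - overall         # express as a difference from Overall Quality
    return out
```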
From the original dataset, 21 features were included or engineered. OpenStreetMap POIs were then selected via stepwise addition to maximize accuracy in a CatBoost model, with the top 15 POIs included as features in the final dataset. The final dataset had ~2500 properties and 36 features, with no feature having VIF > 5.
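The VIF check can be done with statsmodels as in the sketch below, where `X` stands for the final numeric feature matrix.

```python
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

def vif_table(X: pd.DataFrame) -> pd.Series:
    Xc = add_constant(X)  # include an intercept column so the VIFs are not artificially inflated
    vifs = pd.Series(
        [variance_inflation_factor(Xc.values, i) for i in range(Xc.shape[1])],
        index=Xc.columns,
    )
    return vifs.drop("const").sort_values(ascending=False)

# e.g. assert vif_table(X).max() <= 5
```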
Model Tuning and Selection
We trained 5 different models in order to compare performance and behavior:
Lasso regression
- Straightforward linear model with intuitive feature importance output
- Useful for removing poorly predictive features
CatBoost (gradient boosting)
- Can capture nonlinear relationships between features and the target
- Easily interpretable and provides feature importance output
Support vector regression
- Well-suited to small datasets with high dimensionality
- Able to handle both linear and nonlinear relationships
- Less interpretable, no feature importance output for nonlinear kernels
- We trained 2 support vector regressors, one with a linear kernel and one with a Gaussian (RBF) kernel
Stacking ensemble
- A penalized linear model determines the optimal weights of the 4 sub-models' predictions in a weighted average
Data was divided using a 75-25 train-test split.
Hyperparameters were tuned by gridsearch with shuffled 5-fold cross validation on the train set.
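The sketch below shows this setup for the Gaussian-kernel SVR; the parameter grid is illustrative rather than the exact grid we searched, and `X`/`y` are the final feature matrix and sale prices.

```python
from sklearn.model_selection import GridSearchCV, KFold, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

cv = KFold(n_splits=5, shuffle=True, random_state=42)        # shuffled 5-fold cross validation
pipe = make_pipeline(StandardScaler(), SVR(kernel="rbf"))    # scale features before the SVR
grid = GridSearchCV(
    pipe,
    param_grid={
        "svr__C": [1, 10, 100],
        "svr__gamma": ["scale", 0.01, 0.1],
        "svr__epsilon": [0.01, 0.1, 1],
    },
    cv=cv,
    scoring="r2",
)
grid.fit(X_train, y_train)
print(grid.best_params_, grid.score(X_test, y_test))         # test-set R²
```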
Other tree-based methods were explored; CatBoost was used because it outperformed random forest and XGBoost on this dataset.
Finally, a stacking ensemble model was trained on the outputs of the other 4 models. Ridge regression (with tuning by gridsearch and 5-fold cross validation) was used to find the optimal weight for each sub-model’s output within the ensemble.
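Conceptually this matches scikit-learn's StackingRegressor, sketched below with untuned stand-ins for the four sub-models and RidgeCV playing the role of the gridsearch-tuned ridge meta-learner.

```python
from catboost import CatBoostRegressor
from sklearn.ensemble import StackingRegressor
from sklearn.linear_model import LassoCV, RidgeCV
from sklearn.svm import SVR

stack = StackingRegressor(
    estimators=[
        ("lasso", LassoCV()),
        ("catboost", CatBoostRegressor(verbose=0)),
        ("svr_linear", SVR(kernel="linear")),
        ("svr_rbf", SVR(kernel="rbf")),
    ],
    final_estimator=RidgeCV(alphas=[0.1, 1.0, 10.0]),  # ridge learns each sub-model's weight
    cv=5,  # the meta-learner is fit on out-of-fold sub-model predictions
)
stack.fit(X_train, y_train)
print(stack.score(X_test, y_test))  # test-set R²
```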
Test set R-squared scores for the 5 models are shown below.
| Model | Test Set R² (%) |
| --- | --- |
| Lasso | 93.4 |
| Linear SVR | 93.5 |
| CatBoost | 94.1 |
| Gaussian SVR | 94.5 |
| Stacking ensemble | 95.1 |
CatBoost and Gaussian SVR gave slightly higher accuracy than Lasso and linear SVR, suggesting that there may be nonlinear relationships between some of the features and sale price that the linear models could not capture as well. Though its accuracy was comparable to CatBoost’s, Gaussian SVR was chosen as the final model for the house-flipping tool, as its predictions were more readily interpretable.
As depicted above, Gaussian SVR (blue) predicted smooth changes in sale price given changes in features, while CatBoost (green) predicted large instantaneous jumps in sale price at certain thresholds of many features. While the ensemble model outperformed all 4 sub-models, its predictions also displayed the non-smooth behavior inherited from CatBoost, though to a lesser degree.
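This behavior can be reproduced by sweeping a single feature for one house and re-predicting with each model, roughly as sketched below; `svr_model`, `catboost_model`, the selected `house`, and the feature name are all placeholders.

```python
import numpy as np
import pandas as pd
import plotly.express as px

def sweep_feature(model, house: pd.Series, feature: str, values: np.ndarray) -> pd.Series:
    """Predict sale price for copies of one house with a single feature varied."""
    grid = pd.DataFrame([house] * len(values))
    grid[feature] = values
    return pd.Series(model.predict(grid), index=values)

values = np.linspace(0, 1500, 200)  # e.g. high-quality finished basement sqft
preds = pd.DataFrame({
    "Gaussian SVR": sweep_feature(svr_model, house, "BsmtFinSF_HighQual", values),
    "CatBoost": sweep_feature(catboost_model, house, "BsmtFinSF_HighQual", values),
})
px.line(preds, labels={"index": "High-quality basement sqft", "value": "Predicted sale price"}).show()
```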
Dashboard Application
The dashboard application is written in Python using Plotly Dash and GeoPandas. It has 3 main pages:
- Filtering and Mapping
- House Flipping Tool
- Feature Exploration
The pages of the app follow a general 3-step approach to house flipping:
- Find a suitable candidate house
- Consider changes to make
- Understand how those changes will influence return
The application aims to reduce bias from the individual in predicting the potential value of the updated home and facilitate better-informed decisions about whether to invest.
Page 1: Filtering and Mapping
This page gives an overview of the housing market in Ames, Iowa. It includes a map of the entire area of Ames, with the option to display the homes and various OpenStreetMap POIs concurrently.
Using the POI overlays, the user is able to visualize the makeup of different neighborhoods and parts of the city. Below, we can see that parks and recreational areas (green tree icons) in Ames are distributed mostly in the northern and western areas of the city and seem to be concentrated around the residential areas.
The user can filter the dataset using the sliders. All the houses that meet the filter criteria are shown on the map and listed in the dataframe below with more specific information. The map and dataframe update dynamically. The current filters are limited to price, bedrooms, bathrooms, and total square footage, but could be easily modified to include any of the other features.
Users can select a specific home to export to the house-flipping tool by clicking the bubble on the left of its row in the dataframe.
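A condensed sketch of the page's filter-and-map callback pattern is shown below; the component ids, slider ranges, and column names are illustrative, and `houses` stands for the cleaned, geocoded dataset.

```python
import plotly.express as px
from dash import Dash, Input, Output, dash_table, dcc, html

app = Dash(__name__)
app.layout = html.Div([
    dcc.RangeSlider(id="price-range", min=0, max=700_000, step=10_000, value=[100_000, 300_000]),
    dcc.Graph(id="house-map"),
    dash_table.DataTable(
        id="house-table",
        columns=[{"name": c, "id": c} for c in ["Address", "SalePrice", "BedroomAbvGr", "FullBath"]],
        row_selectable="single",  # exports the chosen house to the flipping tool
    ),
])

@app.callback(
    Output("house-map", "figure"),
    Output("house-table", "data"),
    Input("price-range", "value"),
)
def filter_houses(price_range):
    low, high = price_range
    filtered = houses[houses["SalePrice"].between(low, high)]
    fig = px.scatter_mapbox(
        filtered, lat="lat", lon="lon", hover_name="Address",
        color="Neighborhood", zoom=11, mapbox_style="open-street-map",
    )
    return fig, filtered.to_dict("records")

if __name__ == "__main__":
    app.run(debug=True)
```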
Page 2: House-flipping Tool
This page allows users to consider various changes they might make to the home and evaluate the predicted returns for different changes.
The table on the left displays the attributes of the unmodified “current” home, with its actual sale price at the top.
In the center of the page are sliders and dropdowns for various features that the user can virtually modify. The virtually modifiable features are limited to changes that the user can reasonably make to the home, such as qualities/finishes, square footage, number of beds/baths/half-baths, or adding a pool. It is important to note that some aspects of the home, such as lot size or building type, are not easily changeable. While these features can be helpful for some forms of residential flipping (e.g. merging multiple parcels of land), this project focused on flipping single family homes.
The table on the right represents the modified “future” home, with its predicted price at the top. The features listed in the table update dynamically as the user moves the sliders/dropdowns in the middle section. The predicted price also updates dynamically as the modified house is passed to the Gaussian SVR model described previously. This allows the user to better understand how much value their selected changes might add to the house.
In the example image above, the home's predicted value has increased by ~$25k as a result of converting 403 square feet of low-quality basement into high-quality finished basement and improving the overall condition of the house from 0.33 to 0.5.
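Under the hood, the dynamic prediction is a callback along the lines of the sketch below; `svr_model`, `FEATURE_COLS`, and the component ids are placeholders rather than the app's actual names.

```python
import pandas as pd
from dash import Input, Output, State, callback

@callback(
    Output("predicted-price", "children"),
    Input("overall-cond-slider", "value"),
    Input("bsmt-highqual-sqft-slider", "value"),
    State("selected-house", "data"),  # house exported from page 1, stored as a dict
)
def predict_flipped_price(overall_cond, bsmt_sqft, selected_house):
    house = pd.DataFrame([selected_house])
    # Overwrite the modifiable features with the user's slider values.
    house["OverallCond_rel"] = overall_cond
    house["BsmtFinSF_HighQual"] = bsmt_sqft
    price = svr_model.predict(house[FEATURE_COLS])[0]  # trained Gaussian SVR
    return f"Predicted sale price: ${price:,.0f}"
```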
This tool does not account for the cost of making the modifications selected by the user; it only predicts the value of the home after modification. Users must employ their domain knowledge on the costs of renovation to evaluate profitability.
Page 3: Feature Exploration
To ensure that the machine learning algorithms used by the application are not simply a black box for users, we included a page for exploration of how different features impact value in the model’s predictions. This page can be used to:
- Understand changes in predicted value given a change in features
- Decide which features should be changed to maximize value
- Evaluate how much to change those features
- Determine whether improving a given feature has diminishing returns
Scatterplot
All houses meeting the filter criteria from page 1 are shown in a scatterplot and in the dataframe at the bottom of the page.
The scatterplot shows GrLivArea (above-ground living area in square feet) vs sale price, colored by neighborhood, with more information for each house shown on mouse-over.
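The scatterplot itself is a standard Plotly Express figure, roughly as below; `filtered` is the page-1-filtered DataFrame and the hover columns are illustrative.

```python
import plotly.express as px

fig = px.scatter(
    filtered,
    x="GrLivArea", y="SalePrice",
    color="Neighborhood",
    hover_data=["YearBuilt", "BedroomAbvGr", "FullBath"],  # shown on mouse-over
    labels={"GrLivArea": "Above-ground living area (sqft)", "SalePrice": "Sale price ($)"},
)
fig.show()
```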
Price vs Quality
Clicking on a house in the scatterplot initializes 2 bivariate EDA graphs for that house. These EDA graphs allow visualization of the effects of any continuous feature on sale price, overlaid with any categorical feature. The enlarged circles show the house’s current attributes. Users can also view predictions from the 5 different algorithms we trained and/or compare the predictions from multiple algorithms. X denotes the ensemble model.
The graph above shows Gaussian SVR predictions of sale price vs high-quality basement area, overlaid with masonry veneer type. The enlarged circle shows that the selected house currently has a stone masonry veneer and ~2160 square feet of high-quality basement. For this house, the model predicts that stone is the most valuable masonry veneer type, followed by brick and finally by no masonry veneer. Further, it predicts a linear relationship between sale price and high-quality basement area without diminishing returns.
These graphs allow users to explore the feature set interactively and generate their own insights on what modifications might be best to make to a candidate house.
Conclusions and Future Directions
Major takeaways of this project include:
- Many features in this housing dataset (and likely others) are highly interdependent and multicollinear.
- Understanding the distribution of features across the dataset is useful for identifying trends in the real estate market, but allowing users to investigate the impact of feature changes on specific individual homes provides more actionable insight for house-flipping.
- House sale prices can be modeled accurately using a relatively small and linearly independent set of features.
- The dashboard application allows visualization of a given feature’s effect on sale price with everything else held constant to best provide insights about which renovations add most value to the home.
Future directions include:
- Improving the interface and user instructions in the application.
- Obtaining a larger dataset that spans a wider geographical area.
- Incorporating simple models/estimates of renovation cost for the modifiable features so that the application can directly generate insights on profitability.
Thank you for reading about our work! If you are interested in our other projects, please check out our author pages.