A Dashboard Application for Data-driven House Flipping

Try the app here | GitHub repo | LinkedIn: Moritz | Hao-Wei | Matthew | Oren

Background and Business Objectives


U.S. house prices skyrocketed in 2021, with April-July 2021 marking four consecutive months of record-high year-on-year home value appreciation. In such a competitive real estate market, commercial real estate developers need to quickly identify which opportunities to pursue.

This includes understanding:

  • How features contribute to value and what features should be included in a home to achieve a target sale price
  • Which houses offer high-return renovation or flipping potential (e.g. by adding square footage, improving house quality, finishing the basement, etc.)

With this project, we built a dashboard application to help real estate developers evaluate properties. The app allows virtual modification of a given house to assess its potential as a flipping candidate, utilizing machine learning algorithms to predict the house’s sale price after renovation. 

 

Data Sources


For this project, we investigated house sales in the city of Ames, Iowa, a college town home to Iowa State University. The dataset used in this project was collected by the City of Ames Assessor and contains house features and sale prices from 2006 to 2010. It contains ~2600 houses with 54 features, including:

  • Square footages of many parts of the house
  • Condition and quality ratings
  • Types of materials/finish
  • Miscellaneous features such as pools, fireplaces, etc

Though the dataset includes each house’s neighborhood, we integrated more granular geospatial features from OpenStreetMap. Addresses for all properties were converted to latitude-longitude coordinates using geopy and the Google Maps API. We also obtained coordinates for 131 classes of OpenStreetMap points of interest (POIs) in Ames, spanning a wide variety of natural, cultural, and infrastructural features, and then counted the POIs of each class within a 1-mile radius of each property.

In the image above, the house (blue pin) has five church POIs (green icons) within a 1-mile radius; this house is assigned a “Churches” feature of 5.
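The radius-count features can be sketched with a plain haversine distance check. This is a minimal illustration; the function and variable names are ours, not the project's code:

```python
import math

def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two lat/lon points."""
    r = 3958.8  # mean Earth radius in miles
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def count_pois_within_radius(house, pois, radius_miles=1.0):
    """Count POIs (list of (lat, lon) tuples) within the radius of a house (lat, lon)."""
    return sum(
        1 for lat, lon in pois
        if haversine_miles(house[0], house[1], lat, lon) <= radius_miles
    )
```

Running this once per POI class per property produces the 131 candidate count features described above.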

 

Data Cleaning


Houses sold under abnormal conditions (foreclosure, auction, etc.) or sold between family members were excluded due to heavily skewed sale prices. These sales were not representative of the sale prices of properties with comparable features.

Houses with 0 bedrooms above ground were also excluded. Many of these houses had [total square footage = basement square footage = first floor square footage], leading us to believe they were basements being sold as separate units.

Finally, properties with non-residential zoning classifications were excluded, leaving a final dataset of ~2500 houses. Briefly, other data cleaning included handling missing values, string processing, and combining similar values of some categorical features.
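The exclusion rules above reduce to a few pandas masks. A minimal sketch, assuming the Ames dataset's standard column names (`SaleCondition`, `BedroomAbvGr`, `MSZoning`):

```python
import pandas as pd

def clean_sales(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the exclusion rules described above (Ames column names assumed)."""
    # Keep only normal sale conditions: drops foreclosures, auctions,
    # and between-family sales with skewed prices.
    df = df[df["SaleCondition"] == "Normal"]
    # Drop 0-bedroom records that appear to be basements sold separately.
    df = df[df["BedroomAbvGr"] > 0]
    # Keep residential zoning classifications only (assumed code set).
    residential = {"RL", "RM", "RH", "FV", "RP"}
    df = df[df["MSZoning"].isin(residential)]
    return df.reset_index(drop=True)
```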

 

Feature Selection and Engineering


Our initial approach was to begin with the full set of 130+ features and to use Lasso regression and CatBoost, a gradient boosting model, to remove unimportant features. We found this method unreliable due to the strong multicollinearity between many features in the dataset.

For example, Lasso regression might assign a high coefficient to “garage area”, suggesting that garage area has an important contribution to home value. However, removal of garage area from the model would cause negligible change in accuracy and the model would simply assign a high coefficient to “garage # of cars” instead. There were similar cases to this in many categories of the house features.

With this in mind, we changed our approach and built a feature set from the ground up with 3 primary considerations:

  • Inclusion of core features traditionally used in real estate and property development (e.g. total square footage; number of bedrooms, baths, and half baths; house age). Some of these could be dropped with minimal accuracy loss, but we anticipate that dashboard users will expect to see and use this information.
  • A target accuracy of 94%, corresponding to a mean error of ~$10k. We felt this was an acceptable error for data with an interquartile range of $130k to $211k.
  • Minimization of multicollinearity as measured by variance inflation factor (VIF).
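VIF can be computed directly from its definition: regress each feature on all the others and take 1 / (1 - R²). Values near 1 indicate an independent feature, while large values flag multicollinearity. A self-contained NumPy sketch:

```python
import numpy as np

def vif(X: np.ndarray) -> np.ndarray:
    """Variance inflation factor per column: 1 / (1 - R^2) when regressing
    that column on all the others (intercept included)."""
    n, k = X.shape
    out = np.empty(k)
    for j in range(k):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(n), others])  # design matrix with intercept
        beta, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ beta
        r2 = 1 - (resid @ resid) / ((y - y.mean()) ** 2).sum()
        out[j] = 1.0 / (1.0 - r2)
    return out
```

The "garage area" vs. "garage # of cars" pair discussed above is exactly the kind of redundancy this metric surfaces.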

Of particular note, the quality and condition ratings were generally strongly correlated with "Overall Quality". We scaled all rating features from 0 to 1 and expressed them as differences from Overall Quality, reflecting more granular information on how various parts of the home (the condition and quality of the kitchen or exterior, for example) independently influence sale price.
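A sketch of that rating transformation, assuming the Ames ordinal codes (Po/Fa/TA/Gd/Ex) and the 1-10 `OverallQual` column; the exact 0-1 mapping here is illustrative:

```python
import pandas as pd

# Ordinal quality codes in the Ames data, mapped to a 0-1 scale (assumed mapping).
QUALITY_SCALE = {"Po": 0.0, "Fa": 0.25, "TA": 0.5, "Gd": 0.75, "Ex": 1.0}

def rating_deltas(df, rating_cols, overall_col="OverallQual"):
    """Express each rating as its difference from scaled Overall Quality."""
    out = df.copy()
    overall = (out[overall_col] - 1) / 9.0  # rescale 1-10 to 0-1
    for col in rating_cols:
        scaled = out[col].map(QUALITY_SCALE)
        out[col + "_delta"] = scaled - overall
    return out
```

A delta of 0 then means "this part of the house matches its overall quality", which is what decorrelates the rating features from Overall Quality.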

From the original dataset, 21 features were included or engineered. OpenStreetMap POIs were then selected via stepwise addition to maximize accuracy in a CatBoost model, with the top 15 POIs included as features in the final dataset. The final dataset had ~2500 properties with 36 features.
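The stepwise POI selection can be sketched as greedy forward selection; here scikit-learn's `GradientBoostingRegressor` stands in for CatBoost so the example is self-contained:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score

def forward_select(X_base, poi_features, y, n_keep=15, cv=5):
    """Greedy stepwise addition: at each step, add the POI column that most
    improves cross-validated R^2. poi_features: {name: 1-D column array}."""
    selected, remaining = [], dict(poi_features)

    def score(extra_cols):
        X = np.column_stack([X_base] + extra_cols) if extra_cols else X_base
        model = GradientBoostingRegressor(random_state=0)
        return cross_val_score(model, X, y, cv=cv, scoring="r2").mean()

    best = score([])
    while remaining and len(selected) < n_keep:
        trials = {name: score([poi_features[m] for m in selected] + [col])
                  for name, col in remaining.items()}
        name = max(trials, key=trials.get)
        if trials[name] <= best:
            break  # no remaining POI improves accuracy
        best = trials[name]
        selected.append(name)
        del remaining[name]
    return selected
```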

 

Model Tuning and Selection


We trained five models to compare performance and behavior:

  • Lasso regression
  • CatBoost
  • Support vector regression (SVR) with linear kernel
  • SVR with gaussian kernel
  • Stacking ensemble model

Hyperparameters were tuned by grid search with shuffled 5-fold cross-validation.
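A representative tuning setup, shown here for the Lasso model (the alpha grid is illustrative, not the grid we actually searched):

```python
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Shuffled 5-fold CV, as used for all models.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
search = GridSearchCV(
    make_pipeline(StandardScaler(), Lasso(max_iter=10000)),
    param_grid={"lasso__alpha": [0.001, 0.01, 0.1, 1.0]},
    cv=cv,
    scoring="r2",
)
# search.fit(X, y); then inspect search.best_params_ and search.best_score_
```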

Other tree-based methods were explored; CatBoost was used because it outperformed random forest and XGBoost on this dataset.

Gaussian SVR was approximated using scikit-learn’s KernelRidge model, which replaces SVR’s epsilon-insensitive loss with a smooth squared (ridge) loss. This admits a closed-form solution, allowing significantly faster fitting than SVR with very similar predictions. Gaussian kernel ridge was chosen over true Gaussian SVR because the project’s time constraints necessitated shorter training times.
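The two models can be set up side by side in scikit-learn (hyperparameter values here are illustrative):

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge
from sklearn.svm import SVR

# Both use a Gaussian (RBF) kernel; KernelRidge swaps SVR's
# epsilon-insensitive loss for a squared loss, which has a closed-form
# solution and fits much faster while predicting very similarly.
krr = KernelRidge(kernel="rbf", alpha=0.01, gamma=1.0)
svr = SVR(kernel="rbf", C=100.0, gamma=1.0, epsilon=0.01)
```

Fitting both on the same data and comparing predictions confirms the near-identical behavior that justified the substitution.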

Finally, a stacking ensemble model was trained on the outputs of the other 4 models. Ridge regression (with tuning by gridsearch and 5-fold cross validation) was used to find the optimal weight for each sub-model’s output within the ensemble.
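The ensemble can be sketched with scikit-learn's `StackingRegressor`, which fits the sub-models on cross-validated folds and lets a ridge meta-model weight their out-of-fold predictions (CatBoost again swapped for sklearn's gradient boosting to keep the sketch self-contained):

```python
from sklearn.ensemble import GradientBoostingRegressor, StackingRegressor
from sklearn.kernel_ridge import KernelRidge
from sklearn.linear_model import Lasso, RidgeCV
from sklearn.svm import SVR

# Stand-ins for the four sub-models; RidgeCV learns each one's weight
# from their out-of-fold predictions.
stack = StackingRegressor(
    estimators=[
        ("lasso", Lasso(alpha=0.01, max_iter=10000)),
        ("gbr", GradientBoostingRegressor(random_state=0)),
        ("linear_svr", SVR(kernel="linear")),
        ("gaussian_krr", KernelRidge(kernel="rbf", alpha=0.1)),
    ],
    final_estimator=RidgeCV(),
    cv=5,
)
```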

Cross-validation R-squared scores for the 5 models are shown below.

Model               | Out-of-sample R² (%)
--------------------|---------------------
Lasso               | 93.4
CatBoost            | 94.1
Linear SVR          | 93.5
Gaussian SVR        | 94.5
Stacking ensemble   | 95.1

 

CatBoost and Gaussian SVR gave slightly higher accuracy than Lasso and linear SVR, suggesting that there may be nonlinear relationships between some of the features and house sale price that the linear models could not capture as well. Though its accuracy was comparable to CatBoost’s, Gaussian SVR was chosen as the final model for the house-flipping tool, as its predictions were more readily interpretable.

As depicted above, Gaussian SVR (blue) predicted smooth changes in sale price given changes in features, while CatBoost (green) predicted large instantaneous jumps in sale price at certain thresholds of many features. The ensemble model outperformed all four of the original models, but its predictions also displayed CatBoost’s nonsmooth behavior, though to a lesser degree.

 

Dashboard Application


The dashboard application is written in Python using Plotly Dash and GeoPandas. It has 3 main pages:

  • Filtering and Mapping
  • House Flipping Tool
  • Feature Exploration

The pages of the app follow a general 3-step approach to house flipping:

  • Find a suitable candidate house
  • Consider changes to make
  • Understand how those changes will influence return

The application aims to reduce bias from the individual in predicting the potential value of the updated home and facilitate better-informed decisions about whether to invest. 

 

Page 1: Filtering and Mapping


This page gives an overview of the housing market in Ames, Iowa. It includes a map of the entire area of Ames, with the option to display the homes and various OpenStreetMap POIs concurrently. 

Using the POI overlays, the user is able to visualize the makeup and characteristics of different neighborhoods and parts of the city. Below, we can see that parks and recreational areas (green tree icons) in Ames are distributed mostly in the northern and western areas of the city and seem to be concentrated around the residential areas.

The user can filter the dataset using the sliders. All the houses that meet the filter criteria are shown on the map and listed in the dataframe below the map with more specific information. The map and dataframe update dynamically. The current filters are limited to price, bedrooms, bathrooms, and total square footage, but could be easily modified to include any of the other features. 
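Behind the sliders, the filtering logic reduces to range masks over the dataframe. A sketch with assumed column names:

```python
import pandas as pd

def filter_listings(df, price=(0, 1e9), beds=(0, 99), baths=(0, 99), sqft=(0, 1e9)):
    """Return the houses matching the page-1 slider ranges (column names assumed)."""
    mask = (
        df["SalePrice"].between(*price)
        & df["Bedrooms"].between(*beds)
        & df["Bathrooms"].between(*baths)
        & df["TotalSF"].between(*sqft)
    )
    return df[mask]
```

Extending the filters to any other feature is a one-line addition to the mask, which is why the current price/bed/bath/square-footage set could be easily expanded.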

Users can select a specific home to export to the house-flipping tool by clicking the bubble on the left of its row in the dataframe.

 

Page 2: House-flipping Tool


This page allows users to consider various changes they might make to the home and evaluate the predicted returns for different changes.

The table on the left displays attributes of the unmodified “current” home, with its actual sale price at the top.

In the center of the page are sliders and dropdowns for various features that the user can virtually modify. The modifiable features are limited to changes that the user can reasonably make to the home, such as quality/finish levels, square footage, the number of beds/baths/half-baths, or adding a pool. It is important to note that some aspects of the home, such as lot size or building type, are not easily changeable. While such features can be relevant for some forms of residential flipping (e.g. merging multiple parcels of land), this project focused on flipping single-family homes.

The table on the right represents the modified “future” home, with its predicted price at the top. The features listed in the table update dynamically as the user moves the sliders/dropdowns in the middle section. The predicted price also updates dynamically as the modified house is passed to the Gaussian SVR model described previously. This allows the user to better understand how much value might be added to the house by their selected changes.
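The core of this step is simple: copy the selected house's feature row, overwrite the features the user changed, and rerun the trained model. A minimal sketch (the function name and column handling are ours, not the app's actual code):

```python
import pandas as pd

def predict_flip(model, house: pd.Series, changes: dict) -> float:
    """Predict the post-renovation price: copy the house's feature row,
    apply the user's virtual modifications, and rerun the trained model."""
    modified = house.copy()
    for feature, value in changes.items():
        modified[feature] = value
    return float(model.predict(modified.to_frame().T)[0])
```

In the app, this function would be wrapped in a Dash callback so the predicted price refreshes on every slider movement.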

In the example image above, the home's predicted value has increased by ~$25k as a result of upgrading 403 square feet of low-quality basement to a high-quality finish and improving the overall condition of the house from 0.33 to 0.5.

This tool does not account for the cost of making the modifications selected by the user; it only predicts the value of the home after modification. Users must employ their domain knowledge on the costs of renovation to evaluate profitability.

 

Page 3: Feature Exploration


To ensure that the machine learning algorithms used by the application are not simply a black box for users, we included a page for exploration of how different features impact value in the model’s predictions. This page can be used to:

  • Understand the resultant changes in predicted value given a change in features
  • Decide which features should be changed to maximize value
  • Evaluate how much to change those features
  • Determine whether improving a given feature has diminishing returns

All houses meeting the filter criteria from page 1 are shown in a scatterplot and in the dataframe at the bottom of the page.

The scatterplot shows GrLivArea (above-grade living area in square feet) vs. sale price, colored by neighborhood, with more information for each house shown on mouse-over.

Clicking on a house in the scatterplot initializes 2 bivariate EDA graphs for that house. These EDA graphs allow visualization of any continuous feature vs sale price, overlaid with any categorical feature. The enlarged circles show the house’s current attributes. Users can also view predictions from the 5 different algorithms we trained and/or compare the predictions from multiple algorithms. X denotes the ensemble model.

The graph above shows Gaussian SVR predictions of sale price vs. high-quality basement area, overlaid with masonry veneer type. The enlarged circle shows that the selected house currently has a stone masonry veneer and about 2160 square feet of high-quality basement. For this house, the model predicts that stone is the most valuable masonry veneer type, followed by brick and finally by no masonry veneer. Further, it predicts a linear relationship between sale price and high-quality basement area, without diminishing returns.
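Each such curve is generated by sweeping a single feature across a range of values while holding the house's other features fixed, a partial-dependence-style computation. A sketch (names are illustrative):

```python
import pandas as pd

def prediction_curve(model, house: pd.Series, feature: str, values) -> pd.DataFrame:
    """Sweep one feature across a range of values, holding the rest of the
    house fixed, and record the model's predicted sale price at each point."""
    rows = []
    for v in values:
        h = house.copy()
        h[feature] = v
        rows.append(h)
    grid = pd.DataFrame(rows)
    return pd.DataFrame({feature: list(values),
                         "predicted_price": model.predict(grid)})
```

Because the engineered feature set has minimal multicollinearity, holding "everything else constant" in this sweep is a reasonable approximation rather than an extrapolation into implausible feature combinations.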

These graphs allow users to explore the feature set interactively and generate their own insights on what modifications might be best to make to a candidate house.

 

Conclusions and Future Directions


Major takeaways of this project include:

  • Many features in this housing dataset (and likely others) are highly interdependent and multicollinear. For example, larger homes often also have better finishes, more amenities, etc.
  • House sale prices can be modeled accurately using a relatively small set of features.
  • The dashboard application utilizes a feature set engineered to have minimal multicollinearity, allowing visualization of a given feature’s effect on sale price with everything else held constant.
  • Understanding the overall distribution of features across the dataset is useful for identifying trends in the real estate market, but allowing users to investigate the impact of feature changes on specific individual homes provides more actionable insight for house-flipping.

Future directions include:

  • Improving the interface and user instructions in the application.
  • Obtaining a larger dataset that spans a wider geographical area.
  • Finding data and incorporating simple models/estimates of renovation cost for the modifiable features so that the application can directly generate insights on profitability.

Thank you for reading about our work! If you are interested in our other projects, please check out our author pages.

About Authors

Matthew Fay

Matthew is a Data Science Fellow with a BS in chemistry from UNC-Chapel Hill. After 3 years studying towards a dual MD-PhD and researching antibody engineering, he pivoted to pursue data science and analytics. He has a passion...

Moritz Becker

Strategy Consultant, with a passion for creating impact from data-driven business insights. Originally from Germany, I have been working in the US as an Engagement Manager in Strategy Consulting for over 3 years. My projects at work focus...

Hao-Wei

Hao-Wei is an NYC Data Science Academy Fellow with master's degrees in Communication Engineering and Mathematics from National Taiwan University, and a Ph.D. in Mathematics from the Pennsylvania State University. With a broad experience ranging from...
