How to increase chances of success when flipping homes?
Introduction
In this project, I am an Iowa-born real estate flipper, and I'm looking for a data-driven model to maximize the return on my investment. To achieve that goal, I need to answer these research questions:
- How can we identify undervalued homes using data?
- What are the important levers for sale prices, and are any easier/cheaper to optimize?
To answer these essential questions, the following high-level methodology was executed as shown in the graph below:
Data Collection
For this project, Kaggle's Ames Housing Dataset was used as the data source. This dataset contains 2579 observations and 81 descriptive columns, one of which was "SalePrice," the response variable for downstream analysis.
Since the data was open-source, it wasn't necessary to personally collect and structure the results, which saved a significant amount of time and energy. However, it was still necessary to preprocess the data prior to modeling.
Data Preparation
In order to clean and mold the data for modeling, the following data preparation methodology was deployed (Figure 2). In the next few sections, I will elaborate on the individual data preparation steps.
Missing Data Exploration/Imputation
Within the dataset, there were 16 categorical columns with missing data, as shown in the following barplot:
After reviewing the Ames data dictionary, it was apparent that the null values in 15 of the 16 columns were not truly missing values but actually meaningful. For example, NA values in the "Alley" column mean that those homes don't have alley access. Therefore, all the null values for these 15 categorical columns were imputed with their true meaning gathered from the data dictionary.
The only categorical column whose null values did not signify a particular condition of the house was "Electrical." For this column, its one NA value was replaced with the mode ("SBrkr"), which had a 91.7% frequency.
In addition to the categorical columns, there were 10 numerical columns with missing data, as shown in the following barplot:
For the numerical columns, the null values were imputed with that column's mean unless other columns indicated that the observation did not have that feature at all. For example, if a home did not have a garage, I imputed 0 for the null value in the "GarageArea" column.
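As a rough illustration, the sketch below shows this imputation logic in pandas; the file name and the exact subset of columns with "meaningful" NAs are assumptions for demonstration purposes.

```python
import pandas as pd

df = pd.read_csv("ames_housing.csv")  # hypothetical file name

# Illustrative subset of the 15 categorical columns whose NAs carry meaning
meaningful_na_labels = {
    "Alley": "NoAlley",
    "BsmtQual": "NoBasement",
    "FireplaceQu": "NoFireplace",
    "GarageType": "NoGarage",
    "PoolQC": "NoPool",
}
for col, label in meaningful_na_labels.items():
    df[col] = df[col].fillna(label)

# "Electrical" is the one truly missing categorical value: impute with the mode
df["Electrical"] = df["Electrical"].fillna(df["Electrical"].mode()[0])

# Numerical columns: 0 when the feature is absent, otherwise the column mean
df.loc[df["GarageType"].eq("NoGarage"), "GarageArea"] = 0
df["GarageArea"] = df["GarageArea"].fillna(df["GarageArea"].mean())
df["LotFrontage"] = df["LotFrontage"].fillna(df["LotFrontage"].mean())
```

After dealing with the dataset's missingness, I began to work on feature engineering.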
Feature Engineering
During the feature engineering process, the following 9 new columns were created (a sketch of their derivations follows the list):
- "TotalSF" = total square footage
- "HighQualFinSF" = high quality square footage
- "BsmtFinSF" = finished basement square footage
- "TotalFullBath" = total full bathrooms
- "TotalHalfBath" = total half bathrooms
- "TotalPorch_DeckSF" = total outdoor deck/porch area
- "YearsBetweenBuildvsSold" = duration between construction year and selling year
- "YearsBetweenRemodvsSold" = duration between remodeling year and selling year
- "YearsBetweenBuildvsRemod" = duration between construction year and remodeling year
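A minimal sketch of these derivations is shown below, assuming the Kaggle-style Ames column names (e.g. TotalBsmtSF, GrLivArea, WoodDeckSF); the exact formulas are illustrative rather than the definitive definitions used in the project.

```python
# Engineered columns built from existing Ames features
df["TotalSF"] = df["TotalBsmtSF"] + df["1stFlrSF"] + df["2ndFlrSF"]
df["HighQualFinSF"] = df["GrLivArea"] - df["LowQualFinSF"]
df["BsmtFinSF"] = df["BsmtFinSF1"] + df["BsmtFinSF2"]
df["TotalFullBath"] = df["FullBath"] + df["BsmtFullBath"]
df["TotalHalfBath"] = df["HalfBath"] + df["BsmtHalfBath"]
df["TotalPorch_DeckSF"] = (
    df["WoodDeckSF"] + df["OpenPorchSF"] + df["EnclosedPorch"]
    + df["3SsnPorch"] + df["ScreenPorch"]
)
df["YearsBetweenBuildvsSold"] = df["YrSold"] - df["YearBuilt"]
df["YearsBetweenRemodvsSold"] = df["YrSold"] - df["YearRemodAdd"]
df["YearsBetweenBuildvsRemod"] = df["YearRemodAdd"] - df["YearBuilt"]
```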
Assess Multicollinearity
Little or no multicollinearity is one of the fundamental assumptions of Multiple Linear Regression; therefore, I needed to assess whether it was present in this dataset. A correlation heatmap of the table's numerical columns shows clear signs of multicollinearity (Figure 5.)
In fact, after looping over and comparing each numerical column against every other, a number of column pairs had Pearson correlation coefficients greater than 0.70 (Figure 6.) Many of these pairs involve the feature-engineered columns, which makes sense: those columns were constructed from other columns in the table, so they will naturally be related. A sketch of this pairwise check is shown below.
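The 0.70 threshold comes from the analysis above; everything else in this sketch is illustrative.

```python
import itertools

# Pairwise Pearson correlations between numerical columns
num_cols = df.select_dtypes(include="number").columns
corr = df[num_cols].corr(method="pearson")

# Flag pairs whose absolute correlation exceeds 0.70
high_corr_pairs = [
    (a, b, round(corr.loc[a, b], 2))
    for a, b in itertools.combinations(num_cols, 2)
    if abs(corr.loc[a, b]) > 0.70
]
print(high_corr_pairs)
```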
The presence of multicollinearity can have a significant impact on the stability of a future Multiple Linear Regression model; therefore, it is necessary to take the following measures:
- Feature selection or dimensionality reduction
- Assess models that are robust to multicollinearity (penalized linear regressions, tree-based models)
Outlier Removal
A common way to measure outliers, sketched in code after this list, is to identify data points that are either:
- Greater than Q3 + (1.5 * IQR)
- Less than Q1 - (1.5 * IQR)
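For reference, a minimal sketch of this IQR rule applied to SalePrice (any column could be substituted):

```python
# Tukey's IQR rule: flag points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = df["SalePrice"].quantile([0.25, 0.75])
iqr = q3 - q1
outlier_mask = (df["SalePrice"] < q1 - 1.5 * iqr) | (df["SalePrice"] > q3 + 1.5 * iqr)
print(f"{outlier_mask.sum()} observations would be flagged as outliers")
```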
However, I decided not to use this technique to remove outliers for the following reasons:
- A large volume of data would be removed, and data is not cheap (Figure 8.)
- Although it might improve the accuracy of the model, that advantage would be offset by the loss of opportunities to predict the sale price for high-valued properties.
- This decision could also be reassessed after modeling and analyzing the residual values for these observations.
During the Exploratory Data Analysis (EDA) phase, I observed a data point that did not fall within the expected trend of SalePrice vs. various features, including TotalSF (Figure 7.) To avert the problem of an outlier skewing the results, I removed this data point from the dataset.
Preprocessing
Finally, the last step of the data preparation phase was preprocessing, which is depicted in the process flow diagram below (Figure 9.) The process included the following steps:
- The dataset's categorical variables needed to be encoded according to the downstream model of choice. Those features were either dummified with the first column dropped, dummified without the first column dropped, or LabelEncoded.
- Then, each of those three iterations was either train-test split or kept intact as a whole for cross-validation. For the train-test split, the observations were divided with 80% assigned to the train set and the remaining 20% to the test set.
- Next, for the dummified, train-test split datasets, the columns' values were normalized using StandardScaler. This was done specifically for the penalized linear regressions because a lack of normalization could bias how features are regularized.
- The result of the process was four distinct data formats for future modeling (as shown in Table 1, with a code sketch after the table.)
| Data Format | Dummified with Drop | Dummified without Drop | StandardScaler | Label Encoded |
|-------------|---------------------|------------------------|----------------|---------------|
| 1           | YES                 | NO                     | NO             | NO            |
| 2           | YES                 | NO                     | YES            | NO            |
| 3           | NO                  | YES                    | YES            | NO            |
| 4           | NO                  | NO                     | NO             | YES           |
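The sketch below illustrates how these formats could be produced with pandas and sklearn; the random_state and the choice to scale only the dummified split are assumptions consistent with the description above.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder

X = df.drop(columns="SalePrice")
y = df["SalePrice"]
cat_cols = X.select_dtypes(include="object").columns

# Formats 1 and 2: dummified with the first level dropped
X_drop = pd.get_dummies(X, columns=cat_cols, drop_first=True)
# Format 3: dummified without dropping the first level
X_full = pd.get_dummies(X, columns=cat_cols, drop_first=False)
# Format 4: label-encoded categoricals (for the tree-based models)
X_label = X.copy()
for col in cat_cols:
    X_label[col] = LabelEncoder().fit_transform(X_label[col].astype(str))

# 80/20 train-test split, then StandardScaler fit on the train set only
X_train, X_test, y_train, y_test = train_test_split(
    X_drop, y, test_size=0.2, random_state=42
)
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
```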
Along with multicollinearity, another key assumption of Multiple Linear Regression is normality, meaning that the response variable should be normally distributed. Plotting a histogram of SalePrice shows that it's clearly skewed to the right (Figure 10.) Applying a log transformation and then replotting the histogram makes it apparent that the response variable becomes more normally distributed (Figure 11.) In fact, after calculating the skewness and kurtosis before and after the log transformation, the skewness falls from 1.76 to 0.042, and the kurtosis falls from 5.418 to 1.393. Those results corroborate that the log transformation improves normality.
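This check can be reproduced in a few lines; scipy's skew and kurtosis are used here, and the exact numbers will depend on whether excess kurtosis is reported.

```python
import numpy as np
from scipy.stats import skew, kurtosis

raw = df["SalePrice"]
logged = np.log(raw)  # log transformation of the response variable

print(f"skewness: {skew(raw):.3f} -> {skew(logged):.3f}")
print(f"kurtosis: {kurtosis(raw):.3f} -> {kurtosis(logged):.3f}")
```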
Feature Selection and EDA
This dataset contains 88 independent variables, and it would be tedious to manually evaluate each feature's impact on SalePrice. Instead of going through them one by one, I investigated two ways to expedite the filtering of important features, both sketched in code after the list:
- SequentialFeatureSelector from sklearn
- Loop through each column, fit Ordinary Least Squares from statsmodels, extract R2 from each model, and filter features with highest R2.
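The sketch below outlines both approaches; the number of features to keep (20) and the use of the dummified matrix are illustrative choices, not the exact settings from the project.

```python
import numpy as np
import statsmodels.api as sm
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

y_log = np.log(df["SalePrice"])

# Approach 1: forward sequential selection of a feature subset
sfs = SequentialFeatureSelector(
    LinearRegression(), n_features_to_select=20, direction="forward"
)
sfs.fit(X_drop, y_log)
sfs_features = X_drop.columns[sfs.get_support()]

# Approach 2: one single-feature OLS per column, ranked by R2
r2_scores = {}
for col in X_drop.columns:
    model = sm.OLS(y_log, sm.add_constant(X_drop[col].astype(float))).fit()
    r2_scores[col] = model.rsquared
top_features = sorted(r2_scores, key=r2_scores.get, reverse=True)[:20]
```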
Although they produced essentially the same output, the methods are slightly different: SequentialFeatureSelector searches for the best combination of features, while the second approach ranks the best individual features.
Using either method, I was able to create a shortlist of important features. That shortlist could then be fed into a loop that generates a boxplot for impactful categorical features or a scatterplot for numerical ones. Daisy-chaining these feature selection techniques with this type of loop can significantly expedite the EDA process, especially when a data table contains many columns.
The output from this analysis shows that the categorical variables that had a large impact on SalePrice were the homes' quality metrics ("OverallQual", "BsmtQual", "ExterQual", "KitchenQual") and neighborhood (Figure 12.) With respect to the numerical variables, many of the surface area features had a significant impact on SalePrice as well (Figure 13.)
Model Building and Evaluation
During this phase of the project, various linear and tree-based models were evaluated, including:
- Multiple Linear Regression
- Lasso Penalized Linear Regression
- Ridge Penalized Linear Regression
- Random Forest
- Gradient Boosting
- XGBoost
Each model was first assessed with its default hyperparameters (a base model) to vet its potential applicability for this use case and avoid wasting time tuning an ineffective approach. Then, once a model proved to be a good starting point, its hyperparameters were tuned to improve accuracy. Finally, each model went through 5-fold cross-validation to compare its accuracy and robustness against the other algorithms. The model building and evaluation methodology is summarized in the process flow diagram below:
Besides Multiple Linear Regression, all models' hyperparameter tuning was performed using sklearn's GridSearchCV, an approach that's well established and widely used. For Ridge and Lasso, I also used sklearn's RidgeCV and LassoCV to confirm the GridSearchCV results and automatically tune the regularization strength without manual inputs. For the penalized linear regressions, alpha was the only hyperparameter to adjust. For the various tree-based models, I tuned their learning_rate, subsample, n_estimators, max_depth, max_features, min_samples_leaf, and/or min_samples_split. This tuning helped improve the performance of each model.
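As an illustration, a GridSearchCV sweep for XGBoost and a LassoCV fit could look like the sketch below; the grid values, cv settings, and the use of the dummified train split are hypothetical, not the exact configuration used in the project.

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.model_selection import GridSearchCV
from xgboost import XGBRegressor

# Hypothetical, narrowed grid purely for illustration
param_grid = {
    "n_estimators": [300, 500, 1000],
    "max_depth": [3, 4, 5],
    "learning_rate": [0.01, 0.05, 0.1],
    "subsample": [0.7, 1.0],
}
grid = GridSearchCV(
    XGBRegressor(random_state=42), param_grid, scoring="r2", cv=5, n_jobs=-1
)
grid.fit(X_train, np.log(y_train))  # log-transformed response, as above
print(grid.best_params_, grid.best_score_)

# LassoCV tunes alpha automatically on the scaled, dummified data
lasso_cv = LassoCV(cv=5).fit(X_train_scaled, np.log(y_train))
print(lasso_cv.alpha_)
```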
As mentioned above, each model was evaluated using cross-validation. Using sklearn's KFold with a set random_state, the dataset was split into 5 different train-test folds, and each model was fed the same splits, ensuring that all models were compared on the same data. By implementing sklearn's cross_val_score and setting cv equal to the KFold object, I was able to measure the mean and standard deviation of the test folds' R2 values. Each model's results were then aggregated and compared against the others.
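A sketch of that comparison is shown below, using placeholder tuned hyperparameters and a single data format for brevity; in the project, each model was paired with its matching format from Table 1.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from xgboost import XGBRegressor

kf = KFold(n_splits=5, shuffle=True, random_state=42)  # same splits for every model
y_log = np.log(df["SalePrice"])

models = {
    "Lasso": make_pipeline(StandardScaler(), Lasso(alpha=0.001)),  # hypothetical alpha
    "Ridge": make_pipeline(StandardScaler(), Ridge(alpha=10)),     # hypothetical alpha
    "XGBoost": XGBRegressor(random_state=42),
}

for name, model in models.items():
    scores = cross_val_score(model, X_drop, y_log, scoring="r2", cv=kf)
    print(f"{name}: mean R2 = {scores.mean():.3f}, std = {scores.std():.3f}")
```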
As shown in Figure 15's line plot, the Lasso models have the highest test R2 scores, while XGBoost has a slightly lower test R2 but the lowest standard deviation. Given this tradeoff, I am willing to accept a slightly lower level of accuracy for an improvement in model robustness and reliability from dataset to dataset. Therefore, my recommendation is to deploy the XGBoost model and use it as the foundation for deciding which homes to flip.
Model Deployment and Learnings
As a reminder, the two research questions addressed during this project were:
- How can we identify undervalued homes using data?
- What are the important levers for sale prices, and are any easier/cheaper to optimize?
To answer question #1, the XGBoost model can be used to plot and calculate the actual vs. predicted SalePrice values (Figure 16) along with the residual values for each observation (Figure 17). The residuals can be particularly useful in finding exciting home-flipping opportunities because the residual value is the difference between the actual and predicted value. Therefore, this methodology can be used to identify undervalued homes that have the potential for high ROI due to their predicted value exceeding their listed value by a significant amount.
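A minimal sketch of this residual screen, assuming the label-encoded data format, a log-transformed response, and the illustrative -$30,000 cutoff used later:

```python
import numpy as np
from xgboost import XGBRegressor

# Fit the recommended model on the label-encoded data with a log response
xgb_model = XGBRegressor(random_state=42).fit(X_label, np.log(df["SalePrice"]))

# Residual = actual - predicted, back-transformed to dollars; large negative
# residuals mark homes priced well below the model's estimate of their value.
# In practice, out-of-fold or test-set predictions should be used here to
# avoid overly optimistic in-sample residuals.
pred = np.exp(xgb_model.predict(X_label))
undervalued = df.assign(Residual=df["SalePrice"] - pred).query("Residual < -30000")
print(undervalued[["SalePrice", "OverallQual", "TotalSF", "Residual"]].head())
```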
In order to understand the important levers for determining a home's price, SHAP (SHapley Additive exPlanations) was deployed with the XGBoost model. SHAP is a method for explaining individual predictions from machine learning models and provides helpful visualizations for understanding feature importance and each feature's relationship with the response variable. One of these visualizations is the beeswarm plot, which shows both the most important features and the direction of their relationship with the target variable (Figure 18). Based on this plot, it is clear that the independent variables with the greatest impact on SalePrice are HighQualFinSF, OverallQual, and TotalSF. Each of these factors has a positive relationship with SalePrice (i.e., as the feature increases, SalePrice increases). Although not as impactful, Fireplaces also boost SalePrice, which might be a cheaper/easier renovation than adding more high quality square footage or increasing the OverallQual. These variables' relationships can be further investigated in the scatterplots below.
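The SHAP calls themselves are brief; a sketch using the fitted XGBoost model from above, assuming a recent version of the shap library:

```python
import shap

# TreeExplainer works natively with fitted XGBoost models
explainer = shap.TreeExplainer(xgb_model)
shap_values = explainer(X_label)

# Beeswarm: global feature importance plus the direction of each feature's effect
shap.plots.beeswarm(shap_values)

# Scatter plot for a single feature's SHAP contribution, e.g. OverallQual
shap.plots.scatter(shap_values[:, "OverallQual"])
```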
Using XGBoost and SHAP, we can now combine our learnings from the residuals plot and the SHAP scatterplots to find homes that are both undervalued and can be readily renovated to increase their value. For example, the residual plot below highlights 7 homes with an OverallQual rating of 7 that are listed $30,000 below the predicted price (Figure 20). Based on the SHAP analysis, purchasing these homes and raising their OverallQual to 8-10 could lead to a $70,000-$120,000 increase in the sale price (from which construction costs would have to be subtracted to arrive at net profit). This is just one example of how this model can increase the chances of success when attempting to flip a home.
Conclusions and Next Steps
Using the optimal model, XGBoost, I am now able to:
- Find significantly undervalued homes
- And within those homes, identify ones that can be renovated for further increased value
Together, this gives an Ames home flipper a powerful tool that will increase their chances of success when looking for new investment opportunities.
For future work, it would be interesting to perform the following:
- Stack regressors to further improve the predictive power of the model
- Create an app that lets the user toggle a target residual value and calculate the potential upside from adjusting feature values (e.g., an improvement in OverallQual)
- Apply a similar methodology in areas with higher property values, increasing the potential return on investment