# How to increase chances of success when flipping homes?

**Introduction**

In this project, I am an Iowa-born real estate flipper, and I'm looking for a data-driven model to maximize the return on my investment. To achieve that goal, I need to answer these research questions:

- How can we identify undervalued homes using data?
- What are the important levers for sale prices, and are any easier/cheaper to optimize?

To answer these essential questions, the following high-level methodology was executed, as shown in the diagram below:

**Data Collection**

For this project, Kaggle's Ames Housing Dataset was used as the data source. This dataset contains 2,579 observations and 81 descriptive columns, one of which, "SalePrice," is the response variable for downstream analysis.

Since the data was open-source, it wasn't necessary to personally collect and structure the results, which saved a significant amount of time and energy. However, it was still necessary to preprocess the data prior to modeling.

**Data Preparation**

In order to clean and shape the data for modeling, the following data preparation methodology was deployed (Figure 2). In the next few sections, I will elaborate on the individual data preparation steps.

**Missing Data Exploration/Imputation**

Within the dataset, there were 16 categorical columns with missing data, as shown in the following barplot:

After reviewing the Ames data dictionary, it was apparent that for 15 of the 16 columns, the null values were not simply missing data but carried meaning. For example, NA values in the "Alley" column mean that those homes don't have alley access. Therefore, all the null values in these 15 categorical columns were imputed with their true meaning as given by the data dictionary.

The only categorical column whose null values did not signify a particular condition of the house was "Electrical." Its one NA value was replaced with the mode ("SBrkr"), which had a 91.7% frequency.

In addition to the categorical columns, there were 10 numerical columns with missing data, as shown in the following barplot:

For the numerical columns, the null values were imputed with the column's mean, unless other columns indicated that the observation lacked the feature entirely. For example, if a home did not have a garage, I imputed 0 for the null value in the "GarageArea" column. After dealing with the dataset's missingness, I began work on feature engineering.
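A minimal sketch of this imputation logic on a toy slice of the table (column names come from the Ames data dictionary; the values are illustrative, not real observations):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the Ames table; column names from the data dictionary.
df = pd.DataFrame({
    "Alley":       [np.nan, "Grvl", np.nan, "Pave"],
    "Electrical":  ["SBrkr", np.nan, "SBrkr", "FuseA"],
    "GarageType":  ["Attchd", np.nan, "Detchd", "Attchd"],
    "GarageArea":  [548.0, np.nan, 460.0, 608.0],
    "LotFrontage": [65.0, 80.0, np.nan, 60.0],
})

# 1) Categorical NAs that carry meaning get a descriptive label.
df["Alley"] = df["Alley"].fillna("NoAlleyAccess")
df["GarageType"] = df["GarageType"].fillna("NoGarage")

# 2) "Electrical" is the one exception: impute its single NA with the mode.
df["Electrical"] = df["Electrical"].fillna(df["Electrical"].mode()[0])

# 3) Numeric NA becomes 0 when other columns show the feature is absent,
#    otherwise the column mean.
df.loc[df["GarageType"].eq("NoGarage"), "GarageArea"] = 0
df["LotFrontage"] = df["LotFrontage"].fillna(df["LotFrontage"].mean())
```

The same pattern scales to all 15 meaningfully-null categorical columns and the 10 numerical ones.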

**Feature Engineering**

During the feature engineering process, the following 8 new columns were created:

- "TotalSF" = total square footage
- "HighQualFinSF" = high-quality finished square footage
- "BsmtFinSF" = finished basement square footage
- "TotalFullBath" = total full bathrooms
- "TotalHalfBath" = total half bathrooms
- "TotalPorch_DeckSF" = total outdoor deck/porch area
- "YearsBetweenBuildvsSold" = duration between construction and selling year
- "YearsBetweenRemodvsSold" = duration between remodeling and selling year
- "YearsBetweenBuildvsRemod" = duration between construction and remodeling year
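A sketch of a few of these engineered columns on a single illustrative row. The exact source columns feeding each sum are my reading of the feature names (e.g. that TotalSF combines basement, first-floor, and second-floor area), not something the write-up states:

```python
import pandas as pd

# One illustrative row; the assumed Ames source columns for each engineered feature.
df = pd.DataFrame({
    "TotalBsmtSF": [856], "1stFlrSF": [856], "2ndFlrSF": [854],
    "BsmtFullBath": [1], "FullBath": [2],
    "BsmtHalfBath": [0], "HalfBath": [1],
    "YearBuilt": [2003], "YearRemodAdd": [2003], "YrSold": [2008],
})

df["TotalSF"] = df["TotalBsmtSF"] + df["1stFlrSF"] + df["2ndFlrSF"]
df["TotalFullBath"] = df["BsmtFullBath"] + df["FullBath"]
df["TotalHalfBath"] = df["BsmtHalfBath"] + df["HalfBath"]
df["YearsBetweenBuildvsSold"] = df["YrSold"] - df["YearBuilt"]
df["YearsBetweenRemodvsSold"] = df["YrSold"] - df["YearRemodAdd"]
df["YearsBetweenBuildvsRemod"] = df["YearRemodAdd"] - df["YearBuilt"]
```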

**Assess Multicollinearity**

Little or no multicollinearity is one of the fundamental assumptions of Multiple Linear Regression; therefore, I needed to assess whether it was present in this dataset. A correlation heatmap of the table's numerical columns shows clear signs of multicollinearity (Figure 5).

In fact, after looping through and comparing each pair of numerical columns, a number of pairs had Pearson correlation coefficients greater than 0.70 (Figure 6). Many of these pairs involve the feature-engineered columns, which makes sense: those columns were constructed from other columns in the table, so they are necessarily related.
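The pairwise check can be sketched as follows (toy columns standing in for the Ames table; in the project the loop ran over all numerical columns):

```python
import numpy as np
import pandas as pd

# Toy columns: two deliberately related features plus an independent one.
rng = np.random.default_rng(0)
base = rng.normal(size=200)
df = pd.DataFrame({
    "GarageCars": base,
    "GarageArea": base + rng.normal(scale=0.2, size=200),  # built from the same signal
    "LotArea": rng.normal(size=200),
})

# Compare every pair of numerical columns and keep those with |r| > 0.70.
corr = df.corr()
pairs = [
    (a, b, corr.loc[a, b])
    for i, a in enumerate(corr.columns)
    for b in corr.columns[i + 1:]
    if abs(corr.loc[a, b]) > 0.70
]
```

Only the constructed pair survives the 0.70 threshold, mirroring how the engineered columns dominated Figure 6.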

The presence of multicollinearity can have a significant impact on the stability of a future Multiple Linear Regression; therefore, it was necessary to take the following measures:

- Feature selection or dimensionality reduction
- Assess models that are robust to multicollinearity (penalized linear regressions, tree-based models)

**Outlier Removal**

A common way of identifying outliers is flagging data points that are either:

- Greater than Q3 + (1.5 * IQR)
- Less than Q1 - (1.5 * IQR)
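A minimal sketch of this IQR rule on toy sale prices:

```python
import numpy as np

def iqr_outlier_mask(values):
    """Flag points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return (values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)

# Toy sale prices (in $1000s) with one extreme value.
prices = np.array([100, 110, 120, 130, 140, 900])
mask = iqr_outlier_mask(prices)
```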

However, I decided not to use this technique to remove outliers for the following reasons:

- A large volume of data would be removed, and data is not cheap (Figure 8).
- Although removal might improve the model's accuracy, that advantage would be offset by losing the ability to predict sale prices for high-value properties.
- The decision could also be reassessed after modeling, by analyzing the residuals for these observations.

During the Exploratory Data Analysis (EDA) phase, I did observe one data point that fell well outside the expected trend of SalePrice vs. various features, including TotalSF (Figure 7). To prevent this outlier from skewing the results, I removed it from the dataset.

**Preprocessing**

Finally, the last step of the data preparation phase was preprocessing, which is depicted in the process flow diagram below (Figure 9). The process included the following steps:

- The dataset's categorical variables needed to be encoded according to the downstream model of choice. Those features were either dummified with the first column dropped, dummified without the first column dropped, or LabelEncoded.
- Then each of those three iterations was either train-test split or kept intact for cross-validation. For the train-test split, the observations were divided with 80% assigned to the train set and the remaining 20% to the test set.
- Next, for the dummified, train-test-split datasets, the columns' ranges of values were normalized using StandardScaler. This was done specifically for the penalized linear regressions, because a lack of normalization would bias how features are regularized.
- The result of the process was four distinct data formats for future modeling (as shown in Table 1).

| Data Format | Dummified with Drop | Dummified without Drop | StandardScaler | Label Encoded |
|---|---|---|---|---|
| 1 | YES | NO | NO | NO |
| 2 | YES | NO | YES | NO |
| 3 | NO | YES | YES | NO |
| 4 | NO | NO | NO | YES |
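A sketch of how these formats might be produced (toy columns only; the real pipeline covered all categorical features of the Ames table):

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler

df = pd.DataFrame({
    "Neighborhood": ["NAmes", "OldTown", "NAmes", "Edwards"],
    "TotalSF": [2000.0, 1500.0, 2400.0, 1100.0],
    "SalePrice": [200000, 140000, 250000, 110000],
})
X, y = df.drop(columns="SalePrice"), df["SalePrice"]

# Formats 1-2: dummified with the first level dropped.
X_drop = pd.get_dummies(X, drop_first=True)
# Format 3: dummified without dropping the first level.
X_full = pd.get_dummies(X, drop_first=False)
# Format 4: label-encoded categoricals.
X_label = X.copy()
X_label["Neighborhood"] = LabelEncoder().fit_transform(X_label["Neighborhood"])

# 80/20 train-test split, then scale -- fitting the scaler on the train set only.
X_tr, X_te, y_tr, y_te = train_test_split(X_drop, y, test_size=0.2, random_state=42)
scaler = StandardScaler().fit(X_tr)
X_tr_s, X_te_s = scaler.transform(X_tr), scaler.transform(X_te)
```

Fitting StandardScaler on the train set alone avoids leaking test-set statistics into the model.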

Along with multicollinearity, another key assumption of Multiple Linear Regression is normality, meaning that the response variable should be normally distributed. A histogram of SalePrice shows that it's clearly skewed right (Figure 10). After applying a log transformation and replotting the histogram, the response variable is visibly closer to a normal distribution (Figure 11). In fact, calculating the skewness and kurtosis before and after the log transformation, the skewness falls from 1.76 to 0.042 and the kurtosis falls from 5.418 to 1.393. These results corroborate that the log transformation improves normality.
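The normality check can be reproduced in miniature with a synthetic right-skewed distribution (a lognormal stand-in; the skewness and kurtosis figures quoted above come from the real data, not this sketch):

```python
import numpy as np
from scipy.stats import kurtosis, skew

# Lognormal stand-in for the right-skewed SalePrice distribution.
rng = np.random.default_rng(1)
sale_price = rng.lognormal(mean=12, sigma=0.4, size=2000)

skew_before = skew(sale_price)
skew_after = skew(np.log(sale_price))
kurt_before = kurtosis(sale_price)  # excess kurtosis
kurt_after = kurtosis(np.log(sale_price))
```

As with the real SalePrice column, both skewness and kurtosis drop sharply after the log transformation.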

**Feature Selection and EDA**

This dataset contains 88 independent variables, and it would be tedious to manually evaluate each feature's impact on SalePrice. Instead of going through them one by one, I investigated two ways to expedite the filtering of important features:

- SequentialFeatureSelector from sklearn
- Loop through each column, fit an Ordinary Least Squares model from statsmodels, extract R² from each model, and filter the features with the highest R².

Although they produced essentially the same output, the methods are slightly different: SequentialFeatureSelector searches for the best combination of features, while the second approach ranks the best individual features.
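Both approaches can be sketched on synthetic data. Here I use sklearn's LinearRegression in place of statsmodels OLS for compactness, since its `.score()` returns the same R²; the column names are illustrative stand-ins:

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

# Synthetic stand-in: two informative features and one noise column.
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "TotalSF": rng.normal(size=300),
    "OverallQual": rng.normal(size=300),
    "Noise": rng.normal(size=300),
})
y = 3 * X["TotalSF"] + 2 * X["OverallQual"] + rng.normal(scale=0.5, size=300)

# Approach 1: search for the best combination of k features.
sfs = SequentialFeatureSelector(LinearRegression(), n_features_to_select=2).fit(X, y)
combo = list(X.columns[sfs.get_support()])

# Approach 2: rank features by single-variable R^2.
r2 = {c: LinearRegression().fit(X[[c]], y).score(X[[c]], y) for c in X.columns}
top2 = sorted(r2, key=r2.get, reverse=True)[:2]
```

On this toy data the two shortlists coincide, which matches the observation that the methods produced essentially the same output.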

Using either method, I was able to create a short list of important features, which could then be fed into a loop that generates a boxplot for impactful categorical features or a scatterplot for numerical ones. Daisy-chaining these feature selection techniques with this kind of loop can significantly expedite the EDA process, especially when a data table contains many columns.

The output from this analysis shows that the categorical variables with a large impact on SalePrice were the homes' quality metrics ("OverallQual", "BsmtQual", "ExterQual", "KitchenQual") and neighborhood (Figure 12). With respect to the numerical variables, many of the surface-area features had a significant impact on SalePrice as well (Figure 13).

**Model Building and Evaluation**

During this phase of the project, various linear and tree-based models were evaluated, including:

- Multiple Linear Regression
- Lasso Penalized Linear Regression
- Ridge Penalized Linear Regression
- Random Forest
- Gradient Boosting
- XGBoost

Each model was first assessed with default hyperparameters (a base model) to vet its potential applicability for this use case without wasting time tuning an ineffective approach. Once a model proved to be a good starting point, its hyperparameters were tuned to improve accuracy. Finally, each model went through 5-fold cross-validation to compare its accuracy and robustness against the other algorithms. The model building and evaluation methodology is summarized in the process flow diagram below:

Besides Multiple Linear Regression, all models' hyperparameter tuning was performed using sklearn's GridSearchCV, a well-established and widely used approach. For Ridge and Lasso, I also used sklearn's RidgeCV and LassoCV to confirm the GridSearchCV results and to tune the regularization strength automatically without manual inputs. For the penalized linear regressions, alpha was the only hyperparameter to adjust. For the tree-based models, I tuned learning_rate, subsample, n_estimators, max_depth, max_features, min_samples_leaf, and/or min_samples_split. This tuning improved the performance of each model.
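A sketch of this tuning flow for Lasso, on synthetic data (the alpha grid here is illustrative, not the grid used in the project):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, LassoCV
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the prepared Ames table.
X, y = make_regression(n_samples=300, n_features=20, noise=10.0, random_state=0)

# GridSearchCV over alpha, as done for the penalized linear models.
grid = GridSearchCV(Lasso(max_iter=10_000), {"alpha": [0.01, 0.1, 1.0, 10.0]}, cv=5)
grid.fit(X, y)
best_alpha = grid.best_params_["alpha"]

# LassoCV tunes alpha automatically and serves as a cross-check.
lcv = LassoCV(alphas=[0.01, 0.1, 1.0, 10.0], cv=5, max_iter=10_000).fit(X, y)
```

Agreement between `best_alpha` and `lcv.alpha_` is the kind of confirmation described above.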

As mentioned above, the models were evaluated with cross-validation. Using sklearn's KFold with a set random_state, the dataset was split into 5 train-test folds, and each model was fed the same splits, ensuring that every model was compared on the same data. By implementing sklearn's cross_val_score with cv set to that KFold object, I measured the mean and standard deviation of the test folds' R² values. Each model's results were then aggregated and compared against the others.
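The evaluation loop might look like this (synthetic data, and only two of the six models, to keep the sketch short):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score

X, y = make_regression(n_samples=300, n_features=10, noise=15.0, random_state=0)

# A fixed KFold so every model is scored on identical splits.
kf = KFold(n_splits=5, shuffle=True, random_state=42)

results = {}
for name, model in [("Ridge", Ridge()),
                    ("RandomForest", RandomForestRegressor(random_state=0))]:
    scores = cross_val_score(model, X, y, cv=kf, scoring="r2")
    results[name] = (scores.mean(), scores.std())  # mean and spread of test R^2
```

The mean captures accuracy and the standard deviation captures robustness, the two axes compared in Figure 15.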

As shown in Figure 15's line plot, the Lasso models have the highest test R² scores, but XGBoost pairs a nearly-as-high test R² with the lowest standard deviation. Given this tradeoff, I am willing to accept a slightly lower level of accuracy in exchange for improved robustness and reliability from dataset to dataset. Therefore, my recommendation is to deploy the XGBoost model and use it as the foundation for deciding which homes to flip.

**Model Deployment and Learnings**

As a reminder, the two research questions addressed during this project were:

- How can we identify undervalued homes using data?
- What are the important levers for sale prices, and are any easier/cheaper to optimize?

To answer question #1, the XGBoost model can be used to plot and compare actual vs. predicted SalePrice values (Figure 16) along with the residual for each observation (Figure 17). The residuals are particularly useful for finding exciting home-flipping opportunities because a residual is the difference between the actual and predicted value. This methodology can therefore identify undervalued homes with high-ROI potential: those whose predicted value exceeds their listed value by a significant amount.
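The residual screen can be sketched with hypothetical prices (all numbers illustrative; the $30,000 cutoff is just an example threshold):

```python
import pandas as pd

# Hypothetical listed (actual) prices vs. model predictions.
actual = pd.Series([150_000, 210_000, 175_000, 300_000], name="SalePrice")
predicted = pd.Series([155_000, 205_000, 215_000, 295_000], name="Predicted")

# Residual = actual - predicted; a large negative residual marks a home
# listed well below what the model thinks it is worth.
residuals = actual - predicted
undervalued = residuals[residuals < -30_000].index.tolist()
```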

In order to understand the important levers for determining a home's price, SHAP (SHapley Additive exPlanations) was deployed with the XGBoost model. SHAP is a method for explaining individual predictions of machine learning models, and it provides helpful visualizations for understanding feature importance and each feature's relationship with the response variable. One of these visualizations is the beeswarm plot, which shows both the most important features and their relationship with the target variable (Figure 18). Based on this plot, the most impactful independent variables on SalePrice are HighQualFinSF, OverallQual, and TotalSF. Each of these factors has a positive relationship with SalePrice (i.e., when x increases, y increases). Although not as impactful, Fireplaces also boost SalePrice, and adding a fireplace might be a cheaper/easier renovation than adding more high-quality square footage or increasing OverallQual. These variables' relationships can be further investigated in the scatterplots below.

Using XGBoost and SHAP, we can now combine the learnings from the residuals plot and the SHAP scatterplots to find homes that are both undervalued and can be renovated to increase their value. For example, the residual plot below highlights 7 homes with OverallQual rated 7 that are listed $30,000 below the predicted price (Figure 20). Based on the SHAP analysis, purchasing these homes and increasing their OverallQual to 8-10 could lead to a $70,000-$120,000 increase in sale price (from which construction costs would have to be subtracted to arrive at net profit). This is just one example of how this model can increase the chances of success when flipping a home.

**Conclusions and Next Steps**

Using the optimal model, XGBoost, I am now able to:

- Find significantly undervalued homes
- And within those homes, identify ones that can be renovated for further increased value

Together, this gives an Ames home flipper a powerful tool that will increase their chances of success when evaluating new investment opportunities.

For future work, it would be interesting to perform the following:

- Stack regressors to further improve the model's predictive power
- Create an app that lets the user toggle a target residual value and calculate the potential upside of adjusting feature values (e.g., improving OverallQual)
- Apply a similar methodology in areas with higher property values, increasing the potential return on investment