Finding Undervalued Homes in Ames, IA
Team Members: Jesse Egoavil, Aleksey Klimchenko, Nixon Lim, Alex Pinkerton
Source Code: GitHub Repository
The skills the authors demonstrated here can be learned through taking the Data Science with Machine Learning bootcamp with NYC Data Science Academy.
Objective and Target Audience
Our objective was to find undervalued homes, determine the most important features that could be improved, and from there pick out the improvements that would have the biggest incremental impact on sale price for homeowners or real estate investors looking to renovate or flip houses.
Data About Homes and their Features
The dataset we used consists of 2,580 observations of 82 features, with each row representing a home sold in Ames, Iowa. For the purpose of prediction, we largely focused on the bottom 80% of homes by Gross Living Area. In this subset, Sale Price and Gross Living Area (in square feet) followed a linear relationship, while in the top 20%, additional Gross Living Area contributed progressively less to Sale Price.
The Main Attribute Types in the Prices of Homes
We split the features in the dataset into three main attribute types to be handled differently in our analysis: 20 Ordinal features (such as Overall Quality, a rating between 1 and 10), 25 Categorical features (such as the names of the neighborhoods), and 36 Numeric features (such as the area, in square feet, of the home's garage). We also added a new feature, Sale Price per Gross Living Area, defined as the home's Sale Price divided by its Gross Living Area; this was our target variable for prediction.
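As a rough illustration of that setup, the snippet below builds the target and applies the 80% cut. The DataFrame and file names are placeholders, while the column names follow the standard Ames data dictionary.

```python
import pandas as pd

# Load the Ames data; the file name here is a placeholder.
housing = pd.read_csv("ames_housing.csv")

# New target variable: Sale Price per Gross Living Area
housing["SalePricePerGLA"] = housing["SalePrice"] / housing["GrLivArea"]

# Focus on the bottom 80% of homes by Gross Living Area, where the
# price-to-area relationship is roughly linear
gla_cutoff = housing["GrLivArea"].quantile(0.80)
housing = housing[housing["GrLivArea"] <= gla_cutoff]
```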
The next step was to clean the data and handle null values. The two tables below show, for each feature, the percentage of observations with null values. The numeric features of most concern were Lot Frontage and Garage Year Built. For Lot Frontage, we imputed missing values with the median Lot Frontage of the home's neighborhood. For Garage Year Built, to avoid skewing the data, we filled missing values with the median Garage Year Built of homes built in the same year, since the two were closely correlated in observations where both were present.
For categorical null values, we consulted the dataset's data dictionary, which states that for these features a null means the home does not have that feature; these nulls were set to "NoneListed."
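A minimal sketch of this imputation, using the `housing` DataFrame from the previous snippet; column names follow the Ames data dictionary.

```python
# Lot Frontage: fill nulls with the median Lot Frontage of the home's neighborhood
housing["LotFrontage"] = housing.groupby("Neighborhood")["LotFrontage"].transform(
    lambda s: s.fillna(s.median())
)

# Garage Year Built: fill nulls with the median GarageYrBlt of homes built in
# the same year, since the two are closely correlated when both are present
housing["GarageYrBlt"] = housing.groupby("YearBuilt")["GarageYrBlt"].transform(
    lambda s: s.fillna(s.median())
)

# Per the data dictionary, a categorical null means the feature is absent
categorical_cols = housing.select_dtypes(include="object").columns
housing[categorical_cols] = housing[categorical_cols].fillna("NoneListed")
```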
The Removed Observations
After extensive exploratory data analysis on the raw dataset, we chose to remove 28 total observations from the data, including one duplicate. Within the following features, we removed values that had the potential to skew the data (a sketch of this filtering follows the list):
- โMiscFeature" : "TenC", "Othr"
- Only one home had โTenCโ or tennis court and covered an abnormally large area.
- Three homes with โOthrโ, for which there was no information.
- "Utilities" : "NoSewr"
- Two homes used a septic tank, whereas the rest had public utilities included. This feature was then removed because all remaining homes had the same value.
- "Functional" : "Sal"
- One home was bought for salvaging materials.
- "Heating" : "Floor"
- One home used a floor furnace for heating.
- "SaleCondition" : "Family", "AdjLand"
- 17 homes were sold to family members, which could imply a sale price below market value.
- Two homes were sold as part of a sale of adjacent land.
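A sketch of these row removals, assuming the `housing` DataFrame from the earlier snippets; the flagged values come from the Ames data dictionary.

```python
# Flag the observations with values that could skew the data
skew_mask = (
    housing["MiscFeature"].isin(["TenC", "Othr"])
    | (housing["Utilities"] == "NoSewr")
    | (housing["Functional"] == "Sal")
    | (housing["Heating"] == "Floor")
    | housing["SaleCondition"].isin(["Family", "AdjLand"])
)
housing = housing[~skew_mask].drop_duplicates()

# Every remaining home shares the same Utilities value, so the column is dropped
housing = housing.drop(columns=["Utilities"])
```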
After the extensive initial data cleaning, we started Exploratory Data Analysis.
Exploratory Data Analysis About the Pricing of Homes
With the new target variable, Sale Price per Gross Living Area, we needed to compare how the features related to Sale Price versus Sale Price per Gross Living Area. For the numeric features, the relationships with Sale Price per Gross Living Area were more linear and more evenly spread. The ordinal features showed similar results, so Sale Price per Gross Living Area began to look like the better target variable.
Numerical:
Ordinal:
Also, for the 20 ordinal and 25 categorical features, we looked at the counts for each value within each feature and compared each value against the median Sale Price and median Sale Price per Gross Living Area. These comparisons were visualized using tables and boxplots, like the example seen below:
From the analysis, the remaining features and values behaved as expected. In the above example, the boxplots show that neighborhoods have a large effect on Sale Price, but a smaller effect on Sale Price per Gross Living Area. This tells us that, once home size is accounted for, the gap in value between neighborhoods shrinks, meaning much of the neighborhood price difference comes from some neighborhoods simply having larger homes. This kind of analysis reinforced our decision to use Sale Price per Gross Living Area over Sale Price as our target variable.
Multicollinearity:
Finally, we also checked for multicollinearity among the features so we could avoid it in the following steps. Pairs with notably high multicollinearity included Garage Area and Garage Cars, and Total Basement Square Feet and 1st Floor Square Feet.
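One simple way to surface such pairs is a pairwise correlation scan of the numeric features; the 0.8 cutoff below is an arbitrary illustration (variance inflation factors would be an alternative check).

```python
import numpy as np

numeric = housing.select_dtypes(include=np.number)
corr = numeric.corr().abs()

# Keep only the upper triangle so each pair appears once, then list the
# pairs whose absolute correlation exceeds the cutoff
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
high_pairs = upper.stack().loc[lambda s: s > 0.8].sort_values(ascending=False)
print(high_pairs)  # e.g. GarageCars/GarageArea, TotalBsmtSF/1stFlrSF
```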
Preprocessing
For the linear models, we performed feature selection on the numerical and categorical features separately. For the numerical features, we removed those that caused multicollinearity or did not significantly improve our baseline multiple linear model, judged by adjusted R-squared.
The 20 ordinal features were label encoded, with null values or NoneListed represented as 0. The categorical features were then dummified, dropping the first dummy variable in each group. The label-encoded ordinal features and dummy variable groups were then added to our remaining table of numerical features in a stepwise process, again making sure that each addition increased the adjusted R-squared. At the end of this process, we kept 117 out of a maximum of 234 features in our dataset.
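The sketch below illustrates this encoding and a greedy forward step on a handful of features; the ordinal mapping, feature lists, and helper function are illustrative stand-ins for the project's full 234-feature search.

```python
import pandas as pd
import statsmodels.api as sm

# Label encode a few ordinal quality features (NoneListed / unmapped levels -> 0)
quality_map = {"NoneListed": 0, "Po": 1, "Fa": 2, "TA": 3, "Gd": 4, "Ex": 5}
for col in ["ExterQual", "ExterCond", "KitchenQual", "BsmtQual"]:
    housing[col] = housing[col].map(quality_map).fillna(0)

def adjusted_r2(X, y):
    """Fit an OLS model and return its adjusted R-squared."""
    model = sm.OLS(y, sm.add_constant(X.astype(float))).fit()
    return model.rsquared_adj

# Start from a small numeric base, then add each dummy group only if it
# improves adjusted R-squared
y = housing["SalePricePerGLA"]
X = housing[["OverallQual", "YearBuilt", "1stFlrSF"]].copy()
baseline = adjusted_r2(X, y)
for feature in ["Neighborhood", "MSZoning"]:
    group = pd.get_dummies(housing[feature], prefix=feature, drop_first=True)
    candidate = pd.concat([X, group], axis=1)
    if adjusted_r2(candidate, y) > baseline:
        X, baseline = candidate, adjusted_r2(candidate, y)
```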
Random Forest used a larger set of features. This included all numerical features, label encoded ordinal features, and all dummified categorical variables, without dropping any dummy variables.
Models, Methods, and Analysis About the Pricing of Homes
The first models fitted were multiple linear regressions. For the sake of interpretability, the numeric features were normalized so that, when looking at the coefficients later, we could easily discern which features had the strongest positive and negative associations with a home's Sale Price per GLA. This multiple linear regression model gave a strong indication that certain neighborhoods in Ames had high Sale Price per GLA, especially Green Hill. Features with strongly negative coefficients included several MSSubClass and Zoning categories, indicating that homebuyers tend to avoid homes with those codes.
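A minimal sketch of this baseline fit, reusing the selected matrix `X` and target `y` from the preprocessing sketch; for brevity every column is standardized here, whereas the project normalized only the numeric features.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

ols = make_pipeline(StandardScaler(), LinearRegression())
ols.fit(X_train, y_train)

# With standardized inputs, coefficient magnitudes are directly comparable
coefs = pd.Series(ols.named_steps["linearregression"].coef_, index=X.columns)
print(coefs.sort_values())           # most negative first, most positive last
print(ols.score(X_test, y_test))     # R-squared on the held-out set
```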
For homeowners and real estate investors seeking to make home improvements that raise the Sale Price, we then narrowed the features down to those that could actually be improved (Kitchen Quality can be improved, for example, but the neighborhood a home is in cannot be changed) to see which improvements would have the highest impact on a home's Sale Price.
Penalized Linear Regression
The same procedure was repeated with penalized models, grid searching for the best estimator's value of alpha (which controls the strength of the penalty on the coefficients relative to minimizing the RSS). The Lasso model was particularly useful for understanding feature importance, since the weakest predictors of Sale Price per GLA had their coefficients shrunk to zero and were effectively removed by the model. The penalized models showed a very similar pattern to the multiple linear regression in terms of the most positive and most negative coefficients and the improvable features.
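A sketch of the alpha search for the Lasso, reusing `X` and `y` from the earlier sketches; the alpha grid is illustrative.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

lasso_pipe = make_pipeline(StandardScaler(), Lasso(max_iter=10000))
search = GridSearchCV(
    lasso_pipe,
    param_grid={"lasso__alpha": np.logspace(-4, 1, 30)},
    cv=5,
    scoring="r2",
)
search.fit(X, y)

# Coefficients shrunk to exactly zero mark the weakest predictors of SalePricePerGLA
best_lasso = search.best_estimator_.named_steps["lasso"]
zeroed = X.columns[best_lasso.coef_ == 0]
print(search.best_params_, f"{len(zeroed)} features zeroed out")
```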
Random Forest Regression
Following the same process, we implemented Random Forest Regression. After tuning the number of trees and the maximum depth each tree could reach, we arrived at values of 500 and 11, respectively. The charts below rank the features by feature importance, since Random Forest does not produce coefficients.
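A sketch of the fit with those tuned values; the project trained on the broader, un-dropped dummy feature set, but `X_train` and `y_train` from the earlier split are reused here for brevity.

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rf = RandomForestRegressor(n_estimators=500, max_depth=11, random_state=0, n_jobs=-1)
rf.fit(X_train, y_train)

# Random Forest ranks features by importance instead of coefficients
importances = pd.Series(rf.feature_importances_, index=X_train.columns)
print(importances.sort_values(ascending=False).head(20))
```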
Again, we filtered out the features that cannot be changed to make a new chart.
Unfortunately, the Random Forest is not a reliable model for this purpose. When using this model, one must weigh the maximum depth of the trees against the number of observations. Our training set contained 1,531 observations, and each split separates the observations into two groups. With a best maximum depth of 11, a single tree can partition the training set into as many as 2^11 = 2,048 leaves, more than the number of observations. From this, we concluded the model was clearly overfitted.
Support Vector Regression (Linear)
To obtain coefficients for identifying the important features, we used a linear kernel. As with the previous models, we used grid search to tune the C and epsilon (ε) parameters. The top overall features were 1st Floor Square Feet, 2nd Floor Square Feet, Overall Quality, and Year Built. Among the improvable features, the top ones were home exterior material, Overall Quality, and Overall Condition.
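A sketch of the linear-kernel SVR tuning; the C and epsilon grids are illustrative, and with a linear kernel the fitted coef_ can be read like regression coefficients.

```python
import pandas as pd
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

svr_pipe = make_pipeline(StandardScaler(), SVR(kernel="linear"))
search = GridSearchCV(
    svr_pipe,
    param_grid={"svr__C": [0.1, 1, 10, 100], "svr__epsilon": [0.01, 0.1, 1]},
    cv=5,
)
search.fit(X_train, y_train)

best_svr = search.best_estimator_.named_steps["svr"]
svr_coefs = pd.Series(best_svr.coef_.ravel(), index=X_train.columns)
print(svr_coefs.abs().sort_values(ascending=False).head(10))
```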
XGBoost Regression
We decided to use the XGBoost library for our gradient boosting model. Since our categorical data had already been label or dummy encoded, we could apply the XGBRegressor model directly to our data. The default gbtree booster was used instead of gblinear or dart because it gave higher accuracy with default parameters.
XGBoost offers a wide selection of parameters for fine tuning. We ended up tuning 9 different parameters via grid search: learning_rate (0.05), n_estimators (1000), gamma (0.5), max_depth (3), min_child_weight (10), subsample (0.4), colsample_bytree (1), reg_alpha (1), reg_lambda (3). Most parameters were tuned sequentially, so that changes could be made incrementally, and a final tuning pass confirmed these were the optimal values.
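A sketch of the final model with those tuned values; `X_train`, `X_test`, `y_train`, and `y_test` are assumed from the earlier split.

```python
from xgboost import XGBRegressor

xgb_model = XGBRegressor(
    learning_rate=0.05,
    n_estimators=1000,
    gamma=0.5,
    max_depth=3,
    min_child_weight=10,
    subsample=0.4,
    colsample_bytree=1,
    reg_alpha=1,
    reg_lambda=3,
)
xgb_model.fit(X_train, y_train)
print(xgb_model.score(X_test, y_test))

# Feature importance by "weight": how often each feature is used in a split
weights = xgb_model.get_booster().get_score(importance_type="weight")
top20 = sorted(weights.items(), key=lambda kv: kv[1], reverse=True)[:20]
```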
Upon fitting our XGB model, we examined its feature importances and compared them to those of our other models. The top 20 features by weight are shown below:
Many of these features were considered too difficult to improve without a great amount of effort on the homeowner's part; removing them gives a better look at the features that would be easiest for a homeowner to improve in order to raise the SalePrice of their home. After removing those "difficult" features, we arrived at the following list:
Results, Prediction, and Conclusion
Best Features to Improve House Sale Prices
We chose to change the five features that all models valued positively, so that a home's SalePricePerGLA, and thereby its SalePrice, would rise:
- 'OverallQual' : set to 9 (out of 10)
- 'OverallCond': set to 9 (out of 10)
- 'BsmtFinSF1': add 1/2 of BsmtUnfSF
- Reduce 'BsmtUnfSF' by half (divide by 2)
- 'Functional': set to 6 (equivalent of Min1)
- 'BsmtExposure': set to 3 (equivalent of Av)
These settings are the changes applied to undervalued homes in order to predict a possible increased price after home improvements are made.
Model Results for the Pricing Data for Homes
From our modeling, we got the following scores:
We chose to use the more flexible models over the penalized linear models for the rest of our analysis, sacrificing a clearer relationship between the features and the target variable for accuracy in our final result.
Predicting Improved Sale Prices for Houses
We used a naive threshold for determining whether a home was undervalued. This threshold would likely be different if we had domain knowledge or more data available to us.
First, we computed the mean and standard deviation of SalePricePerGLA for each neighborhood, assuming SalePricePerGLA is roughly normally distributed within each neighborhood. One neighborhood contained only one home, so we removed it from the list.
Next, we computed the threshold for each neighborhood: that neighborhood's mean SalePricePerGLA minus its standard deviation. In other words, our upper limit for an undervalued home was one standard deviation below the neighborhood mean SalePricePerGLA. Any home below this threshold was set aside.
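A sketch of this rule, using the cleaned `housing` DataFrame and the SalePricePerGLA column built earlier.

```python
# Per-neighborhood mean and standard deviation of SalePricePerGLA
stats = housing.groupby("Neighborhood")["SalePricePerGLA"].agg(["mean", "std", "count"])
stats = stats[stats["count"] > 1]               # drop the single-home neighborhood

# Undervalued: more than one standard deviation below the neighborhood mean
threshold = stats["mean"] - stats["std"]
undervalued = housing[housing["SalePricePerGLA"] < housing["Neighborhood"].map(threshold)]
```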
After gathering all the undervalued homes, we set the top improvable features from all models to near-max values, as stated in the previous section. We paid particular attention to 'BsmtFinSF1', or Finished Basement Square Feet. To make sure we did not artificially increase the house size, we reduced the home's 'BsmtUnfSF', or Unfinished Basement Square Feet, by half when increasing the Finished Basement Square Feet. Using our table of adjusted observations and fitted models, we predicted a new SalePricePerGLA for the "improved" homes and multiplied it by each corresponding GrLivArea to get an "improved" Sale Price for each observation.
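The sketch below applies those adjustments and re-predicts with the fitted XGBoost model from earlier; it assumes the undervalued rows carry the same encoded feature columns the models were trained on.

```python
improved = undervalued.copy()

# Near-max settings for the improvable features chosen earlier
improved["OverallQual"] = 9
improved["OverallCond"] = 9
improved["Functional"] = 6        # label-encoded equivalent of Min1
improved["BsmtExposure"] = 3      # label-encoded equivalent of Av

# Finish half of the unfinished basement instead of adding new square footage
improved["BsmtFinSF1"] += improved["BsmtUnfSF"] / 2
improved["BsmtUnfSF"] /= 2

# Predict the post-improvement SalePricePerGLA, then convert back to Sale Price
new_ppg = xgb_model.predict(improved[X_train.columns])
improved["ImprovedSalePrice"] = new_ppg * improved["GrLivArea"]
```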
"New" and Improved Homes by Model
From the top 50 homes (by gain in predicted Sale Price) of each model, we had the following overlaps of PIDs:
- XGBoost/Lasso Overlap: 21
- RF/Lasso Overlap: 16
- XGBoost/RF Overlap: 32
- SVR/Lasso Overlap: 15
- RF/SVR Overlap: 17
- XGBoost/SVR Overlap: 14
We believe the homes from the XGBoost/SVR overlap would be the best picks, since those two models had the highest test scores.
XGBoost Regression:
SVR:
Further Work About Undervalued Homes
With more time, we would like to:
- Determine a better criterion for defining an undervalued home.
- Get information from domain experts.
- Use Binary Classification Models to predict the undervalued homes and find features that may stand out for these homes.
- Logistic Regression
- Random Forest Classification
- Support Vector Classification
- XGBoost Classification
- For the predicted improved home price, try predicting Sale Price directly instead of Sale Price per Gross Living Area.
- Repeat our work with the top 20% by Gross Living Area group and compare.
- Get more data for the top 20%, as there are currently too few observations to build a good model.