Evaluating Home Buying Opportunities with Machine Learning
Purchasing a home is one of the most consequential decisions a person can make. Whether the goal is to secure a home for your family, invest in a high-returning asset, or a bit of both, it is pivotal that a prospective home buyer makes the purchase at the right price. Determining the appropriate value of a home, however, is a significant challenge, because a home's valuation depends on a seemingly endless number of features the home can possess. To address this challenge, this project applies machine learning techniques to predict the sale prices of houses in Ames, Iowa.
Which home features are the most impactful on a home price?
And which of these can turn the life-changing decision to purchase a home into the right one?
Please join me to find out.
Data Set and Model Evaluation
The dataset used in this exercise was curated from Kaggle and contains information on nearly 3,000 homes in Ames, Iowa. Each home is described by 79 features that paint a full picture of every aspect of the property. These features, both categorical and continuous in nature, will be considered by our machine learning models when evaluating the final price of each home.
The performance of the machine learning models in this project will be evaluated by comparing the predicted home values with the actual sales price. The metric through which we measure this will be the Root Mean Squared Error (RMSE).
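For concreteness, the metric itself takes only a few lines to compute. The prices below are made up for illustration, not drawn from the dataset:

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root Mean Squared Error between actual and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Illustrative sale prices, not taken from the Ames data
actual = [200_000, 150_000, 310_000]
predicted = [195_000, 160_000, 300_000]
print(round(rmse(actual, predicted), 2))  # → 8660.25
```

Note that when the target has been log-transformed (as we do below), the RMSE is computed on the log scale, which is why the model scores reported later are small numbers like 0.124.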
Before training our models, it is essential to examine our data and transform it into a form suitable for our predictive models.
Observing the Dependent Variable (Sale Price)
As a start, when we observe what we are trying to predict, we can see that Sale Price is positively (or right) skewed, with most home prices landing in the $100k to $200k range. This indicates that Sale Price is not normally distributed, which causes issues when we train linear models. Plotting the quantiles of this distribution against the normal distribution through a Q-Q plot further confirms this suspicion.
To address this issue, applying a logarithmic transformation will make the distribution of Sale Price more fit for training our models as we can confirm with the histogram and Q-Q plot that we see after applying this transformation.
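A minimal sketch of the transformation, using simulated log-normal prices as a stand-in for the actual Sale Price column:

```python
import numpy as np

rng = np.random.default_rng(0)
# Simulated right-skewed prices (log-normal), standing in for Sale Price
prices = rng.lognormal(mean=12, sigma=0.4, size=1000)

def skewness(x):
    """Sample skewness: third central moment over cubed standard deviation."""
    x = np.asarray(x, dtype=float)
    return float(np.mean((x - x.mean()) ** 3) / x.std() ** 3)

# log1p (log of 1 + x) is a common choice on Kaggle; it handles zeros safely
log_prices = np.log1p(prices)
print(skewness(prices) > skewness(log_prices))  # the transform reduces skew
```

After the transformation, the skewness is close to zero, which is what the post-transformation histogram and Q-Q plot confirm visually.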
Moving along, we further transform our data by removing outliers. The two features most useful for identifying outliers in this exercise are General Living Area and Total Basement Square Footage.
General Living Area
Here we see two observations with substantial living areas that have Sale Prices that are not consistent with what we would expect.
When we remove these outliers, the linear relationship between Sale Price and General Living Area is a little clearer.
Total Basement Square Footage
On a similar note, we see homes with significantly larger basements, which do not seem to be in line with the linear relationship between basement size and sale price.
Removing these outliers allows us to more easily see the relationship between the variables.
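The outlier-removal step can be sketched as a simple filter. The Kaggle-style column names (`GrLivArea`, `TotalBsmtSF`, `SalePrice`) and the thresholds below are illustrative, not the exact values used in the project:

```python
import pandas as pd

# Toy frame with Kaggle-style column names; values are made up
df = pd.DataFrame({
    "GrLivArea":   [1500, 1800, 4800, 2100, 5600],
    "TotalBsmtSF": [900, 1100, 6100, 1000, 1200],
    "SalePrice":   [180_000, 210_000, 170_000, 250_000, 190_000],
})

# Drop homes with very large living areas or basements that nonetheless sold
# cheaply, since they break the otherwise linear area-price relationship
mask = ~(((df["GrLivArea"] > 4000) | (df["TotalBsmtSF"] > 6000))
         & (df["SalePrice"] < 300_000))
df_clean = df[mask]
print(len(df_clean))  # → 3
```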
Missing Data / Imputation
The next step in preparing our data lies in accounting for missing data. Thirty features in this data set have missing values, and all but 6 of them are missing in less than 5% of observations, as shown below:
To remedy this issue, we must either impute the missing values or drop less important features that contain them. Broadly speaking, the imputation methods can be summarized as follows:
- Categorical variables (missingness implies the absence of the variable)
- Continuous variables (missingness typically implies zero, but in some circumstances, mean or median is used)
- Other (continuous variables where the mean or median is appropriate, or variables that can be dropped altogether due to severe imbalance)
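The three strategies above can be sketched with pandas. The column names and fill rules here are illustrative stand-ins, not the project's full imputation map:

```python
import pandas as pd

# Toy example; columns are illustrative stand-ins for the dataset's features
df = pd.DataFrame({
    "PoolQC":      [None, "Gd", None, None],    # categorical: NaN = no pool
    "TotalBsmtSF": [900.0, None, 1100.0, 0.0],  # continuous: NaN implies zero
    "LotFrontage": [60.0, None, 80.0, 70.0],    # continuous: fill with median
})

# Categorical: missingness implies the absence of the feature
df["PoolQC"] = df["PoolQC"].fillna("None")
# Continuous: missingness implies zero
df["TotalBsmtSF"] = df["TotalBsmtSF"].fillna(0)
# Other: fall back to the median of the observed values
df["LotFrontage"] = df["LotFrontage"].fillna(df["LotFrontage"].median())

print(df.isna().sum().sum())  # → 0, no missing values remain
```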
Now that we’ve prepared our data, we can train our models and evaluate which are the most effective in predicting home prices.
Using our methods, we find that Lasso Regression is the most effective model on a standalone basis (RMSE of 0.124). However, as the different methodologies have complementary merits, such as XGBoost’s capacity to capture nonlinear relationships or Random Forest’s effectiveness in decorrelating its constituent trees, the best overall approach is to ensemble the methodologies with equal weights (RMSE of 0.119).
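A minimal sketch of equal-weight ensembling: average the predictions of the fitted models. The log-price predictions below are made up for illustration (in the project they would come from Lasso, XGBoost, and Random Forest), and are constructed so the individual models' errors cancel:

```python
import numpy as np

# Hypothetical log-price predictions from three models for five homes
pred_lasso = np.array([12.0, 11.9, 12.3, 12.2, 11.9])
pred_xgb   = np.array([12.1, 11.8, 12.4, 12.1, 12.0])
pred_rf    = np.array([12.05, 11.85, 12.5, 12.0, 11.95])

# Equal-weight ensemble: a simple average of the three prediction vectors
ensemble = (pred_lasso + pred_xgb + pred_rf) / 3

def rmse(y_true, y_pred):
    return float(np.sqrt(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)))

actual = np.array([12.05, 11.85, 12.4, 12.1, 11.95])
# In this toy case the blend beats every single model, because the models'
# errors point in different directions and average out
print(rmse(actual, ensemble) < min(rmse(actual, p)
      for p in (pred_lasso, pred_xgb, pred_rf)))  # → True
```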
In our mission to determine which features have the most impact on Sale Price, we can apply a process called regularization, which isolates the most important features by leveraging a penalty term, alpha, to shrink the linear regression coefficients of less important features to exactly zero.
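A small demonstration of this effect on synthetic data (not the Ames features), using scikit-learn's `Lasso`: only the first two features carry signal, and the L1 penalty zeroes out the rest.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(42)
n = 200
# Five standardized features; only the first two actually drive the target
X = rng.standard_normal((n, 5))
y = 3.0 * X[:, 0] + 2.0 * X[:, 1] + 0.1 * rng.standard_normal(n)

# The L1 penalty (alpha) shrinks all coefficients; those of uninformative
# features are driven exactly to zero, effectively selecting features
model = Lasso(alpha=0.5).fit(X, y)
print(int(np.sum(model.coef_ == 0)))  # the three noise features are zeroed
```

The surviving coefficients are shrunk toward zero by roughly alpha, which is the price paid for the automatic feature selection.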
When we apply regularization through our Lasso Regression model, we see that the variables with the greatest influence on home value are:
- Overall Quality
- General Living Area
- Total Bathrooms
- Garage Car Capacity
- Basement Square Footage
All of these variables can serve as levers to build value and present opportunities to purchase homes at a favorable price.
In conclusion, when leveraging machine learning techniques to make an informed home purchasing decision, it is best to ensemble different machine learning models, each with its own merits, to arrive at a valuation. That said, once a home is purchased and avenues to improve its value are being considered, we humbly offer the following recommendations to generate value:
- Renovate the material and finish of the home to improve Overall Quality
- Build a new wing in the home to expand General Living Area
- Add additional bathrooms to the home
- Expand car capacity in the garage
- Consider renovating and expanding the basement