Ames Housing
Introduction
Purchasing a home is one of the biggest decisions of a person's life. A mortgage can last decades, and deciding whether to take on that debt is a crucial part of the home-buying process. A multitude of factors go into the decision, and home price is clearly one of the major ones. The price of a home can even determine whether a buyer is eligible to purchase it, since many mortgages are income-based and come with upper limits.
Many apps already exist (e.g. Zillow, Realtor, Redfin) that automate this process and give buyers the opportunity to see potential home prices and costs. Famously, Zillow offered a one-million-dollar prize to anyone in the world who could improve its home-price prediction algorithm (1). The goal of this project is not to improve on or compete with these sophisticated sites, but rather to gain insight into how they work and how one might approach the problem of home price prediction.
Dataset
The data set used is from a Kaggle competition (2); it consists of home features and sale prices from Ames, a small town in Iowa, with the following:
- 2580 rows
- 81 columns (i.e. features)
- 38 numerical features
- 43 categorical features
Preprocessing
The numerical features have <1% missing values on average, as shown below. The feature with the most missing data is 'GarageYrBlt' (5%).
The categorical features have far more missing values than the numerical ones (>20% on average), and the top 5 (shown in the histogram below) are each more than 50% missing.
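As a rough sketch, these percentages can be computed directly with pandas (the file name 'AmesHousing.csv' and the DataFrame name df are illustrative assumptions, not the exact names used in this project):

```python
import pandas as pd

# Hypothetical file name for the Kaggle download.
df = pd.read_csv("AmesHousing.csv")

# Percent of missing values per column, largest first.
missing_pct = df.isnull().mean().sort_values(ascending=False) * 100
print(missing_pct.head())
```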
Missing values in the numerical features were marked as NA's.
The categorical features were dummified (one-hot encoded), which expanded them from 43 to 255 columns. The dummification process itself handles the categorical missing values, since a missing category simply receives 0 in every dummy column.
Furthermore, sklearn was used to split the data (post-dummification) into train/test sets at 40%/60%.
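A minimal sketch of this preprocessing, reusing the df from the sketch above and assuming the target column is named 'SalePrice':

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Separate the target from the features.
y = df["SalePrice"]
X = df.drop(columns=["SalePrice"])

# One-hot encode the categoricals (43 -> 255 columns); rows with a
# missing category get 0 in every resulting dummy column.
X = pd.get_dummies(X)

# 40%/60% train/test split, mirroring the ratio stated above.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.4, random_state=0
)
```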
Feature Importance
Initially, feature importance was extracted using two different methods: a linear method (sklearn's f_regression) and a non-linear 'mutual information' method (sklearn's mutual_info_regression). The mutual information algorithm allows for a degree of non-linearity in the input features and can differ from the linear method when there is strong non-linear dependence.
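A sketch of how both scores can be obtained, continuing from the split above (sklearn cannot handle NaN's, so any remaining numeric gaps are filled with 0 here purely for illustration):

```python
import pandas as pd
from sklearn.feature_selection import f_regression, mutual_info_regression

# Fill remaining NaN's so the sklearn scorers can run.
X_tr = X_train.fillna(0)

# Linear scores: per-feature univariate F-statistic.
f_scores, _ = f_regression(X_tr, y_train)

# Non-linear scores: estimated mutual information with the target,
# which can capture non-linear dependence.
mi_scores = mutual_info_regression(X_tr, y_train, random_state=0)

# Top 3 features under each method; these can disagree.
print(pd.Series(f_scores, index=X_tr.columns).nlargest(3))
print(pd.Series(mi_scores, index=X_tr.columns).nlargest(3))
```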
The charts below show that the top 3 features differ between the linear and non-linear feature importance algorithms. This hints that there are underlying non-linearities in our data that we need to take into account for proper modeling.
A random forest (RF) model was also run on the data, and the feature importance derived from this model was used as the main result for the evaluation. RF is much more reliable than linear regression models when there are inherent non-linearities in the underlying data set. The RF feature importances are shown below; the top 3 features for predicting home price are (see the sketch after the list):
- OverallQual: A rating from 1 to 10 describing the overall quality of the home. Factors that can affect it include the year remodeled, lawn upkeep, etc. It is a sort of 'curb appeal' measure of the home.
- GrLivArea: The above-grade living area of the home in sq. ft.; an indirect measure of house size.
- TotalBsmtSF: The total size of the basement in sq. ft.
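A minimal sketch of how these RF importances can be extracted, continuing from the variables defined above:

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rf = RandomForestRegressor(random_state=0)
rf.fit(X_train.fillna(0), y_train)

# Impurity-based importances, one per column, summing to 1.
importances = pd.Series(rf.feature_importances_, index=X_train.columns)
print(importances.nlargest(3))  # OverallQual, GrLivArea, TotalBsmtSF
```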
Modeling
Five different models were used in this analysis and their results were compared in order to assess which performs best in terms of predictive accuracy.
For each model we used the following:
- GridSearch: used to tune the hyperparameters
- 2-fold CV: each hyperparameter candidate was evaluated with two-fold cross-validation on the training data
- 2 jobs: run with n_jobs=2 to improve calculation time
- Score = R^2: the metric used to evaluate the models
- 'random_state': [0]: all models were initialized with the same random state in order to produce a fair comparison
The five models and the hyperparameters tuned are shown below, followed by a sketch of the grid-search setup.
MLR Hyperparams:
- normalize
Lasso/Ridge Hyperparams:
- 'alpha': range(0,100,10)
- 'normalize': [False, True]
- 'tol': [1e-1]
Elastic-Net Hyperparams:
- 'l1_ratio': np.linspace(0,1,11)
- 'alpha': range(0,100,10)
- 'normalize': [False, True]
Random Forest (RF) Hyperparams:
- 'n_estimators': range(10,150,10)
- 'max_features': range(1,100,10)
- 'max_samples': np.linspace(0,1,11)
Gradient Boost (GB) Hyperparams:
- 'learning_rate': np.linspace(0,1,11)
- 'n_estimators': range(100,300,25)
- 'subsample': np.linspace(0,1,11)
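As an illustrative sketch, the GB search could be wired up as follows; note that learning_rate and subsample must be strictly positive for GradientBoostingRegressor, so these grids start at 0.1 rather than 0:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

# Grid mirroring the GB hyperparameters listed above (shifted off 0).
param_grid = {
    "learning_rate": np.linspace(0.1, 1, 10),  # must be > 0
    "n_estimators": range(100, 300, 25),
    "subsample": np.linspace(0.1, 1, 10),      # must be in (0, 1]
    "random_state": [0],
}

# 2-fold CV, R^2 scoring, and 2 parallel jobs, as described above.
search = GridSearchCV(
    GradientBoostingRegressor(),
    param_grid,
    cv=2,
    scoring="r2",
    n_jobs=2,
)
search.fit(X_train.fillna(0), y_train)
print(search.best_params_, search.best_score_)
```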
The table below shows the five models' training results, and there is a clear winner: the GB approach, with <2% error on the test data.
The plot below shows how well the GB model (blue points) predicts the actual home prices (red line).
Feature Selection
In order to make this model more practical to use, the top 15 features were evaluated using the GB approach. If we were to actually deploy our results to the public, each user would have to enter 81 different features of a given home in order to get a prediction. This is very tedious, so we performed feature selection to determine which features matter most.
The feature selection process consisted of running models with an increasing number of features, taken in order from the RF feature importance results shown earlier: first a model with a single input feature (OverallQual), then a second model with two features (OverallQual + GrLivArea), and so on. We ran 15 such models and showed that the top 6 features alone achieve an accuracy of ~95% (see the figure and the sketch below). Thus, if we wanted to deploy this model, a user would only need to enter the top 6 features to get a price estimate within ~5%.
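A sketch of this loop, reusing the RF importances computed earlier; for brevity it uses GB defaults rather than the tuned hyperparameters:

```python
from sklearn.ensemble import GradientBoostingRegressor

# Feature names ordered by the RF importances computed earlier.
top_features = importances.nlargest(15).index

scores = []
for k in range(1, 16):
    cols = top_features[:k]
    model = GradientBoostingRegressor(random_state=0)
    model.fit(X_train[cols].fillna(0), y_train)
    # Test-set R^2 with only the top k features.
    scores.append(model.score(X_test[cols].fillna(0), y_test))

# scores[5] is the R^2 using the top 6 features (~0.95 in our runs).
```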
Conclusions and Future Work
We were able to show that a GB model (using all 81 features) can predict the price of a home with an accuracy of ~98%. Furthermore, we showed that a model using only the top 6 features still retains an accuracy of ~95%.
There is still a lot of potential future work for this effort:
- Deploy a WebApp using the GB model
- Include an interactive map of predicted home prices by area
- Explore more models (Neural Networks, SVM…)
- Normalize skewed data
- Explore further feature engineering
References
1. “Zillow Prize.” Zillow, 1 June 2020, www.zillow.com/z/info/zillow-prize/.
2. PremaV. “Ames Housing.” Kaggle, 10 Sept. 2018, www.kaggle.com/datasets/prevek18/ames-housing.