Ames Iowa Housing Market Data Analysis
The skills I demonstrated here can be learned through the Data Science with Machine Learning bootcamp at NYC Data Science Academy.
Introduction
The Ames, Iowa Housing Dataset is a public dataset made available by Kaggle. It records houses sold in Ames, Iowa between 2006 and 2010, with each entry represented by 79 explanatory variables and the sale price of the house. This project uses regression techniques to predict sale price from the explanatory variables and to determine which variables are most likely to increase the property value of a home in Ames.
The code used for this project can be found on GitHub.
About the dataset
The Ames, Iowa dataset was collected over the course of several years in order to characterize the housing market of Ames, Iowa. Before conducting this analysis, I hypothesized which variables would correlate most strongly with higher housing prices. I predicted that the following four variables would be the most important.
- Lot Area - The square feet of the lot for sale
- Overall Quality - A rating (1-10) defining the material and finish quality of the home
- Ground Living Area - The home's square footage above ground level (excluding the basement level)
- Neighborhood Location - The name of the neighborhood where the house is located
Background Data
Before exploring regression techniques, I read the dataset documentation, which can be found here, and followed its recommendations. The most important caveat in the documentation concerns outliers. The study that collected this data noted that several entries record properties with very high square footage sold at very low prices relative to other houses. It therefore recommends omitting any house with a ground living area greater than 4000 square feet. After removing these outliers, the plot of sale price versus ground living area shows a clear positive trend.
In addition to assessing potential outliers, I investigated sale price by year built. I wanted to see how the distribution of sale price relates to the year a home was built, in order to gauge the impact of the market on how many houses were built, the sale price of those houses, and the general effect of inflation. The plot shows an upward trend consistent with inflation, and it suggests that certain economic events affected the number of houses built in each year from 1991 until 2010.
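The outlier rule described above can be sketched in a few lines of pandas. This is a minimal illustration, not the project's actual code: the column names `GrLivArea` and `SalePrice` are assumed to follow the Kaggle file, the helper `drop_large_homes` is hypothetical, and the rows are made up.

```python
import pandas as pd

def drop_large_homes(df: pd.DataFrame, max_area: float = 4000) -> pd.DataFrame:
    """Keep only rows whose ground living area is at most max_area square feet."""
    return df[df["GrLivArea"] <= max_area].reset_index(drop=True)

# Tiny made-up frame for illustration only (not real Ames rows)
toy = pd.DataFrame({
    "GrLivArea": [1500, 2100, 4700, 5600],
    "SalePrice": [180000, 250000, 185000, 160000],
})

cleaned = drop_large_homes(toy)
print(len(toy), "rows before,", len(cleaned), "rows after")
```

Filtering on living area alone, rather than on price, follows the documentation's recommendation as summarized above.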
Missing Data
Many entries in this dataset contain missing values. In total, 19 variables show some degree of missingness. Of those variables, four have a very high share of missing values (over 75% of all entries):
- Pool Quality
- Fence
- Alley
- Misc Features
In order to get the most value out of this dataset, I imputed the missing data from characteristics of the available data instead of dropping variables. According to the most recent city planning report that I was able to find, neighborhoods within Ames share some uniform characteristics because of the city's water resource management plan. Given this uniformity of neighborhood development and lot size, I judged it reasonable to impute missing values for the following variables from the average for the neighborhood of each entry.
- Garage year built
- Lot frontage
For the remaining 17 variables, I imputed the data based on the average for all values of the variable.
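The two imputation strategies above might look roughly like the sketch below. The column names (`Neighborhood`, `LotFrontage`, `MasVnrArea`) are assumed from the Kaggle file, both helper functions are hypothetical, and the values are invented. The overall-mean helper only makes sense for numeric columns; how categorical variables were filled is not stated in this write-up, so that case is left out.

```python
import numpy as np
import pandas as pd

def impute_by_neighborhood(df, column, group_col="Neighborhood"):
    """Fill missing values in `column` with the mean of the row's neighborhood."""
    group_mean = df.groupby(group_col)[column].transform("mean")
    return df[column].fillna(group_mean)

def impute_overall_mean(series):
    """Fill missing values in a numeric column with the column-wide mean."""
    return series.fillna(series.mean())

# Made-up rows for illustration only
toy = pd.DataFrame({
    "Neighborhood": ["NAmes", "NAmes", "NAmes", "OldTown", "OldTown"],
    "LotFrontage": [70.0, np.nan, 80.0, 50.0, np.nan],
    "MasVnrArea": [100.0, np.nan, 200.0, 0.0, 0.0],
})

toy["LotFrontage"] = impute_by_neighborhood(toy, "LotFrontage")
toy["MasVnrArea"] = impute_overall_mean(toy["MasVnrArea"])
print(toy)
```

If every entry in a neighborhood is missing the value, the group mean is itself missing and the gap remains, so a fallback to the overall mean may be needed in practice.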
Data Cleaning
Several regression algorithms were used to predict the price of unlabeled housing entries, and two data cleaning methods were used: one for multiple linear regression and one for all other algorithms.
Multiple Linear Regression Data
Because Multiple Linear Regression (MLR) is not robust to multicollinearity, variables found to be highly correlated with one another were dropped from the dataset for MLR analysis. The 29 least correlated variables had an acceptable correlation with all other variables and were kept; all other variables were dropped.
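One common way to screen for multicollinearity is to drop one variable from every highly correlated pair. The sketch below shows that idea only; the 0.8 threshold, the helper `drop_highly_correlated`, and the synthetic columns are assumptions for illustration, and the exact selection rule used to arrive at 29 variables is not reproduced here.

```python
import numpy as np
import pandas as pd

def drop_highly_correlated(df, threshold=0.8):
    """Greedily drop one column from every pair with |correlation| above threshold."""
    corr = df.corr().abs()
    # Keep only the upper triangle so each pair is examined once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop), to_drop

# Synthetic data for illustration only
rng = np.random.default_rng(0)
area = rng.normal(1500, 300, 200)
toy = pd.DataFrame({
    "GrLivArea": area,
    "TotRmsAbvGrd": area / 250 + rng.normal(0, 0.2, 200),  # nearly a rescaling of area
    "YearBuilt": rng.integers(1950, 2010, 200),
})

reduced, dropped = drop_highly_correlated(toy)
print("dropped:", dropped)
```

Which member of a correlated pair gets dropped depends on column order in this greedy version, so it is worth checking that the retained variable is the more interpretable one.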
All Other Regression Algorithms
The distribution of the Sale Price data appeared slightly right-skewed. Because regression algorithms are used in the analysis, it is favorable for Sale Price to be approximately normally distributed. To correct the skew, the natural log transform was applied to Sale Price. The figures below show the distribution of sale price before and after the transform.
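The effect of the log transform on skew can be checked numerically as well as visually. The sketch below uses synthetic right-skewed prices, not the Ames data, and a small hypothetical `skewness` helper based on the standard moment definition.

```python
import numpy as np

def skewness(values):
    """Sample skewness: third central moment divided by variance ** 1.5."""
    x = np.asarray(values, dtype=float)
    centered = x - x.mean()
    return float(np.mean(centered ** 3) / np.mean(centered ** 2) ** 1.5)

# Synthetic right-skewed prices for illustration only
rng = np.random.default_rng(42)
prices = rng.lognormal(mean=12.0, sigma=0.4, size=1000)

log_prices = np.log(prices)  # natural log transform of the target
print("skew before:", round(skewness(prices), 2))
print("skew after: ", round(skewness(log_prices), 2))

# Predictions made on the log scale are mapped back to dollars with exp
recovered = np.exp(log_prices)
```

A model trained on the log of sale price predicts log prices, so its output has to be exponentiated before being reported or scored in dollars.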
Regression Results
Five regression-based and tree-based algorithms were used to predict the sale price of the housing data.
- Multiple Linear Regression (MLR)
- Ridge Regression
- Lasso Regression
- Elastic Net Regression
- Gradient Boosting
The root mean squared error (RMSE) and R-squared values of each algorithm's training results, after hyperparameter tuning, can be seen in the figure below.
In training, gradient boosting achieved the lowest error, fitting the sale price of validation entries best. In gradient boosting, the most accessible hyperparameter is the number of boosting rounds. The image below shows the RMSE plotted against the number of boost rounds. The mean test RMSE shows diminishing returns as the boost rounds increase past 50. To conserve computing resources, 50 boosting rounds were used in testing.
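For reference, the two metrics named above can be computed directly from true and predicted values. This is a minimal sketch using the standard definitions; the numbers are invented and the function names are illustrative.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between true and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - residual sum of squares / total sum of squares."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1 - ss_res / ss_tot)

# Made-up log sale prices and predictions for illustration only
y_true = [12.0, 12.5, 11.8, 12.2]
y_pred = [12.1, 12.4, 11.9, 12.2]

print("RMSE:", rmse(y_true, y_pred))
print("R^2: ", r_squared(y_true, y_pred))
```

RMSE is reported in the units of the target, so values computed on log sale price are not directly comparable to values computed on dollars.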
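The cutoff above was chosen by reading the plot. As an alternative sketch, the same "diminishing returns" idea can be expressed as a rule: stop at the first round where one more round improves the error by less than a tolerance. The helper name, the tolerance, and the synthetic error curve below are all assumptions for illustration and are not taken from the project.

```python
def rounds_until_plateau(test_rmse, tol=1e-3):
    """Return the number of rounds after which one more round improves RMSE by less than tol.

    test_rmse[k] is the validation error after k + 1 boosting rounds.
    """
    for i in range(1, len(test_rmse)):
        if test_rmse[i - 1] - test_rmse[i] < tol:
            return i  # i rounds have been run at index i - 1
    return len(test_rmse)

# Synthetic, smoothly decaying error curve for illustration only
curve = [0.12 + 0.5 * 0.9 ** k for k in range(100)]

best_rounds = rounds_until_plateau(curve)
print("suggested number of rounds:", best_rounds)
```

Real validation curves are noisier than this one, so a rule like this is usually paired with some patience (several consecutive small improvements) before stopping.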
Variable Importance
A price-per-unit-change analysis was performed on the variables used in MLR. MLR is used here instead of the other algorithms because its coefficients can be read directly as the change in sale price per unit change in a variable. In the other algorithms, penalty hyperparameters shrink the coefficients and, in gradient boosting, tree-based and stochastic elements make the relationship between the independent variables and sale price nonlinear. In MLR, pool quality appeared to receive a strong per-unit importance because of the granularity of the variable itself: of the five highest-ranking variables by per-unit sale price increase, four were related to the pool.
This is because dummy variables were created to represent ordinal data. After eliminating the ordinal data, ground living area and roof material become the most dominant features in the per-unit change analysis.
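A small worked example may help show why dummy variables can dominate a per-unit ranking. In the sketch below the prices are generated from a made-up rule (100 dollars per square foot plus 20,000 dollars for a pool), and ordinary least squares via NumPy recovers those coefficients. None of the numbers come from the Ames data.

```python
import numpy as np

# Made-up houses: living area in square feet and a 0/1 pool dummy
area = np.array([1000, 1500, 2000, 2500, 1200, 1800], dtype=float)
has_pool = np.array([0, 0, 1, 0, 1, 1], dtype=float)

# Invented generating rule, used only so the true coefficients are known
price = 50_000 + 100 * area + 20_000 * has_pool

# Ordinary least squares with an explicit intercept column
X = np.column_stack([np.ones_like(area), area, has_pool])
coef, *_ = np.linalg.lstsq(X, price, rcond=None)
intercept, per_sqft, pool_effect = coef

print("price change per extra square foot:", round(per_sqft, 2))
print("price change for pool dummy 0 -> 1:", round(pool_effect, 2))

# Putting area on a comparable scale: effect of a one-standard-deviation change
per_std_area = per_sqft * area.std()
print("price change per one std of area:  ", round(per_std_area, 2))
```

One "unit" of a dummy variable is the entire jump from absent to present, while one unit of living area is a single square foot, so a raw per-unit ranking tends to favor dummies even when the continuous variable explains more of the price.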
Conclusions
Regression algorithms can be used to accurately predict the price of houses in Ames, Iowa based on the variables recorded in this dataset. Additionally, several features greatly affect the price of a house. For anyone who owns a home in Ames, Iowa and is interested in listing it, it would be worth considering building an addition or updating the roof material before selling to maximize the return on a potential sale.