Data Driven Predictions of House Prices in Ames, Iowa
The skills the author demonstrated here can be learned through taking the Data Science with Machine Learning bootcamp at NYC Data Science Academy.
Ames, Iowa, is a college town centered around Iowa State University. It has many distinct neighborhoods spanning various socio-economic classes, and it represents a good opportunity for real estate investment at multiple financial levels, especially house rentals, thanks to a steady stream of potential renters from the nearby university. When market conditions are right, or the prospect of being a landlord is no longer attractive, we want to make sure our investment returns a high value. Machine learning can help us choose homes with a high resale value by creating a predictive model of sale price based on house features. To use machine learning, we need data.
Luckily, we have a robust dataset of housing prices in this area spanning 2006-2010, including sale price and about 80 different metrics, from house size to neighborhood, number of rooms, and even shingle type.
Although this dataset is very useful, it is not perfect: some data is missing, and some observations are encoded incorrectly. About a dozen features use NaN (not a number) to signify that a feature is absent rather than unrecorded. It is a trivial matter to reencode these values to 'None'.
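A minimal sketch of this reencoding, using a toy stand-in for the Ames data (the column names follow the Ames data dictionary; the list of affected columns is an assumption for illustration):

```python
import numpy as np
import pandas as pd

# Toy stand-in: in the real data, columns such as 'PoolQC' and 'Fence'
# use NaN to mean "feature not present" rather than "value unknown".
df = pd.DataFrame({
    'PoolQC': ['Ex', np.nan, np.nan],
    'Fence':  [np.nan, 'GdPrv', np.nan],
})

# NaN here means "absent", so reencode to an explicit 'None' category.
absent_means_none = ['PoolQC', 'Fence']
df[absent_means_none] = df[absent_means_none].fillna('None')
```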
Other observations have genuinely missing data. Although the simplest approach to training a predictive model is to drop these records, the more data we train our model on, the more accurate it will become.
There are seven features remaining with missing data; we can tackle them one by one.
A little investigation reveals that how a lot is zoned is highly correlated with the neighborhood that lot is in. In other words, within most neighborhoods, a majority of lots fall under a single zoning category. Therefore, we can confidently fill in the missing zoning data with the most common zone in that neighborhood.
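This group-wise mode imputation might look like the following sketch (column names from the Ames data dictionary; the sample values are invented):

```python
import numpy as np
import pandas as pd

# Hypothetical slice: 'MSZoning' is missing for one house,
# but its neighborhood is known.
df = pd.DataFrame({
    'Neighborhood': ['OldTown', 'OldTown', 'OldTown', 'CollgCr'],
    'MSZoning':     ['RM', 'RM', np.nan, 'RL'],
})

# Fill each missing zone with the most common zone in that neighborhood.
df['MSZoning'] = df.groupby('Neighborhood')['MSZoning'].transform(
    lambda s: s.fillna(s.mode().iloc[0])
)
```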
There are two missing values in the feature describing the utilities hookup of the house. Simple observation reveals all other houses have a 'public' utilities hookup. Therefore, the probability that these two missing values are anything other than 'public' is extremely low.
Correlation analysis reveals that kitchen quality maps almost perfectly to the overall quality of the house. If a house is listed as 'good' quality, over 95% of the time the kitchen quality will also be 'good', and the same holds for all other categories. We can therefore substitute the overall house quality for any missing values in kitchen quality.
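One way to sketch this substitution: the Ames overall-quality field is a numeric 1-10 scale while kitchen quality is categorical, so a mapping between the two scales is needed. The particular bucketing below is an assumption, not taken from the source:

```python
import numpy as np
import pandas as pd

# Assumed mapping from the numeric 1-10 overall-quality scale
# to the categorical quality codes used for kitchens.
overall_to_cat = {1: 'Po', 2: 'Po', 3: 'Fa', 4: 'Fa', 5: 'TA',
                  6: 'TA', 7: 'Gd', 8: 'Gd', 9: 'Ex', 10: 'Ex'}

df = pd.DataFrame({
    'OverallQual': [7, 5, 9],
    'KitchenQual': ['Gd', np.nan, np.nan],
})

# Fill missing kitchen quality from the mapped overall quality.
df['KitchenQual'] = df['KitchenQual'].fillna(df['OverallQual'].map(overall_to_cat))
```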
The 'functional' feature in our dataset describes the overall functionality of the property. Over 95% of the other observations list functionality as 'typical'. The records with missing data also have an overall house condition of good or better, so reason dictates it is highly likely that these houses have 'typical' functionality as well.
Garage Car Capacity and Garage Area
There are also two values missing from both of these features. In all, our dataset contains five variables relating to garages, and a simple inspection of these two observations reveals these houses do not have a garage on the property. We can encode these missing values to reflect that.
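Encoding "no garage" might look like this sketch, with numeric garage fields set to 0 and categorical ones set to 'None' (column names from the Ames data dictionary; values invented):

```python
import numpy as np
import pandas as pd

# One house with a garage, one without (missing values across all
# garage-related fields indicate no garage on the property).
df = pd.DataFrame({
    'GarageCars': [2.0, np.nan],
    'GarageArea': [480.0, np.nan],
    'GarageType': ['Attchd', np.nan],
})

# Numeric garage features: absence means zero cars / zero square feet.
df[['GarageCars', 'GarageArea']] = df[['GarageCars', 'GarageArea']].fillna(0)
# Categorical garage features: absence becomes an explicit 'None' level.
df['GarageType'] = df['GarageType'].fillna('None')
```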
Lot frontage is described as "Linear feet of street connected to property". Over 15% of our data contains missing values for this feature. Since this is a numerical feature instead of a categorical one, we have to take special steps to impute it. We find that lot frontage is correlated with lot area, so we can find the average ratio of lot frontage to lot area and use this value to impute the missing lot frontage values.
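A sketch of this ratio-based imputation (sample numbers are invented; the real computation would run over all complete rows):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'LotArea':     [8450, 9600, 11250, 10000],
    'LotFrontage': [65.0, 80.0, 70.0, np.nan],
})

# Average frontage-to-area ratio, computed from complete rows only
# (NaN rows drop out of the mean automatically).
ratio = (df['LotFrontage'] / df['LotArea']).mean()

# Impute missing frontage as (lot area) x (average ratio).
df['LotFrontage'] = df['LotFrontage'].fillna(df['LotArea'] * ratio)
```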
Feature Engineering and Dimensionality Reduction for the Data
It is informative to look at the distribution of the sale prices of the houses in our dataset.
Our predictions will be more accurate if our model's target variable, sale price, is normally distributed. We can perform a Box-Cox transformation on the sale price, and reverse the transform after making predictions to get interpretable values.
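The transform-and-invert round trip can be sketched with scipy (a lognormal sample stands in for the right-skewed sale prices):

```python
import numpy as np
from scipy.stats import boxcox
from scipy.special import inv_boxcox

# Right-skewed stand-in for the sale-price distribution.
prices = np.random.default_rng(0).lognormal(mean=12, sigma=0.4, size=500)

# Box-Cox transform; lam is the fitted lambda, which must be kept
# in order to invert the transform later.
transformed, lam = boxcox(prices)

# After modeling on the transformed target, predictions are mapped
# back to interpretable dollar values.
recovered = inv_boxcox(transformed, lam)
```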
There are three features related to indoor living area: first-floor, second-floor, and basement square footage. Plotting these variables against sale price reveals a bimodal pattern. Many observations have second-floor or basement square footage of zero, and these values skew the linear relationship between square footage and sale price.
The outdoor square footage variables show the same behavior.
This bimodal nature also holds true for the number of different types of bathrooms.
We can combine these three sets of related variables into three distinct variables to discover the true relationship with sale price.
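The combinations can be sketched as below (column names follow the Ames data dictionary; counting half baths as 0.5 is a modeling assumption, not stated in the source; the four porch/deck columns would be summed the same way as the indoor ones):

```python
import pandas as pd

df = pd.DataFrame({
    '1stFlrSF': [856, 1262], '2ndFlrSF': [854, 0], 'TotalBsmtSF': [856, 1262],
    'FullBath': [2, 2], 'HalfBath': [1, 0],
    'BsmtFullBath': [1, 0], 'BsmtHalfBath': [0, 1],
})

# Single indoor-living-area feature.
df['TotalSF'] = df['1stFlrSF'] + df['2ndFlrSF'] + df['TotalBsmtSF']

# Single bathroom count; half baths weighted 0.5 (assumption).
df['TotalBath'] = (df['FullBath'] + df['BsmtFullBath']
                   + 0.5 * (df['HalfBath'] + df['BsmtHalfBath']))
```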
The combined indoor square footage shows a very nice linear relationship to sale price, making it very useful in modeling house prices.
The bimodal nature (observations with zero outdoor square footage) is still present, but significantly reduced by combining all four outdoor square footage variables.
The relationship is not as clear with total number of bathrooms, further analysis will be needed to determine if this variable will be useful in modeling.
We have just eliminated nine variables without losing any information, potentially improving the predictive capabilities of our models.
Quality and Condition
There are a dozen variables in our dataset related to the quality and condition of various features of the house. Although it is natural to consider the quality and condition of an item before purchasing it, the very process of applying these metrics is subjective. There is a high potential for random influence based on individual preference.
Let's start by looking at the relationship of the six quality variables with sale price.
Garage and pool quality seem to have very little to do with the final sale price of a house, even fireplace quality is on the edge of usefulness. We can explore this relationship in a little more detail visually.
A linear relationship is evident in the features with high correlation to sale price; I feel confident in dropping the remainder.
We can perform the same analysis on the features relating to the condition of the house.
These features do not look useful at all for modeling purposes. Perhaps a visual inspection of condition versus sale price will reveal hidden relationships.
The flatness of the trendlines for these features further cements the idea that the condition variables are not useful. Perhaps it is true that the condition of a feature is too subjective to be useful.
Further Dimensionality Reduction
Using the same methods illustrated above (bivariate and correlation analysis), we can eliminate further uninformative features:
- Number of Bedrooms
- Number of Kitchens
- Finished Square Feet
- Garage Year Built
- Month Sold
- Lot Frontage
- Pool Area
We have successfully eliminated nearly 50% of the features from our data. This will result in a much more stable model.
Variable Redundancy of Data
Exploratory data analysis serves many functions: increasing subject competency, finding obvious relationships between variables, and sometimes discovering data redundancy. This particular dataset has two sets of variables that describe the same feature.
Garage Area and Number of Cars
These two variables describe the same thing, the size of the garage. To confirm this we can calculate the variance inflation factor for these features.
A high VIF indicates a redundant variable. The only question is, which one to keep?
A simple plot of these two variables against sale price shows a linear relationship for both. The trendline analysis also shows similar R-squared for both relationships, leading to the conclusion it does not matter which variable we keep. I decided to keep the number of cars variable, simply because it is simpler.
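The VIF check can be sketched without any specialized library: the VIF of a feature is 1/(1 - R^2), where R^2 comes from regressing that feature on all the others. Synthetic garage data stands in for the real columns here:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def vif(X, j):
    """VIF of column j: 1 / (1 - R^2) from regressing it on the rest."""
    others = np.delete(X, j, axis=1)
    r2 = LinearRegression().fit(others, X[:, j]).score(others, X[:, j])
    return 1.0 / (1.0 - r2)

rng = np.random.default_rng(1)
cars = rng.integers(0, 4, size=200).astype(float)       # garage car capacity
area = 250 * cars + rng.normal(0, 20, size=200)         # area tracks capacity
noise = rng.normal(size=200)                            # unrelated feature

X = np.column_stack([cars, area, noise])
# vif(X, 0) and vif(X, 1) are large (redundant pair); vif(X, 2) is near 1.
```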
Year House Built and House Remodel Date
When calculating the variance inflation factor for the previous example, I discovered a mystery.
A VIF in the tens of thousands is an oddity. Plotting a histogram of the two variables reveals an even stranger occurrence.
The two variables show the exact same distribution (besides a spike at 1950). A little more analysis shows that if a house was never remodeled, the relevant data field is populated with the year the house was built. If the house was built before 1950 and never remodeled, the value '1950' is encoded.
Filtering the data to only accurate values of the year a house was remodeled shows there is no significant linear relationship between this variable and sale price. So that we do not eliminate potentially relevant information, I added a binary indicator of whether a house was remodeled and eliminated the year-remodeled field.
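Deriving that binary flag might look like the sketch below. The exclusion of pre-1950 houses coded '1950' follows from the encoding quirk described above, though treating every such row as not-remodeled is an assumption:

```python
import pandas as pd

df = pd.DataFrame({
    'YearBuilt':    [1939, 1965, 2004],
    'YearRemodAdd': [1950, 1998, 2004],
})

# Remodeled = remodel year differs from build year, EXCEPT the
# pre-1950 houses whose never-remodeled value was coded as 1950.
df['Remodeled'] = ((df['YearRemodAdd'] != df['YearBuilt'])
                   & ~((df['YearRemodAdd'] == 1950)
                       & (df['YearBuilt'] < 1950))).astype(int)

# The raw remodel-year column is then dropped.
df = df.drop(columns='YearRemodAdd')
```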
Trimming Categorical Features
Mathematical analysis of numerical features is inherently simpler than of categorical ones. In order to determine the usefulness of nominally encoded features, we can encode them numerically and perform a chi-squared test.
A low chi-squared score alone is not necessarily an accurate means of determining the usefulness of a feature, but the five lowest-scoring features do not logically seem to be good indicators of house price. I feel safe in dropping them.
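The scoring step can be sketched with sklearn's chi2, here on two invented binary-encoded features against a crude price band; in the real workflow, sale price would be binned and all encoded categoricals scored at once:

```python
import numpy as np
from sklearn.feature_selection import chi2

rng = np.random.default_rng(0)
n = 300

# Nearly constant feature (stand-in for something like street type).
street = np.zeros(n, dtype=int)
street[:5] = 1
rng.shuffle(street)

# Feature that actually tracks the (binned) target.
central_air = rng.integers(0, 2, n)
price_band = central_air + rng.integers(0, 2, n)  # crude price bins 0-2

X = np.column_stack([street, central_air])
scores, pvals = chi2(X, price_band)
# scores[1] >> scores[0]: the informative feature dominates.
```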
Using Lasso Coefficients for Dimensionality Reduction
Now that our data has been reduced to its most logical and useful features through manual analysis we can attempt to create a stable Lasso model to further simplify. Lasso's implementation of a penalty term in linear regression will force non-relevant features' beta coefficients to zero. This indicates they will not be useful in our predictive modeling.
We can plot the lasso coefficients to approximate the impact each variable will have on our final model.
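Fitting a cross-validated Lasso and reading off the coefficients might look like this sketch on synthetic data, where only the first two of five features carry signal:

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 400
X = rng.normal(size=(n, 5))
# Only the first two features drive the target; the rest are noise.
y = 3.0 * X[:, 0] + 1.5 * X[:, 1] + rng.normal(scale=0.5, size=n)

# Scale, then let cross-validation pick the penalty strength.
Xs = StandardScaler().fit_transform(X)
lasso = LassoCV(cv=5, random_state=0).fit(Xs, y)

# Coefficients of irrelevant features are shrunk toward zero;
# these are the values one would plot to rank feature impact.
coef = lasso.coef_
```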
Our data exploration has shown a strong linear relationship between many variables and sale price. This indicates linear regression may be an appropriate tactic for predicting house price.
After scaling our numerical features with sklearn's standard scaler, dummifying the categorical features with pandas, and creating a 70/30 split of train and test data, we can begin to fit linear models to our data.
All hyperparameters will be determined through a thorough cross-validated grid search and verified by scoring the fit on the test data set.
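Putting those steps together might look like the following sketch. The feature names and alpha grid are illustrative assumptions; a tiny synthetic frame stands in for the Ames data:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 300
df = pd.DataFrame({
    'TotalSF': rng.normal(1500, 400, n),
    'OverallQual': rng.integers(1, 11, n),
    'Neighborhood': rng.choice(['OldTown', 'CollgCr', 'NAmes'], n),
})
y = 50 * df['TotalSF'] + 10000 * df['OverallQual'] + rng.normal(0, 5000, n)

# Dummify categoricals, scale numerics, 70/30 train/test split.
X = pd.get_dummies(df, columns=['Neighborhood'], drop_first=True)
X[['TotalSF', 'OverallQual']] = StandardScaler().fit_transform(
    X[['TotalSF', 'OverallQual']])
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# Hyperparameters via cross-validated grid search,
# verified by scoring on the held-out test set.
grid = GridSearchCV(Lasso(max_iter=10000), {'alpha': [0.1, 1, 10, 100]}, cv=5)
grid.fit(X_train, y_train)
test_r2 = grid.score(X_test, y_test)
```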
Plotting predicted versus actual sale price shows a normal distribution of residuals, a Q-Q plot shows a dense straight line for the bulk of our data, and an R-squared of .921 on the test data set signifies a reasonably good fit. The mean absolute error of $15,710 is the average dollar amount each prediction is off by.
The next logical step after Lasso regression is to attempt an Elastic Net model.
There is a slight improvement over the lasso model: a .005 increase in R-squared and a reduction in mean absolute error of $442. This is most likely the limit of the accuracy of strictly linear models.
The next step is tree-based models. The preprocessing steps are slightly different for these models: pandas dummification is not needed, and sklearn's one-hot encoding is used instead to numerically encode the categorical features.
A random forest model is strictly worse than our elastic net predictions, with a lower R-squared and higher mean absolute error.
This does not necessarily mean tree-based models will not fit our data well; we can try different frameworks, for example gradient-boosted tree models.
XGBoost seems to give the best fit to our data, with the highest R-squared and lowest mean absolute error.
One benefit of tree-based models is the ease with which we can extract feature importance.
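Extracting importances can be sketched as below. XGBoost exposes the same `feature_importances_` attribute; sklearn's GradientBoostingRegressor stands in here so the example is self-contained, and the synthetic features are invented stand-ins for total square footage and overall quality:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
n = 500
total_sf = rng.normal(1500, 400, n)
overall_qual = rng.integers(1, 11, n).astype(float)
noise_feat = rng.normal(size=n)          # carries no signal

X = np.column_stack([total_sf, overall_qual, noise_feat])
y = 50 * total_sf + 10000 * overall_qual + rng.normal(0, 5000, n)

model = GradientBoostingRegressor(random_state=0).fit(X, y)

# Importances sum to 1; the noise feature should rank near zero.
importances = model.feature_importances_
```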
Our tree-based models show the same ranking of importance for our numerical features and illustrate that the overall quality and total square footage are the two most influential features to sale price.
Whether the house has central air conditioning and the neighborhood it is located in are the most important categorical features for home sale price.
Our modeling shows that in order to maximize house resale value we need a large house with good overall quality and central air conditioning in the right neighborhood.