Statistical Data Inference on Housing Prices
The skills I demonstrated here can be learned through the Data Science with Machine Learning bootcamp at NYC Data Science Academy.
Background and Motivation
Home buyers and home sellers are often faced with the difficult task of purchasing or selling a house at a good value. But how can we determine the value of a house? What features of a house are most important when trying to estimate its price? What is the dollar impact on price of one particular feature compared to another? Answering these questions would make the lives of home buyers and home sellers a lot easier. In this project, I will try to use data to answer these questions.
The dataset used for this project is taken from Kaggle. The dataset was already broken up into a train and test set, each set containing 79 feature variables and 1,460 observations. Each observation in the dataset describes the sale of a house in Ames, Iowa from 2006 to 2010. The dataset contains feature variables that are numeric, ordinal categorical, and nominal categorical.
Getting to Know the Data
To get a feel for the data I was working with, I started with some basic exploratory analysis. First, I wanted to see if any values in the dataset were missing. The plot below shows the columns in the training set that contained missing values, along with the number of missing values in each.
While there may appear to be a large number of missing values, many of the columns have missing values that represent houses that did not have a particular feature. For example, the missing values for pool quality ("PoolQC") actually just represent houses that do not have pools. This is also the case for feature variables like fireplace, alley, fence, etc. For values such as these, I either replaced them with 0 or dropped them from the dataset. Feature variables whose missing values had other causes were imputed using a random forest model. The inputs to the random forest model were all other independent variables in the dataset.
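The missing-value check and the "missing means absent" replacement described above can be sketched roughly as follows. This is a minimal illustration on a made-up four-row frame, not the original notebook: the values are invented, and the sketch fills with the label "None" (an assumption made here to keep the columns categorical) where the project itself used 0 or dropped the values.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the training data; all values are invented.
df = pd.DataFrame({
    "PoolQC": [np.nan, "Gd", np.nan, np.nan],
    "Fence": [np.nan, np.nan, "MnPrv", np.nan],
    "LotFrontage": [65.0, np.nan, 80.0, 70.0],
})

# Count missing values per column, keeping only columns that have any.
missing_counts = df.isna().sum()
missing_counts = missing_counts[missing_counts > 0].sort_values(ascending=False)

# In these columns a missing value means "the house lacks this feature",
# so it gets an explicit label instead of being imputed.
# ("None" is an illustrative choice; the write-up used 0 or dropped them.)
absent_cols = ["PoolQC", "Fence"]
df[absent_cols] = df[absent_cols].fillna("None")

print(missing_counts)
```

Columns such as "LotFrontage", where a gap is a genuinely unknown value, are left untouched here and would go to the imputation step instead.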
Scatterplots of Numeric Features Data
The plots below are some of the numeric features in our dataset plotted against the target variable sale price.
Looking at the plots above, we can see that many of the independent variables in our dataset appear to be positively correlated with sale price. The plots also reveal outliers: some points fall very far away from the rest.
Feature Value Counts Data
The plots below show the value counts for some of the feature variables in the dataset.
Looking at the plots above, we can see that some of the feature variables have value counts that are very imbalanced, where one value appears a lot more frequently than the other values. It may be worth it to drop such features as they will likely not provide too much information.
Distribution of the Target Variable
Below is a plot of the distribution of the target variable sale price.
Looking at the plot above, we can see that the distribution of sale price is a little right-skewed. Log-transforming the sale price column makes its distribution closer to normal, which in turn tends to make the residuals of our regression models closer to normal as well. Normally distributed residuals are an assumption behind the standard errors and significance tests of linear regression, so meeting this assumption improves the reliability of our coefficient estimates and can also help prediction accuracy.
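As a small illustration of the effect described above, the sketch below compares the sample skewness of a right-skewed price series before and after a log transform. The prices are invented for the example and are not taken from the Ames data.

```python
import numpy as np
import pandas as pd

# Invented, right-skewed sale prices (a few expensive houses in the tail).
prices = pd.Series(
    [120000, 135000, 150000, 165000, 180000, 210000, 260000, 400000, 755000],
    name="SalePrice",
)

# Natural log of the target; np.exp() undoes it when predictions are needed in dollars.
log_prices = np.log(prices)

skew_before = prices.skew()
skew_after = log_prices.skew()
print(f"skew before: {skew_before:.2f}, after log: {skew_after:.2f}")
```

The skewness after the transform is smaller, which is the behavior the log transform is used for here.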
Data on Feature Engineering
I also created additional feature variables by combining other variables in the dataset. One such variable was the total square footage of the house, which I created by adding up the square footage of all floors and all outside property of the house. I also combined the full-bath and half-bath variables by adding one-half the number of half-baths to the number of full-baths. In addition, I generated binary variables that were 1 if a house had a particular feature and 0 if it did not. For example, I created a column called 'hasPool' where a value of 0 represented a house with no pool and a value of 1 represented a house with a pool.
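A rough sketch of these three engineered features is shown below. The column names follow the Kaggle data dictionary, but the exact set of columns summed into the total square footage is an assumption (the write-up also included outside property), and "TotalSF" and "TotalBath" are hypothetical names; only 'hasPool' comes from the text.

```python
import pandas as pd

# Two invented houses, using Kaggle-style column names.
houses = pd.DataFrame({
    "TotalBsmtSF": [856, 1262],
    "1stFlrSF": [856, 1262],
    "2ndFlrSF": [854, 0],
    "FullBath": [2, 2],
    "HalfBath": [1, 0],
    "PoolArea": [0, 512],
})

# Total square footage: sum of the floor-area columns (assumed subset).
houses["TotalSF"] = houses[["TotalBsmtSF", "1stFlrSF", "2ndFlrSF"]].sum(axis=1)

# Bathrooms: each half-bath counts as 0.5 of a full bath.
houses["TotalBath"] = houses["FullBath"] + 0.5 * houses["HalfBath"]

# Binary indicator: 1 if the house has a pool, 0 otherwise.
houses["hasPool"] = (houses["PoolArea"] > 0).astype(int)

print(houses[["TotalSF", "TotalBath", "hasPool"]])
```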
Data on Feature Selection
Looking at the image below, we see pairs of variables that have the highest absolute correlations. There is a high amount of multicollinearity in our data.
While multicollinearity might not be a huge issue for prediction purposes, it will definitely be an issue for statistical inference purposes. Multicollinearity will increase the variance of our coefficient estimates and make them unreliable. Thus, we want to select features in a way where multicollinearity is reduced and only the most important variables are chosen.
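One way to list the most strongly correlated feature pairs is sketched below. It uses synthetic data in which a room count is deliberately tied to living area; the column names are borrowed from the Kaggle dataset, but the numbers are generated and carry no information about the real data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200
area = rng.normal(1500, 400, n)

# Synthetic features: room count is built from area, year built is independent.
X = pd.DataFrame({
    "GrLivArea": area,
    "TotRmsAbvGrd": area / 250 + rng.normal(0, 0.5, n),
    "YearBuilt": rng.integers(1900, 2010, n).astype(float),
})

# Absolute pairwise correlations.
corr = X.corr().abs()

# Keep only the upper triangle so each pair appears once and the diagonal is dropped.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack().dropna().sort_values(ascending=False)

top_pair = pairs.index[0]
print(pairs)
```

Pairs at the top of this list are candidates for keeping only one of the two variables when the goal is inference.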
I used scikit-learn's SelectKBest to test the individual effect of each feature variable. The correlation of each variable with the target variable was computed and then converted to an F score. The sorted bar chart below shows the returned scores. Higher scores indicate a variable that is more important.
From the above plot, we can see that variables related to the house's square footage appear to be most important for predicting sale price.
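The univariate scoring step can be sketched as follows. The example assumes `f_regression` as the score function, since it matches the description of converting a correlation to an F score; the write-up does not name it. The data is synthetic, with one informative feature and one noise feature.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression

rng = np.random.default_rng(0)
n = 300

# Synthetic data: price depends on square footage, not on the noise column.
total_sf = rng.normal(2500, 600, n)
noise_feature = rng.normal(0, 1, n)
X = np.column_stack([total_sf, noise_feature])
y = 50 * total_sf + rng.normal(0, 20000, n)

# Score each feature on its own and keep the single best one.
selector = SelectKBest(score_func=f_regression, k=1).fit(X, y)
scores = selector.scores_          # one F score per feature
selected = selector.get_support()  # boolean mask of kept features

print(scores, selected)
```

Because each feature is scored in isolation, this method says nothing about redundancy between features, which is why it is combined with model-based scores.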
In addition to the univariate selection method above, I also used the importance scores returned by models such as ridge regression, lasso regression, and gradient boosting regression. For ridge and lasso regression, the magnitudes of the returned coefficients are used as the importance scores. For gradient boosting regression, importance scores are based on how much a variable reduced variance during decision tree building. The scores returned for all three models are shown below.
The above plots generally choose the same features as being important. Variables such as total square footage, overall quality, and the year a house was built appear to be important in determining price.
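A minimal sketch of model-based importance scores, assuming standardized inputs for the lasso so that coefficient magnitudes are comparable. The data is synthetic (column 0 matters most, column 2 is pure noise), and the `alpha` value is an arbitrary illustrative choice.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 3))
# Column 0 has the largest effect, column 1 a small one, column 2 none.
y = 3.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(0, 0.1, 300)

# Lasso: absolute coefficient on standardized features as the importance score.
X_std = StandardScaler().fit_transform(X)
lasso = Lasso(alpha=0.1).fit(X_std, y)
lasso_importance = np.abs(lasso.coef_)

# Gradient boosting: impurity-based importances, which sum to 1.
gbr = GradientBoostingRegressor(random_state=0).fit(X, y)
gbr_importance = gbr.feature_importances_

print(lasso_importance, gbr_importance)
```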
Tree Based Models
Encoding Categorical Columns
In preparing my data for the tree based models, I ordinal encoded all categorical variables so that each value in a column was represented by an integer.
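Ordinal encoding can be sketched with pandas alone, as below. The quality ordering (Po < Fa < TA < Gd < Ex) follows the Kaggle data dictionary; the "_enc" column names are hypothetical, and for the nominal column the integer codes are simply assigned in alphabetical order, which is an assumption about how the encoding was done.

```python
import pandas as pd

df = pd.DataFrame({
    "ExterQual": ["TA", "Gd", "Ex", "Fa"],
    "Neighborhood": ["NAmes", "CollgCr", "NAmes", "OldTown"],
})

# Ordered category: map each quality level to its rank.
quality_order = ["Po", "Fa", "TA", "Gd", "Ex"]
quality_rank = {level: rank for rank, level in enumerate(quality_order)}
df["ExterQual_enc"] = df["ExterQual"].map(quality_rank)

# Unordered category: arbitrary integer codes (alphabetical by category name).
df["Neighborhood_enc"] = df["Neighborhood"].astype("category").cat.codes

print(df)
```

Tree based models split on thresholds, so arbitrary integer codes for nominal columns are usually tolerable there, unlike in linear models.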
After ordinal encoding the categorical variables, I removed outliers using Cook's distance. Cook's distance essentially measures the influence of a data point. More specifically, it tells us how much a regression model changes when an observation is deleted. A higher Cook's distance indicates that deleting a particular point changes the regression model a lot. I removed all observations with a Cook's distance greater than one. The deleted observations are shown below.
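The Cook's distance filter described above can be sketched with NumPy using the standard formula D_i = (e_i^2 / (p * s^2)) * h_ii / (1 - h_ii)^2, where e_i is the residual, h_ii the leverage, p the number of fitted parameters, and s^2 the mean squared error. The helper name `cooks_distance` and the data are made up for illustration; the original analysis may well have used a library routine instead.

```python
import numpy as np

def cooks_distance(X, y):
    """Cook's distance of every observation for an OLS fit of y on X (intercept added)."""
    X1 = np.column_stack([np.ones(len(X)), X])
    n, p = X1.shape
    # Hat matrix; its diagonal holds each observation's leverage.
    H = X1 @ np.linalg.pinv(X1.T @ X1) @ X1.T
    h = np.diag(H)
    resid = y - H @ y
    mse = resid @ resid / (n - p)
    return (resid ** 2 / (p * mse)) * h / (1 - h) ** 2

rng = np.random.default_rng(0)
x = rng.normal(0, 1, 50)
y = 2 * x + rng.normal(0, 0.5, 50)

# Append one invented point that is far from the others and off the trend.
x = np.append(x, 10.0)
y = np.append(y, -20.0)

d = cooks_distance(x.reshape(-1, 1), y)
keep = d <= 1  # same threshold as in the text
x_clean, y_clean = x[keep], y[keep]
print(f"removed {np.sum(~keep)} of {len(d)} observations")
```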
The two tree based models I used were a random forest regressor and a gradient boosting regressor. Using 5-fold cross validation, the random forest regressor resulted in a mean r-squared of 0.892. The standard deviation of the r-squared across all 5 folds was 0.019. The gradient boosting regressor resulted in a mean r-squared of 0.910 and a standard deviation of 0.019.
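A minimal sketch of how a mean and standard deviation of r-squared over 5 folds can be obtained with scikit-learn. The data is synthetic and the hyperparameters are defaults chosen for illustration, so the scores it prints are unrelated to the figures reported above.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
# Synthetic target with a linear and a nonlinear component.
y = 3 * X[:, 0] + X[:, 1] ** 2 + rng.normal(0, 0.3, 200)

model = RandomForestRegressor(n_estimators=100, random_state=0)

# One r-squared value per fold.
scores = cross_val_score(model, X, y, cv=5, scoring="r2")
mean_r2, std_r2 = scores.mean(), scores.std()
print(f"mean r-squared: {mean_r2:.3f}, std: {std_r2:.3f}")
```

Swapping in `GradientBoostingRegressor` for the model gives the corresponding comparison for the second tree based model.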
Linear Models
Encoding Categorical Columns
In preparing my data for the linear models, I ordinal encoded all ordinal categorical columns and dummy encoded all nominal categorical columns. Dummy encoding a nominal categorical column creates a new column for each category in the variable. Each new column contains a 1 if the observation belongs to that category and a 0 otherwise. I dropped one dummy column for each variable to avoid perfect multicollinearity among the dummy columns.
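Dummy encoding with one category dropped can be sketched with `pandas.get_dummies`. The "Foundation" column name comes from the Kaggle dataset; the four rows are invented.

```python
import pandas as pd

df = pd.DataFrame({"Foundation": ["PConc", "CBlock", "BrkTil", "PConc"]})

# One 0/1 column per category; drop_first removes the first category
# (alphabetically "BrkTil"), which becomes the implicit baseline.
dummies = pd.get_dummies(
    df["Foundation"], prefix="Foundation", drop_first=True, dtype=int
)

print(dummies)
```

A row with zeros in every remaining dummy column belongs to the dropped baseline category, so no information is lost.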
Scaling the Data
To make the coefficient estimates comparable as measures of importance, I standardized all the feature variables in the dataset so that each column had a mean of 0 and a standard deviation of 1.
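Standardization can be sketched with scikit-learn's `StandardScaler`, shown here on a tiny invented matrix with one large-scale column (square footage) and one small-scale column (a count).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Invented features: total square footage and number of fireplaces.
X = np.array([
    [1500.0, 1.0],
    [2000.0, 2.0],
    [2500.0, 3.0],
    [3000.0, 2.0],
])

scaler = StandardScaler()
X_std = scaler.fit_transform(X)

# The fitted scaler keeps each column's mean and standard deviation,
# which are needed later to convert coefficients back to original units.
print(scaler.mean_, scaler.scale_)
```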
Log Transforming Sale Price
As mentioned earlier, the distribution of the target variable sale price was slightly right-skewed as shown below.
When running a regression model on sale price, and then plotting the fitted values against the residuals, we obtain the results shown below.
The plot above shows a pattern in the residuals, which indicates that the linear regression assumptions (constant variance and normally distributed residuals) are not well met. Running a regression model after we perform a log transformation of the target variable results in the residual plot below.
The plot above shows no clear pattern in the residuals, so the model better meets the regression assumptions, including that the residuals follow a normal distribution. Thus, I log transformed the target variable sale price.
The same process described above for removing outliers using Cook's distance was also applied here.
For all linear models, I used a subset of features that reduced multicollinearity and kept only important features.
I tested numerous linear models. The linear model I believed to be the best in terms of prediction accuracy and reliability of coefficient estimates was a linear regression model that used the features selected by lasso regression. The model had a mean r-squared of 0.914. The standard deviation of the r-squared across all 5 folds of cross validation was 0.009. The coefficient estimates are shown below.
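The two-step idea of letting lasso choose the features and then fitting an ordinary linear regression on them can be sketched as follows. This is an assumption-laden illustration on synthetic data: `LassoCV` is used here to pick the penalty, which the write-up does not specify, and the selection is done once on the full data for brevity, which lets a little information leak into the cross validation scores.

```python
import numpy as np
from sklearn.linear_model import LassoCV, LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 6))
# Synthetic log price: only columns 0 and 1 have an effect.
log_price = 12 + 0.3 * X[:, 0] + 0.1 * X[:, 1] + rng.normal(0, 0.05, 300)

# Step 1: lasso with a cross-validated penalty; nonzero coefficients mark kept features.
lasso = LassoCV(cv=5, random_state=0).fit(X, log_price)
selected = np.flatnonzero(lasso.coef_)

# Step 2: plain linear regression on the kept features, scored with 5-fold CV.
ols_scores = cross_val_score(
    LinearRegression(), X[:, selected], log_price, cv=5, scoring="r2"
)
print(selected, ols_scores.mean(), ols_scores.std())
```

Refitting without the lasso penalty gives unshrunk coefficients, which are easier to interpret than the lasso estimates themselves.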
Each coefficient estimate can be interpreted as the increase in the log sale price for a one standard deviation increase in the explanatory variable. Because this interpretation is somewhat difficult to understand, I converted these results into actual dollar impacts, which are shown in the table below.
The values in the right columns can be interpreted as the dollar impact of a one-unit change in each explanatory variable on the average house price, holding the other variables fixed. For example, a one-square-foot increase in a house's total square footage is associated with an increase in the average house price of around 19 dollars.
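One plausible way to turn a standardized, log-scale coefficient into a dollar figure is sketched below; the write-up does not spell out its conversion, so this is an assumption. A coefficient beta on a standardized feature with standard deviation sigma corresponds to a change of beta / sigma in log price per original unit, and multiplying the average price by exp(beta / sigma) - 1 gives an approximate dollar change. The function name and all three input numbers are hypothetical and are not the model's actual estimates.

```python
import numpy as np

def dollar_impact(beta_std, feature_std, mean_price):
    """Approximate dollar change in the average price for a one-unit feature increase.

    beta_std:    coefficient on the standardized feature (log-price scale)
    feature_std: standard deviation of the feature in its original units
    mean_price:  average sale price in dollars
    """
    per_unit_log_change = beta_std / feature_std
    # expm1(x) = exp(x) - 1, the proportional change in price.
    return mean_price * np.expm1(per_unit_log_change)

# Hypothetical inputs chosen only to show the arithmetic.
impact = dollar_impact(beta_std=0.08, feature_std=800.0, mean_price=180000.0)
print(f"approximate impact per unit: {impact:.2f} dollars")
```

Because the model is linear in log price, the dollar effect of a unit change depends on the price level at which it is evaluated; using the average price is one common convention.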
As one might expect, a more recently built house sells for more than an otherwise similar older house. The number of fireplaces in a house and the house's heating quality also appear to be important predictors of its price. Since Iowa can get very cold in the winter, these features may be important to people there. The most important feature for determining house price is the total square footage, with each additional square foot increasing the average house price by around 19 dollars.