The Price is Right
Predicting home sale prices in Ames, Iowa using machine learning
An introduction to the data
"Ask a home buyer to describe their dream house, and they probably won't begin with the height of the basement ceiling or the proximity to an east-west railroad." (Kaggle.com)
As one begins a data science journey, it is nearly a rite of passage to become familiar with the Boston Housing dataset, which has long been used to train and test students on regression concepts; it frequently serves as the final assignment in the regression segment of Statistical Methods courses. As time passes, however, the Boston data grows less relevant to contemporary markets, the most obvious sign being the steep increase in housing prices in the three decades since its release. The dataset is also limited in size and scope, and lacks additional information that could be useful for feature engineering.
In response to the need for a more contemporary dataset that could introduce a greater variety of machine learning techniques to a new generation of data scientists, Dr. Dean De Cock of Truman State University compiled the Ames, Iowa Housing dataset. It contains 2,930 observations and a large number of explanatory variables (23 nominal, 23 ordinal, 14 discrete, and 20 continuous) involved in assessing home values. The data serves as the basis for one of the more notable competitions on Kaggle and has produced thousands of individual kernels, applying a wide variety of methodologies to predict the sale price of homes in Ames from the 81 variables provided.
Exploring and analyzing the data
Looking for Outliers
Let's get a quick look at sale price in relation to above-grade living area (GrLivArea). On the left, we can see two observations with high GrLivArea values and low SalePrice values. These data points do not follow the general linear trend we see in the rest of the data, so we will remove them from our dataset in the anticipation that doing so will allow us to build a stronger prediction model later on. The scatterplot on the left shows the dataset before the outliers were removed, and the plot on the right shows the dataset after cleaning.
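The removal step can be sketched as below. The toy frame and the 4,000 sq ft / $300,000 cutoffs are illustrative assumptions, not the exact thresholds used; the idea is simply to drop rows that combine very large living area with unusually low price.

```python
import pandas as pd

# Toy stand-in for the Ames training data (illustrative values only)
train = pd.DataFrame({
    "GrLivArea": [1500, 1800, 2200, 4700, 5600],
    "SalePrice": [180000, 210000, 250000, 160000, 184750],
})

# Drop observations with very large living area but unusually low sale price
outliers = train[(train["GrLivArea"] > 4000) & (train["SalePrice"] < 300000)].index
train = train.drop(outliers)
```

A scatterplot of `GrLivArea` against `SalePrice` before and after this drop reproduces the pair of plots described above.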
A look at the target variable: Sale Price
Let's start with a simple histogram of the Sale Price variable. From this histogram we can deduce that it deviates from the normal distribution, has an appreciable positive skewness, and displays peakedness (high kurtosis).
In order to properly analyze, process, and utilize the Sale Price data, we need to transform it to better represent a normal distribution. In this case, we will apply a log transformation of the data. The figures below reflect the distribution and probability plots after log transformation of the Sale Price variable.
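The transform itself is one line with NumPy; the prices below are made-up stand-ins, but the effect on skewness is the same as on the real target.

```python
import numpy as np
import pandas as pd

# Illustrative right-skewed prices standing in for SalePrice
prices = pd.Series([100000, 150000, 200000, 450000, 755000])

# log1p = log(1 + x); it compresses the long right tail
log_prices = np.log1p(prices)
```

The transform is invertible with `np.expm1`, which is how predictions on the log scale are mapped back to dollars before submission.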
The skew now appears to be fixed and the data appears to be more normally distributed.
Feature analysis and engineering
Now that our target variable is transformed and approximately normally distributed, we can examine our predictor variables to discover any missingness or collinearity, and engineer new variables that better represent our predictors without suffering from the curse of dimensionality.
Looking at correlation between the predictors and target
This correlation plot shows the relationships among a selection of variables, including the predictors (x) and the target (y). We can quickly visualize relationships between predictor variables, such as "GarageYrBlt" and "YearBuilt". These two variables appear to be highly positively correlated, showing up as a lighter, near-white square (>.75 correlation) in the matrix. We are also interested in variables that are highly negatively correlated with one another, such as "EnclosedPorch" and "YearBuilt"; these appear as dark squares (~-.50 correlation) in the correlation plot.
We should also take note of the correlation between SalePrice and the selected predictor variables shown in the correlation plot. This may give us a quick indication as to what variables may be important when we begin to put our prediction models together.
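The matrix behind such a plot is just `DataFrame.corr()`. The sketch below uses synthetic columns built to mimic the two relationships called out above (garage year tracking build year; enclosed porches on older homes); the real analysis runs the same call on the Ames frame.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200
year_built = rng.integers(1900, 2010, size=n)

# Synthetic stand-ins: GarageYrBlt tracks YearBuilt closely,
# EnclosedPorch is larger on older homes
df = pd.DataFrame({
    "YearBuilt": year_built,
    "GarageYrBlt": year_built + rng.integers(0, 5, size=n),
    "EnclosedPorch": np.clip(2010 - year_built + rng.normal(0, 20, n), 0, None),
})

corr = df.corr()
# A heatmap library (e.g. seaborn's heatmap) renders this matrix
# as the correlation plot discussed above.
```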
Searching for missing values in our predictor variables
We did a thorough search of the dataframe to discover which variables, if any, were missing values for any observation, and calculated the percentage of observations with missing data for each such variable.
There are several features (PoolQC, MiscFeature, Alley, Fence, and FireplaceQu) where at least 50% of the observations for that variable include missing data. We may consider removing these variables altogether as we progress through the feature engineering process, but for now we will impute the missing values.
Imputing missing values
We found that many of the missing values for categorical variables were not, in fact, missing. Rather, "None" was one of the response categories for that feature, mainly because the house lacked the feature in question, such as a basement or garage. For those variables, such as BsmtQual and GarageType, we imputed the value "None" so that it was recognized as a string rather than a missing value. The same issue with absent house features led us to impute zero for numerical variables such as GarageArea and TotalBsmtSF. Lastly, we imputed the mode for categorical variables such as Electrical and Utilities, because the large majority of those values fell into a single category and very few observations were missing.
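The three imputation rules look like this in pandas (toy values, same column names as the dataset):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "BsmtQual": ["Gd", np.nan, "TA"],        # NA means "no basement"
    "GarageArea": [548.0, np.nan, 608.0],    # NA means "no garage"
    "Electrical": ["SBrkr", "SBrkr", np.nan] # genuinely missing, rare
})

# Rule 1: categorical NA that means "no such feature" -> the string "None"
df["BsmtQual"] = df["BsmtQual"].fillna("None")
# Rule 2: numeric NA for an absent feature -> zero
df["GarageArea"] = df["GarageArea"].fillna(0)
# Rule 3: few missing, one dominant category -> the mode
df["Electrical"] = df["Electrical"].fillna(df["Electrical"].mode()[0])
```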
Engineering new features
To handle any potential multicollinearity, we combined several sets of similar predictor variables to produce the sum of their overall values. For example, we combined the total square footage variables for each observation into one variable labeled "TotalSF". Additionally, we combined the total count of various bathrooms into one variable labeled "TotalBath".
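The two combined features can be built as simple column sums. Weighting half baths at 0.5 is an assumption on our part; a plain count works as well.

```python
import pandas as pd

df = pd.DataFrame({
    "TotalBsmtSF": [856, 1262],
    "1stFlrSF": [856, 1262],
    "2ndFlrSF": [854, 0],
    "FullBath": [2, 2],
    "HalfBath": [1, 0],
    "BsmtFullBath": [1, 0],
    "BsmtHalfBath": [0, 1],
})

# One square-footage feature instead of three correlated ones
df["TotalSF"] = df["TotalBsmtSF"] + df["1stFlrSF"] + df["2ndFlrSF"]
# One bathroom count, half baths weighted at 0.5 (an assumption)
df["TotalBath"] = (df["FullBath"] + 0.5 * df["HalfBath"]
                   + df["BsmtFullBath"] + 0.5 * df["BsmtHalfBath"])
```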
Looking for skewness in predictor variables
A quick look at the skewness of our features shows that 59 numerical variables are highly skewed and will require a Box-Cox transformation.
We will use the scipy function boxcox1p which computes the Box-Cox transformation of 1 + x. Note that setting λ=0 is equivalent to log1p used above for the target variable.
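A sketch of that step, run here on a single made-up column. The 0.75 skew threshold and λ = 0.15 are the conventions commonly seen in Kaggle kernels for this competition, not values dictated by scipy.

```python
import pandas as pd
from scipy.special import boxcox1p
from scipy.stats import skew

# One illustrative, heavily right-skewed feature
df = pd.DataFrame({"LotArea": [1300, 8450, 9600, 11250, 215000]})

# Measure skewness of each numeric column
skewness = df.apply(lambda col: skew(col.dropna()))
skewed = skewness[abs(skewness) > 0.75].index

lam = 0.15  # common fixed lambda in Kaggle kernels; lam = 0 reduces to log1p
for feat in skewed:
    df[feat] = boxcox1p(df[feat], lam)
```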
The dataset is now clean and ready for prediction!
Applying machine learning models
Once we had a clean dataset to deploy various models on, our team branched out and took unique approaches before coming back together to build the best overall model out of the sum of our parts. Below, each author will briefly discuss their model selection and application process.
I used a feature engineering approach similar to my teammates', with one addition: I decided to try sklearn's Recursive Feature Elimination (RFE). RFE is first trained on the initial set of features and assigns weights; the least important features are then pruned from the current set, and the remaining features are ranked by importance. Based on that ranking, I used the features that ranked 10 or better. With the ranked features in hand, I tested Multiple Linear Regression and Ridge Regression as my models. They did not perform as well as expected, so I will need to continue feature engineering.
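A minimal RFE sketch with scikit-learn, on synthetic data standing in for the engineered features (the estimator choice and target feature count are illustrative):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for the engineered Ames feature matrix
X, y = make_regression(n_samples=200, n_features=20,
                       n_informative=5, random_state=42)

# Recursively drop the least important feature until 10 remain
rfe = RFE(estimator=LinearRegression(), n_features_to_select=10)
rfe.fit(X, y)

selected = np.where(rfe.support_)[0]  # indices of the kept (rank-1) features
X_reduced = X[:, rfe.support_]
```

Note that in scikit-learn's convention the *selected* features all receive rank 1, and discarded features get ranks 2, 3, … in order of elimination.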
Additional Feature Engineering Note: In addition to the steps described above (combining variables that describe similar aspects of the house), I converted ordinal variables with string values into numerical values so that I could include them in a correlation matrix. Before and after dummification, I dropped variables whose correlation with SalePrice was lower than that of the Id column, reasoning that anything less correlated than random noise would not provide useful information for predicting SalePrice.
Models: Since there were signs of multicollinearity in the data, as seen in the correlation matrix, I knew that penalized regression would help mitigate this violation of the assumptions of multiple linear regression. Also, having observed other submissions achieve worse outcomes with tree-based models, I was fairly confident that penalized regression would deliver the best predictive performance. I therefore performed Ridge and Lasso regressions, with similar results.
For Ridge regression, I tested 10,000 different alpha values between 10^(-5) and 30 and the best alpha, where the test RMSE was closest to the train RMSE (a sign of not overfitting or underfitting), turned out to be 14.4434. The model score was 93.14% and the resulting Kaggle score (RMSE) was 0.12223.
Lasso regression produced similar results. After testing 1,000 values of alpha between 10^(-4) and 10^(-3), the best alpha was 4.360E-4, the model score was 93.01%, and the resulting Kaggle score was 0.12239. The Ridge model was therefore slightly better than the Lasso model in this case.
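Scikit-learn's cross-validated estimators make this alpha sweep compact. The sketch below uses synthetic data and much smaller grids than the 10,000- and 1,000-value sweeps described above, but the mechanics are the same.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV, RidgeCV

# Synthetic stand-in for the processed training data
X, y = make_regression(n_samples=300, n_features=30,
                       noise=10.0, random_state=0)

# Cross-validated alpha search over the ranges discussed above
ridge = RidgeCV(alphas=np.logspace(-5, np.log10(30), 100)).fit(X, y)
lasso = LassoCV(alphas=np.logspace(-4, -3, 50),
                max_iter=10000, random_state=0).fit(X, y)
```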
The better score of 0.12223 gave us a rank of 1328 out of 4790 competitors, within the top 28%.
After narrowing down the continuous, numerical features in our feature selection and engineering phase, I performed a principal component analysis with the aim of creating new features that I could add back into my dataset and use in the final model. Using PCA, a total of 5 new features accounted for over 80% of the explained variance in the data. Taking these 5 new features from the PCA, along with the continuous and categorical variables from previous steps, I moved on to building a model.
Because the new features were sure to introduce multicollinearity into the data, I used linear regression with an elastic net penalty term and a random forest regressor, both of which are more robust to correlated features. Using a grid search to optimize the hyperparameters, I found the optimal values of lambda and rho for the elastic net, and the optimal number of trees, minimum samples per split, and minimum samples per leaf for the random forest regressor. To ensure the hyperparameters were sufficiently optimized, I started the grid search with a broad range of values and narrowed the search as the results came back. Optimizing for R², mean squared error, and mean absolute error, I found the best model for minimizing each of the loss functions.
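The two grid searches can be sketched as follows. The parameter grids are deliberately tiny illustrations of the "broad, then narrow" search, not the grids actually used; in scikit-learn's `ElasticNet`, `alpha` plays the role of lambda and `l1_ratio` the role of rho.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the PCA-augmented feature matrix
X, y = make_regression(n_samples=200, n_features=15,
                       noise=5.0, random_state=2)

# Elastic net: alpha ~ lambda (penalty strength), l1_ratio ~ rho (L1/L2 mix)
enet_search = GridSearchCV(
    ElasticNet(max_iter=10000),
    {"alpha": [0.01, 0.1, 1.0], "l1_ratio": [0.2, 0.5, 0.8]},
    scoring="neg_mean_squared_error", cv=3,
).fit(X, y)

# Random forest: trees, min samples per split, min samples per leaf
rf_search = GridSearchCV(
    RandomForestRegressor(random_state=2),
    {"n_estimators": [50, 100], "min_samples_split": [2, 5],
     "min_samples_leaf": [1, 2]},
    scoring="neg_mean_squared_error", cv=3,
).fit(X, y)
```

Swapping `scoring` for `"r2"` or `"neg_mean_absolute_error"` repeats the search under the other loss functions mentioned above.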
Not surprisingly, the linear regression performed better than the random forest model under each loss function. There are a few possible explanations, but the most likely is that the relationship between house prices and the explanatory features is linear, so linear regression explains it best. The predictions submitted to Kaggle using the linear regressor and random forest models resulted in RMSEs of .14 and .20, respectively.
I applied base models from the scikit-learn package, including ElasticNet, Lasso, Kernel Ridge, Gradient Boosting, XGBoost, and LightGBM. The models were made more robust to outliers by applying sklearn's RobustScaler() within make_pipeline(). By doing so, I hoped to reduce each model's sensitivity to potential outliers. After evaluating the cross-validated RMSLE for each model, the results are as follows:
- Lasso: 0.1129
- ElasticNet: 0.1115
- Kernel Ridge: 0.1166
- Gradient Boosting: 0.1192
- XGBoost: 0.1161
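The pipeline-plus-scoring setup for the scikit-learn models can be sketched as below (XGBoost and LightGBM follow the same pattern but live in third-party packages, so they are omitted here). The alpha and kernel settings are illustrative assumptions, not the tuned values behind the scores above; because the target was log-transformed, the cross-validated RMSE computed here plays the role of the RMSLE.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.kernel_ridge import KernelRidge
from sklearn.linear_model import ElasticNet, Lasso
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler

# Synthetic stand-in for the cleaned features and log-scale target
X, y = make_regression(n_samples=200, n_features=20,
                       noise=5.0, random_state=3)

# RobustScaler uses median/IQR, so extreme values shift the scaling less
models = {
    "Lasso": make_pipeline(RobustScaler(),
                           Lasso(alpha=0.0005, max_iter=10000)),
    "ElasticNet": make_pipeline(RobustScaler(),
                                ElasticNet(alpha=0.0005, l1_ratio=0.9,
                                           max_iter=10000)),
    "KernelRidge": make_pipeline(RobustScaler(),
                                 KernelRidge(alpha=0.6, kernel="polynomial",
                                             degree=2, coef0=2.5)),
}

# Cross-validated RMSE per model (RMSLE when y is on the log scale)
scores = {name: np.sqrt(-cross_val_score(
              m, X, y, cv=5, scoring="neg_mean_squared_error")).mean()
          for name, m in models.items()}
```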
Conclusion and Future Work
While the models tested above proved fairly accurate, their performance kept us just outside the top 20% of submissions in the active Kaggle competition. Many of the higher-ranked submissions utilize stacked models and variations of random forest regressors, which tend to perform well on Kaggle overall.
Our future work to improve the model will include building robust stacked models. One such approach is to build a new class that averages the predictions of multiple models accepted as arguments. Follow our GitHub to review our Python code, and check back periodically for updates to our research!
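A minimal sketch of that averaging class, written against scikit-learn's estimator interface (the class name and the choice of base models are our own illustration, not a finished implementation):

```python
import numpy as np
from sklearn.base import BaseEstimator, RegressorMixin, clone
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet, Lasso, Ridge

class AveragingModels(BaseEstimator, RegressorMixin):
    """Fit several regressors and predict with the mean of their predictions."""

    def __init__(self, models):
        self.models = models

    def fit(self, X, y):
        # Clone so the estimators passed in are left untouched
        self.models_ = [clone(m) for m in self.models]
        for model in self.models_:
            model.fit(X, y)
        return self

    def predict(self, X):
        # Column-stack each model's predictions, then average row-wise
        predictions = np.column_stack([m.predict(X) for m in self.models_])
        return predictions.mean(axis=1)

# Quick smoke test on synthetic data
X, y = make_regression(n_samples=200, n_features=10, noise=5.0, random_state=4)
averaged = AveragingModels(
    [Ridge(alpha=1.0), Lasso(alpha=0.01), ElasticNet(alpha=0.01)]
).fit(X, y)
preds = averaged.predict(X)
```

Because it subclasses `BaseEstimator` and `RegressorMixin`, the wrapper works inside `cross_val_score` and `GridSearchCV` like any other scikit-learn regressor, which is what makes it a convenient building block for the stacked models we plan to explore.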