Machine Learning Project: Predicting House Prices
Overview:
In this project, we predicted home prices as part of an ongoing Kaggle data science competition. The project uses a rich dataset that records a large number of variables for each house sale. To structure the process, our team broke the project into the following broad stages: pre-processing and exploratory data analysis, feature engineering, modeling, and evaluation. We summarize our steps below.
Exploratory Data Analysis:
The dataset we used in this project was originally published by Dean de Cock (2011) and contains information on house sales in Ames, Iowa. The data were split into a training dataset containing 1,460 sales and a test dataset with 1,419 sales. There were 80 variables, including the sale price: 20 continuous, 14 discrete, 23 nominal, and 23 ordinal.
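As a minimal sketch of loading the data and taking a first cut at variable types (the file names are the standard Kaggle download names, which is an assumption here; the finer continuous/discrete and nominal/ordinal split follows the data dictionary rather than pandas dtypes):

```python
import pandas as pd

# Load the competition files; file names assumed to match the Kaggle download.
train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")
print(train.shape, test.shape)

# Rough first cut at variable types: pandas dtypes separate numeric columns
# from string-coded (nominal/ordinal) ones.
numeric_cols = train.select_dtypes(include="number").columns
categorical_cols = train.select_dtypes(include="object").columns
print(len(numeric_cols), "numeric columns;", len(categorical_cols), "categorical columns")
```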
Our initial step was to summarize the distribution of these variables to assess whether any transformations would be required during the feature engineering phase of the project. As the sale price is what we seek to predict, it is a natural starting point to explore. We found that the distribution of sale prices in our data was quite skewed and clearly needed to be transformed. The figure below shows the raw sale price distribution. We therefore chose a log transformation so that the transformed outcome would be approximately normally distributed.
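A minimal sketch of this transformation, assuming the Kaggle column name SalePrice and using log1p (a plain log works equally well since prices are positive):

```python
import numpy as np
from scipy.stats import skew

# Log-transform the target; "SalePrice" is the Kaggle column name.
print("Skewness before:", round(skew(train["SalePrice"]), 2))
train["LogSalePrice"] = np.log1p(train["SalePrice"])
print("Skewness after:", round(skew(train["LogSalePrice"]), 2))
```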
We then also sought to better understand the correlation among the variables in our dataset. The heatmap below illustrates this structure. In this chart, darker (lighter) colors indicate a larger (smaller) correlation between two variables. The bottom row indicates the correlation between sale price and the various features included in the data. The correlation structure confirms one’s intuition, with variables such as overall quality, age, and size being highly correlated with price.
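The heatmap can be reproduced along the following lines, continuing from the loading sketch above (seaborn and the color palette are assumptions, not necessarily the styling of the original figure):

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Correlation matrix over the numeric columns; the SalePrice row/column shows
# how strongly each feature co-moves with price.
corr = train.select_dtypes(include="number").corr()

plt.figure(figsize=(12, 10))
sns.heatmap(corr, cmap="viridis", square=True)
plt.title("Correlation heatmap of numeric variables")
plt.tight_layout()
plt.show()
```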
Nonetheless, the figure above also illustrates the need to carefully examine the data before feature engineering: there are two variables that are supposed to measure quality, overall quality and overall condition. Interestingly, the overall quality variable is highly correlated with price whereas the overall condition variable is not. The figure also shows that, in other cases, several variables measure similar aspects of the house and are all similarly correlated with sale price (e.g., garage size measured both in square feet and in number of cars).
We also visually examined how our features correlated with sale price. Below we show the plot that relates the log sale price and total square feet. The heatmap above indicates a positive correlation, but the chart below helps inform whether a linear model is likely to predict house prices well. While there are some outliers in the relationship between price and house size, the chart suggests that a linear relationship should do a good job at capturing house prices in our testing.
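A sketch of this scatter plot, assuming the Kaggle column names for the square-footage components and continuing from the earlier snippets:

```python
# Log sale price against a combined square-footage measure, with a fitted line
# to eyeball how linear the relationship is.
total_sf = train["1stFlrSF"] + train["2ndFlrSF"] + train["TotalBsmtSF"]

sns.regplot(x=total_sf, y=np.log1p(train["SalePrice"]),
            scatter_kws={"alpha": 0.3}, line_kws={"color": "red"})
plt.xlabel("Total square feet")
plt.ylabel("Log sale price")
plt.show()
```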
Beyond the linear relationship above, we also examined prices across different neighborhoods, as illustrated below. This figure shows considerable variation across neighborhoods: for example, the median home price in some neighborhoods is nearly three times that in others. The figure also shows that some neighborhoods exhibit much more price variation than others.
Pre-Processing and Imputation:
The degree of missingness varied throughout the data and required careful handling. The figure below illustrates the large disparity in incomplete observations within the data. For example, nearly all observations for the pool, alley, and fence variables are incomplete. The other variable with a significant percentage of incomplete observations was fireplace quality (roughly 50% missing).
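The missingness percentages behind this figure can be computed directly, continuing the earlier sketch:

```python
# Share of missing values per column, sorted from most to least incomplete;
# this ranking drives the drop/impute decisions described below.
missing_pct = train.isna().mean().sort_values(ascending=False) * 100
print(missing_pct[missing_pct > 0].round(1))
```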
We took several different approaches to deal with missing observations, and later for feature engineering, depending on the context. Our analysis started by identifying the type of each variable (e.g., numerical versus categorical) and then developing sensible approaches for handling its missing observations. A useful breakdown of the decisions we made reflects the various data types we encountered. Numerical variables generally fell into the following groups: areas, counts, dates, ratings, and monetary values. Categorical variables generally fell into conventional categorical variables and ratings.
Imputation of Missing Data – Categorical Variables:
We imputed categorical variables with missing values keeping several considerations in mind: the number of categories within the variable, how well represented the most common category was relative to the missing observations, and the frequency of missing values in the variable. The table below provides visibility into our methodology.
In this table, the left-most column in each panel lists a variable with missing observations. The next three columns report the percent of missing observations for that variable, the percent of observations with the most common label, and the number of category values. We color-coded each variable according to whether (1) the variable was dropped (red), (2) missing values were imputed with the median value of the non-missing observations (yellow), or (3) missing values were imputed with a value of zero (green).
Our general approach can be summarized as follows: we dropped variables with a very large percentage of missing observations, and we used the median value for imputation when a variable had only a limited number of missing values (a short code sketch of this logic follows the list below). Careful examination of the data through this process led to several discoveries that helped formalize both our imputation process and our feature engineering approach:
- Many garage-related variables had missing values within observations that actually had a garage
- In many cases where basement-related variables were missing, one could infer from other fields for the same house that it lacked a basement altogether
- In some cases, multiple variables seemed to capture the same attribute(s) of the house, with some appearing more useful than others
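The sketch referenced above might look as follows; the column lists are illustrative rather than the project's full lists, and the "median category" rule is approximated with the most frequent label since these columns are still string-coded at this point:

```python
# Sketch of the three color-coded rules for categorical columns.
drop_cols = ["PoolQC", "Alley", "Fence"]                      # almost entirely missing
common_impute = ["Electrical", "MasVnrType"]                  # few missing values
absence_impute = ["GarageType", "BsmtQual", "FireplaceQu"]    # missing means "not present"

train = train.drop(columns=drop_cols)
for col in common_impute:
    train[col] = train[col].fillna(train[col].mode()[0])
for col in absence_impute:
    train[col] = train[col].fillna("None")   # later encoded as 0
```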
Imputation of Missing Data – Numerical Variables:
Imputation of missing values for the numerical variables was simpler than for the categorical variables. Below we present a table similar to the one used for the categorical variables, showing how we imputed several key variables in our analysis.
The left-most column is the variable name, followed by the percent of missing observations (middle column) and the number of missing values (last column). Only two variables feature a significant degree of missing values: lot frontage and garage year built. Missing values for most variables were imputed, with the one exception being the garage year built variable, which was dropped for two reasons: its high correlation (~0.84) with the year the house was built, and the lack of an obvious imputation method.
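A sketch of the numerical imputation step; dropping garage year built follows the reasoning above, while median imputation for lot frontage and zero-filling the remaining small gaps are assumed defaults rather than the project's exact rules:

```python
# Confirm the motivation for dropping GarageYrBlt, then impute the rest.
print(train[["GarageYrBlt", "YearBuilt"]].corr().iloc[0, 1])   # roughly 0.84

train = train.drop(columns=["GarageYrBlt"])
train["LotFrontage"] = train["LotFrontage"].fillna(train["LotFrontage"].median())

# Remaining numeric gaps (e.g., masonry veneer area) are small and filled with zero here.
num_cols = train.select_dtypes(include="number").columns
train[num_cols] = train[num_cols].fillna(0)
```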
Feature Engineering:
Our feature engineering process can be summarized as follows. We found that many variables reflect a common home feature, and it was useful to group them into the following categories (the number of variables in each category is also shown):
| Category | No. of variables |
| --- | --- |
| Sq. footage | 18 |
| Basement | 11 |
| Outdoor space | 10 |
| Garage | 7 |
| Room counts | 7 |
Based on this categorization, for example, we developed two features that we felt better reflect what a home buyer is interested in:
- Total square feet = first floor + second floor + basement square feet
- Outdoor square feet = wood deck + open porch + enclosed porch + 3ssn porch + screen porch square feet
All components of the hand-crafted features were subsequently dropped from the dataset. Other hand-crafted features were considered; for example, we considered combining full and half bathrooms into one variable, but ultimately left them separate because combining them would implicitly assume that two half bathrooms have the same effect as one full bathroom. The twelve remaining categorical variables that represent ratings were encoded ordinally, with missing values encoded as 0. The other remaining categorical variables were one-hot encoded. This process resulted in 243 features overall.
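A sketch of the hand-crafted features and the encoding step, assuming the Kaggle column names and showing only an illustrative subset of the twelve rating variables:

```python
# Hand-crafted size features, then drop their components.
sf_parts = ["1stFlrSF", "2ndFlrSF", "TotalBsmtSF"]
porch_parts = ["WoodDeckSF", "OpenPorchSF", "EnclosedPorch", "3SsnPorch", "ScreenPorch"]

train["TotalSF"] = train[sf_parts].sum(axis=1)
train["OutdoorSF"] = train[porch_parts].sum(axis=1)
train = train.drop(columns=sf_parts + porch_parts)

# Ratings map onto a 0-5 scale, with missing/"None" becoming 0;
# everything else is one-hot encoded.
quality_scale = {"None": 0, "Po": 1, "Fa": 2, "TA": 3, "Gd": 4, "Ex": 5}
rating_cols = ["ExterQual", "ExterCond", "BsmtQual", "BsmtCond", "HeatingQC",
               "KitchenQual", "FireplaceQu", "GarageQual", "GarageCond"]
for col in rating_cols:
    train[col] = train[col].fillna("None").map(quality_scale)

train = pd.get_dummies(train)
print(train.shape)
```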
Modeling Stage:
We implemented a range of models to predict housing prices. Our exploratory data analysis highlighted that, for many key features, the relationship with (log) house prices was close to linear. As such, we started with a simple linear model using ordinary least squares (OLS). We also considered a couple of other models as starting points, including a decision tree and gradient boosting models.
While useful starting points, these models were further improved to enhance out-of-sample predictions. We approached this in several ways. First, we utilized linear models with added penalties (e.g., Lasso and Ridge) and a random forest model. Second, for any given model, we performed cross-validation within a parameter grid search to select the hyperparameters that generate the best predictions (i.e., the lowest root mean-squared error) in our cross-validation sample. Lastly, we explored ensembling the models in different ways in order to leverage the advantages of each and thus (hopefully!) make better predictions. Our results are summarized in the table further below.
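Before turning to the results, here is an illustration of the tuning step for one of the penalized linear models; the Id column, the alpha grid, and the choice of 10 folds are assumptions, and the same pattern extends to the Lasso and the tree-based models with their own parameter grids:

```python
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Grid search over the penalty strength, scored on RMSE of log sale price
# (sklearn maximizes, so the metric is negated).
X = train.drop(columns=["Id", "SalePrice", "LogSalePrice"])
y = train["LogSalePrice"]

cv = KFold(n_splits=10, shuffle=True, random_state=0)
pipe = make_pipeline(StandardScaler(), Ridge())
param_grid = {"ridge__alpha": np.logspace(-3, 3, 13)}

search = GridSearchCV(pipe, param_grid, cv=cv,
                      scoring="neg_root_mean_squared_error")
search.fit(X, y)
print(search.best_params_, "CV RMSE:", -search.best_score_)
```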
For each model, the table below reports the root mean-squared error (RMSE) on the cross-validation test folds: the errors shown are the average test error across the k folds. In general, we found that our OLS model performed relatively well. This model had a test RMSE of 0.12729, which was similar to many of the more complicated models we fit. We found that both the decision tree and random forest models tended to overfit the training data to some degree.
| Model | Test RMSE |
| --- | --- |
| OLS | 0.12729 |
| Ridge | 0.1193 |
| Lasso | 0.1240 |
| Decision tree | 0.1880 |
| Random forest | 0.1431 |
| LightGBM | 0.1155 |
| XGBoost | 0.1146 |
| Ensemble | 0.11858 (Kaggle) |
Across all of the models, one of our gradient boosting models (XGBoost) yielded a test RMSE of 0.1146, which was the lowest overall on the cross-validation test data. Using the Kaggle test data, however, our best model was a simple ensemble of our Lasso and Ridge predictions.
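A sketch of such a blend, assuming equal weights for the Lasso and Ridge predictions (the write-up only says "simple ensemble") and reusing X, y, and cv from the grid-search sketch above:

```python
from sklearn.linear_model import LassoCV, RidgeCV

# Fit both penalized linear models, then average their log-price predictions.
lasso = make_pipeline(StandardScaler(), LassoCV(cv=cv, random_state=0)).fit(X, y)
ridge = make_pipeline(StandardScaler(), RidgeCV(alphas=np.logspace(-3, 3, 13))).fit(X, y)

def blend_predictions(X_new):
    """Average the two models' log-price predictions and undo the log1p transform."""
    log_pred = 0.5 * lasso.predict(X_new) + 0.5 * ridge.predict(X_new)
    return np.expm1(log_pred)

# In practice X_new is the Kaggle test set run through the same preprocessing as X.
print(blend_predictions(X)[:5])
```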
Variable Importance:
While the ultimate goal of this project was predictive in nature, we also examined which variables were most important for predicting house prices. We first looked at the effect regularization had on the coefficients of the variables. The graphs below illustrate the impact of regularization on both the Lasso and Ridge coefficients. In these charts, we have normalized the explanatory variables to allow for meaningful comparisons of the coefficients, and we also report the best value of lambda as determined by a 10-fold cross-validation grid search.
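The coefficient-shrinkage plots can be produced along these lines (a sketch for the Lasso; the same loop with Ridge gives the companion figure, and the alpha range is an assumption):

```python
from sklearn.linear_model import Lasso

# Trace how the standardized Lasso coefficients shrink as lambda grows.
alphas = np.logspace(-4, 0, 25)
X_std = StandardScaler().fit_transform(X)

coef_paths = [Lasso(alpha=a, max_iter=10000).fit(X_std, y).coef_ for a in alphas]

plt.plot(alphas, coef_paths)
plt.xscale("log")
plt.xlabel("lambda")
plt.ylabel("standardized coefficient")
plt.title("Lasso coefficient paths")
plt.show()
```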
These charts illustrate that the best value of lambda for the Lasso model is lower than that of the Ridge model, but it is difficult to tell from these graphs alone which coefficients are meaningfully different from zero. To that end, the following chart shows the top eight variables that the Lasso model chose as having a positive impact on house prices, along with the values of the same coefficients from the Ridge model. While we initially thought to extract the p-values of the coefficients, we settled for simply comparing the magnitudes of the coefficients themselves.[1] As such, our interpretation is qualified to some degree by the mental asterisk that size does not necessarily imply significance.
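Extracting and comparing the top positive coefficients is straightforward with the fitted pipelines from the ensemble sketch above:

```python
# Line up the eight largest positive Lasso coefficients with the Ridge
# coefficients for the same standardized features.
lasso_fit = lasso.named_steps["lassocv"]
ridge_fit = ridge.named_steps["ridgecv"]

coef_table = pd.DataFrame({"lasso": lasso_fit.coef_, "ridge": ridge_fit.coef_},
                          index=X.columns)
print(coef_table.sort_values("lasso", ascending=False).head(8))
```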
According to the Lasso model, the variables that were most informative in predicting house prices were the neighborhood (with houses in the Crawford, Stone Brooke, and North Ridge Heights neighborhoods more expensive on average), exterior type (with houses having brick face exteriors being more expensive on average), and overall quality. The Ridge coefficients for these same variables, while different, largely support the Lasso model's selection of impactful variables. Interestingly, when comparing the variables the Ridge model weighted most heavily with the values the Lasso model assigned to them, a divergence emerges:
The ridge model appears to have selected variables that intuitively should have little impact on the price of a house, or at least not be major drivers of housing prices. Indeed, upon closer inspection of the data, the coefficients selected by only the ridge model each correspond to only a handful of observations, and thus should probably be ignored. When looking at variables with the top negative impact on house prices, however, there is agreement between these models as shown in the figure below.
Both the Lasso and Ridge models identified the same top factors negatively affecting house prices, but as with the positive coefficients chosen by the Ridge model, most correspond to a handful of observations and should probably be dismissed. We then examined variable importance across five different models: Lasso, Ridge, random forest, LightGBM, and XGBoost. While the importance metrics of different models are not directly comparable, some trends emerge: neighborhood, overall quality, total square footage, and garage area are among the more important drivers of house prices. The figure below captures the important features across these models. This largely agrees with the intuition we developed for these variables during the EDA phase of the project.
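A sketch of such a comparison, using the fitted linear models from above plus a random forest; LightGBM and XGBoost expose feature_importances_ in the same way as the random forest:

```python
from sklearn.ensemble import RandomForestRegressor

# Impurity-based importances and absolute coefficients are not on a common
# scale, so only the within-model ranking is meaningful.
rf = RandomForestRegressor(n_estimators=300, random_state=0).fit(X, y)

importance = pd.DataFrame({
    "lasso_abs_coef": np.abs(lasso_fit.coef_),
    "ridge_abs_coef": np.abs(ridge_fit.coef_),
    "rf_importance": rf.feature_importances_,
}, index=X.columns)

for col in importance.columns:
    print(col, importance[col].nlargest(5).index.tolist())
```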
Conclusion:
In summary, our goal was to predict house prices using a rich dataset on home sales in Iowa. We explored a range of models, from relatively simple to more complex, with the goal of best predicting home prices while also exercising different data science skills in the process. We found that a simple linear model performed quite well among the class of models we considered. Some more complicated models fared relatively worse and some relatively better than our simple approach. Overall, the results illustrate the value of exploring a range of different models and methods in this process.
[1] While this paper proposes a method, neither sklearn nor statsmodels has implemented it.