Kaggle Competition: The Strength of Linear Models in Predicting Housing Prices in Ames, Iowa
The Ames Housing Dataset was introduced by Professor Dean De Cock in 2011. It contains 2,919 observations of housing sales in Ames, Iowa between 2006 and 2010. There are 23 nominal, 23 ordinal, 14 discrete, and 20 continuous features describing each house’s size, quality, area, age, and other miscellaneous attributes. For this project, our objective was to apply machine learning techniques to predict the sale price of houses based on their features. Here is the procedure we followed to tackle the project.
- Understanding the task at hand
- If we had the resources, we would start by digging deep to understand the root cause of the problem we are trying to solve. This is imperative because we would want to make sure that our model isn’t simply solving a symptom of a larger problem. The model should address the root cause of the problem.
- Propose/brainstorm solutions
- For this project, we knew that the solution to predicting sale price of a house would need to employ various machine learning techniques. If we were data scientists at a company, however, a fundamental part of proposing a solution would include getting buy-in from stakeholders. This may mean performing a cost/benefit analysis to demonstrate that the benefits of the model would outweigh the huge upfront cost.
- Collect Data
- Fortunately for us, the data was readily available and didn’t require any consolidation. If we were working at a company on this problem, it’s possible we would have to acquire data from every corner of the company and maybe even get ahold of industry-wide data. If we had acquired our own data, it would be important that we understand how to join the data and that we have discussions with the various data owners so that we can compile a comprehensive data dictionary that would be useful for future steps of the modelling process.
- EDA
- In order to get a thorough understanding of the data, we performed exploratory data analysis on all the variables in our data set. The EDA we did involved:
- Identify and treat missing values within each variable
- Identify and treat outliers
- Frequency charts of all the variables to see how they are distributed
- Graphs of the target variable vs. each feature to see which features may be most predictive
- Correlation matrix to see how variables are correlated and identify if multicollinearity could be an issue.
- Model training and variable selection
- We trained various models to predict the sale price of a house. We let the models systematically select features that would be predictive of sale price and confirmed that these features lined up with our expectations from the EDA.
- Model validation
- We validated our models by performing k-fold cross-validation with k = 5, using root mean squared error (RMSE) as the evaluation metric.
- Pilot Plan
- Had we created this model for a company, we would have worked with the business to figure out how to best implement the model. It would be important that we minimize business interruption of the current process. We would create a pilot tool for our customers to use and evaluate the impact of the pilot.
- Valuation of model benefits
- After rolling out the pilot plan and correcting any errors with the pilot, we would implement the model as a whole. In this step, we would compare the benefits of the new model against the old model to ensure that the new model does in fact add value to the business.
- Management and monitoring of full implementation
- Last, we would create a schedule to regularly monitor the model’s performance and update the model as needed.
We coded everything in Python using classes. This made it easy for group members to add methods to the classes, and it kept our code well organized.
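As a rough, hypothetical sketch of what that organization looked like (the class and method names below are illustrative, not our actual code):

```python
import pandas as pd

# Hypothetical sketch of the class-based organization; names are illustrative only.
class HousePricePipeline:
    """Each group member could add their own methods to this class."""

    def __init__(self, train_path: str, test_path: str):
        self.train = pd.read_csv(train_path)
        self.test = pd.read_csv(test_path)

    def clean(self):
        # missing-value treatment and outlier handling go here
        ...

    def engineer_features(self):
        # encodings, transformations, and feature interactions go here
        ...

    def fit_models(self):
        # individual models, stacking, and ensembling go here
        ...
```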
Cleaning/EDA
Let’s first go into how we handled the missing values in this dataset. Below are two visualizations that show which parameters had missing values. The nullity correlation matrix on the right lets us see which values tend to be missing together (not the correlation of their actual values).
For most of the missing values, we imputed “None” or zero. These were generally descriptive parameters for parts of the house that these particular houses didn’t have. For the more randomly missing variables, we imputed either the mode or the mean. We used the mode for variables like Electrical, shown below, whose distributions have a huge portion of the observations falling into one category. We used the mean for unfinished square feet in the basement because it is a continuous variable.
Caption: A visualization of the missing values (left), a nullity correlation matrix (center), and the distribution of categories in the parameter Electrical, one of many parameters for which we imputed the mode into the missing values, since 80% of the values fell into one category (top right).
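A minimal sketch of this imputation logic, assuming a pandas DataFrame `df` with the standard Ames column names (the column lists below are abbreviated examples, not our full lists):

```python
import pandas as pd

def impute_missing(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()

    # descriptive features for parts of the house that don't exist -> "None" / 0
    none_cols = ["PoolQC", "Fence", "FireplaceQu", "GarageType", "BsmtQual"]   # abbreviated
    zero_cols = ["GarageArea", "GarageCars", "BsmtFullBath", "BsmtHalfBath"]   # abbreviated
    df[none_cols] = df[none_cols].fillna("None")
    df[zero_cols] = df[zero_cols].fillna(0)

    # sporadically missing categoricals -> mode (e.g. Electrical)
    df["Electrical"] = df["Electrical"].fillna(df["Electrical"].mode()[0])

    # continuous variable -> mean (unfinished basement square feet)
    df["BsmtUnfSF"] = df["BsmtUnfSF"].fillna(df["BsmtUnfSF"].mean())
    return df
```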
The last parameter whose missing values we imputed was ‘LotFrontage’, the linear feet of street connected to the property, which was missing for about 17% of the houses. For this variable, we tried two different methods: imputing based on its linear relationship with ‘LotArea’ (sq ft), and imputing the median ‘LotFrontage’ of each missing value’s neighborhood (see the graphs below).
Both the log lot area and the median-by-neighborhood methods improved our model by an equivalent amount. We ultimately chose the latter for our models.
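A one-line sketch of the neighborhood-median approach we kept (assuming `df` holds the combined data with `Neighborhood` and `LotFrontage` columns):

```python
# impute each missing LotFrontage with the median LotFrontage of its neighborhood
df["LotFrontage"] = df.groupby("Neighborhood")["LotFrontage"].transform(
    lambda s: s.fillna(s.median())
)
```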
To start off our EDA, we looked at how housing prices changed over time. One might expect, for example, a dip in prices in 2008 due to the financial crisis that year. Looking below, we can see that slightly fewer homes were sold in 2008 compared to other years. However, on the right, we can see that the median sale price remained fairly constant over the five years. We also looked at the influence of the month of the year (averaged over all five years). You can see that many more houses are sold in the summer months, but again the median house price stays about the same across these months.
Caption: Because we had no values after July 2010, we calculated 2010 sales as if the sales continued on the same trend as before July 2010.
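The numbers behind these charts can be reproduced with a simple groupby (a sketch, assuming a training DataFrame `train` with `YrSold`, `MoSold`, and `SalePrice` columns):

```python
# number of sales and median sale price by year and by month
sales_by_year = train.groupby("YrSold")["SalePrice"].agg(["count", "median"])
sales_by_month = train.groupby("MoSold")["SalePrice"].agg(["count", "median"])
print(sales_by_year)
print(sales_by_month)
```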
Feature Engineering
Categorical Features
Feature engineering is a crucial step in the machine-learning pipeline. Starting with categorical features, we used One-Hot Encoding to perform “binarization” of the categories. This method worked best with linear models, since it doesn’t introduce false relationships between the categories. The second method was label count encoding, which assigns a unique ID to each category based on the number of observations within each category. This method worked best for tree-based models, since it doesn’t increase the dimensionality of the data set. The third method was applied only to features with an ordinal nature, where we assigned a numbered dictionary to preserve the inherent order of the categories.
We decided to keep the combination of One-Hot Encoding and ordinal features because it improved both our linear models and our combined models.
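A sketch of the three encodings (the columns named here are just examples from the Ames data, and the exact mappings are assumptions, not our full scheme):

```python
import pandas as pd

# 1) One-hot encoding (worked best for linear models)
df_onehot = pd.get_dummies(df, columns=["MSZoning", "Neighborhood"])

# 2) Label count encoding (worked best for tree models):
#    give each category an ID based on how many observations it has
counts = df["Neighborhood"].value_counts()
df["Neighborhood_count_enc"] = df["Neighborhood"].map(
    {cat: rank for rank, cat in enumerate(counts.index, start=1)}
)

# 3) Ordinal mapping for features with a natural order (e.g. quality ratings)
quality_map = {"None": 0, "Po": 1, "Fa": 2, "TA": 3, "Gd": 4, "Ex": 5}
df["ExterQual"] = df["ExterQual"].map(quality_map)
```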
Numerical features
For all 19 numerical features with skewness greater than 0.75, we performed a Box-Cox transformation. For our target variable, sale price, which was right-skewed, we performed a log transformation to improve the normality of the distribution. These transformations were especially effective for linear models.
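A sketch of the skewness treatment (the Box-Cox lambda of 0.15 below is a commonly used value for this dataset and is an assumption, not necessarily our exact setting):

```python
import numpy as np
from scipy.stats import skew
from scipy.special import boxcox1p

# find numerical features with skewness > 0.75 and apply a Box-Cox transform
numeric_cols = (
    df.drop(columns=["SalePrice"], errors="ignore")
      .select_dtypes(include=[np.number])
      .columns
)
skewness = df[numeric_cols].apply(lambda s: skew(s.dropna()))
skewed_cols = skewness[skewness > 0.75].index
for col in skewed_cols:
    df[col] = boxcox1p(df[col], 0.15)

# log-transform the right-skewed target (train is the training DataFrame)
y = np.log1p(train["SalePrice"])
```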
Feature interactions and outliers
Area has a big influence on house prices, so we created a new area feature by adding the living area and the basement area. As the chart shows, the new feature correlates well with sale price.
Both marked points represent houses with large areas and low sale prices. Upon further investigation, both of these observations were houses located in the Edwards neighborhood, one of the low-priced neighborhoods. About 50% of their square footage came from their basement area, and both of these houses had unfinished basements. It’s possible that if their basements were finished, they would sell for 50% more, given the increase in usable living area. Based on this information, we decided to remove these two observations to prevent the outliers from interfering with the model results. Removing these two outliers improved the scores of our models.
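A sketch of the new feature and the outlier removal (the exact filtering thresholds below are assumptions used for illustration, chosen to isolate the two marked points):

```python
# total square footage = above-ground living area + basement area
df["TotalSF"] = df["GrLivArea"] + df["TotalBsmtSF"]

# drop the two large, low-priced Edwards houses that stood out in the scatter plot
outliers = train[(train["GrLivArea"] > 4000) & (train["SalePrice"] < 300000)].index
train = train.drop(outliers)
```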
Model Selection
To evaluate the fit of our models we used root mean squared error (RMSE) on the log of the sale price, the metric used by Kaggle to evaluate submitted predictions. As a validation strategy, we did cross-validation with K = 5, using four parts of our data to fit the models and one part to obtain predictions.
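In code, this validation looks roughly like the following (a sketch assuming `X` and `y = log(SalePrice)` as NumPy arrays; the Lasso alpha in the usage example is just a placeholder):

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import KFold, cross_val_score

def rmse_cv(model, X, y):
    """5-fold cross-validated RMSE on the (already log-transformed) sale price."""
    kf = KFold(n_splits=5, shuffle=True, random_state=42)
    scores = cross_val_score(model, X, y, scoring="neg_mean_squared_error", cv=kf)
    return np.sqrt(-scores)

# example usage: rmse_cv(Lasso(alpha=0.0005), X, y).mean()
```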
For our individual models, we trained the following: Multiple Linear Regression, Kernel Ridge Regression (KRR), Lasso, and Elastic Net (ENet) as linear models, then Random Forest, Gboost, and XGBoost as tree-based models. Finally, we combined our individual models by stacking and ensembling. First we tried the simplest stacking approach: averaging the base models. Then we added a “meta” model. In this approach, we used the out-of-fold predictions of the base models to train our meta-model (in this case Lasso). Next, we added XGBoost and Random Forest to the previous stacked averaged models through a weighted average, calculating the weights with a Multiple Linear Regression.
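A minimal sketch of the out-of-fold stacking idea (base models and meta-model are passed in; this illustrates the approach, not our exact implementation, and assumes NumPy arrays for `X` and `y`):

```python
import numpy as np
from sklearn.base import clone
from sklearn.model_selection import KFold

def fit_stacked(base_models, meta_model, X, y, n_splits=5):
    """Train base models out-of-fold and fit a meta-model on their predictions."""
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=42)
    oof = np.zeros((X.shape[0], len(base_models)))   # out-of-fold predictions
    fitted = [[] for _ in base_models]

    for i, model in enumerate(base_models):
        for train_idx, val_idx in kf.split(X):
            m = clone(model).fit(X[train_idx], y[train_idx])
            oof[val_idx, i] = m.predict(X[val_idx])
            fitted[i].append(m)

    meta = clone(meta_model).fit(oof, y)  # meta-model learns how to combine the base models
    return fitted, meta

def predict_stacked(fitted, meta, X):
    base_preds = np.column_stack([
        np.mean([m.predict(X) for m in models], axis=0) for models in fitted
    ])
    return meta.predict(base_preds)
```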
Feature Importance and Coefficients
Considering that regularized linear models give less weight to less important features by assigning them lower coefficients, and tree-based models do so by giving them a lower splitting preference, we decided to use all the features to train our models, allowing them to select the most predictive features themselves. We then noticed that while adding features sometimes improved our scores, removing features usually reduced them. We therefore decided to keep the previously described feature, “TotalSF”.
Interestingly, exploring the feature importance of our Random Forest, we could see that Overall Quality (“OverallQual”) has the highest weight by far, even though it is not the last feature to shrink its coefficient to 0 in a Lasso model as lambda increases. The same holds for other features, such as “TotalSF” (the feature we created) and “GarageCars,” confirming that tree models and linear models have different preferences when it comes to feature importance, which is why we did not use any of these models to “pre-screen” features for the other models.
In the figure below, the left panel shows the top 20 features ordered by feature importance in our Random Forest model, the middle panel shows the coefficients of the top 20 features for predicting sale price in our Lasso model, and the right panel shows the coefficients of the basement-related features as lambda increases in our Lasso model. The red line is our chosen value of lambda, selected by Bayesian optimization.
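The two left panels can be reproduced along these lines (a sketch assuming fitted `rf` and `lasso` models and a `feature_names` list matching the columns of the design matrix):

```python
import pandas as pd

# Random Forest: top 20 features by impurity-based importance
rf_importance = pd.Series(rf.feature_importances_, index=feature_names)
print(rf_importance.sort_values(ascending=False).head(20))

# Lasso: top 20 features by absolute coefficient
lasso_coefs = pd.Series(lasso.coef_, index=feature_names)
top20 = lasso_coefs.abs().sort_values(ascending=False).head(20).index
print(lasso_coefs[top20])
```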
Scores Summary
Overall, our models had Kaggle RMSE scores between 0.11697 and 0.12425. The highest cross-validation RMSE (i.e., the worst score) came from Random Forest, followed by Multiple Linear Regression. To our surprise, our linear models performed very well: Lasso delivered the best score among the individual models, surpassing the boosted tree models, Gboost and XGBoost. A simple average of the predictions of Gboost, ENet, and Lasso improved on the individual models, taking us from 0.12109 to 0.11697. Adding ENet as a meta-model on top of the base models Lasso, Gboost, and KRR gave us a slightly worse score (0.11702) than simple averaging. Finally, adding two extra models (Random Forest and XGBoost) to this stacked model worsened the score to 0.11858. Interestingly, our best-performing combined model was the simple average of predictions, which gave us our best overall score (0.11697, top 14% of the public leaderboard as of this writing).
Conclusions
To summarize, this project demonstrated the strength of linear models over tree-based models when the target variable is linearly related to the predictors. We have shown how one-hot encoding, ordinal encoding, Box-Cox transformation, and parameter tuning improved our linear models. As a result, Lasso was our best-scoring individual model. A simple average of the predictions of our best simple models gave us our best score, at the cost of some model interpretability.
Depending on the specific needs of the project, simple or combined models have different advantages. For a Kaggle competition, where a 0.0001 score improvement is a great achievement, combined models work very well, taking advantage of the strengths of different models. However, when we build a model both to make predictions and to learn about the response variable, linear models offer extraordinary predictive power that can still be explained and understood.