Predicting House Prices Using Machine Learning
Introduction
Our machine learning project was a Kaggle competition for predicting home prices in Ames, Iowa. The training dataset had 1,460 observations, each with 79 features, for homes sold in Ames between 2006 and 2010. Our task was to predict sale prices for the test dataset of 1,459 houses in Ames, also sold between 2006 and 2010.
While achieving a high level of prediction accuracy was important for placing well in the Kaggle rankings, our team decided to also prioritize the interpretability of our results. Instead of focusing on complex machine learning models that might produce the best fit, we framed the problem as follows: which models would give us both high prediction accuracy and highlight the key factors home buyers consider when deciding how much to pay for a home? We believe that framing the problem this way makes our models more generalizable, so we could use them to derive insights in other instances with a similar context.
Our team collaborated on the data exploration and data preparation stages of the project, after which we branched off and individually explored an arsenal of machine learning models best suited to our project goals. As a result, every team member got hands-on experience developing machine learning models, and we also had at our disposal a selection of models to choose from, each with a different degree of accuracy and interpretability. Our chosen model was an ensemble of a Gradient Boosting model (55%) and a Lasso model (45%), which resulted in an RMSE (root mean squared error) of 12.14% on the test dataset and placed us in the top 20% of the Kaggle leaderboard (at the time of submission).
Our blog post is divided into four main parts: Exploratory Data Analysis & Feature Engineering, Model Creation & Deployment, Conclusions & Future Work, and Relevant Links.
Exploratory Data Analysis & Feature Engineering
The data contained a significant amount of missing values, though most were intentional and simple to address. Several instances, however, posed trickier challenges. Aiming for greater accuracy, we noticed that a numeric column, LotFrontage (the footage of the lot that touches the road rather than another property), varied greatly depending on the shape and configuration of the lot (is it located in a cul-de-sac? Or perhaps at the intersection of two roads, with two sides facing the street?). We therefore took the median LotFrontage for each shape and configuration combination and imputed the missing values accordingly.
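A minimal sketch of this group-wise imputation with pandas, assuming the standard Kaggle train.csv and the LotShape and LotConfig columns from the data dictionary (illustrative, not our exact code):

```python
import pandas as pd

# Assumed column names from the Ames data dictionary: LotFrontage, LotShape, LotConfig.
train = pd.read_csv("train.csv")

# Median LotFrontage for each (LotShape, LotConfig) combination.
group_medians = train.groupby(["LotShape", "LotConfig"])["LotFrontage"].transform("median")

# Fill missing LotFrontage values with the median of their shape/configuration group,
# falling back to the overall median if a group has no observed values.
train["LotFrontage"] = train["LotFrontage"].fillna(group_medians)
train["LotFrontage"] = train["LotFrontage"].fillna(train["LotFrontage"].median())
```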
For feature engineering, we took a very proactive, manual approach. We built heat maps to compare categorical variables. We broke house sale prices into deciles and compared how many instances of each category fell into each decile to identify trends. Consider the following example: as you can see below, houses with no veneer are more likely to fall in the lower price deciles, while houses with stone or brick veneers are more likely to fall into the higher deciles. Based on this, we created one dummy variable indicating whether a house had no veneer and another indicating whether it had stone or brick veneer. Because this is a trend rather than a guarantee, we determined it was safe to use and posed a comparatively insignificant risk of data leakage.
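A sketch of how this decile comparison and the resulting dummy variables could be built, assuming the MasVnrType column and its data-dictionary values (illustrative, not our exact code):

```python
import pandas as pd

train = pd.read_csv("train.csv")

# pandas reads the literal string "None" as NaN by default, so treat missing as "None" here.
veneer = train["MasVnrType"].fillna("None")

# Break sale prices into deciles and cross-tabulate against veneer type
# to see which categories concentrate in the lower or upper deciles.
train["PriceDecile"] = pd.qcut(train["SalePrice"], 10, labels=False, duplicates="drop")
print(pd.crosstab(veneer, train["PriceDecile"]))

# Dummy variables based on the observed trend: no veneer vs. stone/brick veneer.
train["NoVeneer"] = (veneer == "None").astype(int)
train["StoneOrBrickVeneer"] = veneer.isin(["Stone", "BrkFace", "BrkCmn"]).astype(int)
```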
Model Creation & Deployment
As a starting point, we built a multiple linear regression (MLR) model because of its interpretability. This model also served as a baseline against which to compare the other models' performance. Analyzing the residual plot of the MLR model, we saw that outliers were present in the dataset. Based on the plot below, we removed two high-leverage outliers that were influencing the model's results.
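A sketch of how the baseline MLR and its residual and leverage diagnostics might look with statsmodels; the predictor subset and the flagging thresholds are illustrative assumptions, not our exact rule:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

train = pd.read_csv("train.csv")

# Baseline MLR on a small, illustrative subset of numeric predictors; log-transformed target.
X = sm.add_constant(train[["GrLivArea", "OverallQual", "YearBuilt"]])
y = np.log(train["SalePrice"])

ols = sm.OLS(y, X).fit()

# Inspect leverage and studentized residuals; points that are extreme on both
# are candidates for removal before refitting.
influence = ols.get_influence()
leverage = influence.hat_matrix_diag
student_resid = influence.resid_studentized_internal
suspects = train.index[(leverage > 3 * X.shape[1] / len(X)) & (np.abs(student_resid) > 3)]
print(suspects)
```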
After the MLR, we implemented different regression models to reach better results. We tried Random Forest, XGBoost, Ridge, and ElasticNet; however, none beat the results of the Lasso and Gradient Boosting models. All models were trained on the original dummified dataset, the engineered dataset, and a dataset with higher-power terms for some continuous features. For the Lasso model, we had to standardize all variables so that the absolute scale of a variable would not influence the model. For tuning the hyperparameters, we used grid search with k-fold cross-validation (see the sketch after the table). The results for both models are summarized below:
| Model | R² | RMSE | Variables used (Total 323) |
|-------|-----|-------|----------------------------|
| Lasso | 0.94 | 0.115 | 189 |
| GBM | 0.92 | 0.11 | 36 |
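As an illustration of the tuning setup described above (standardization, grid search, k-fold cross-validation), here is a minimal scikit-learn sketch for the Lasso; the parameter grid and fold count are assumptions, not our exact settings:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

train = pd.read_csv("train.csv")

# Dummify categoricals; missing-value handling is simplified here.
X = pd.get_dummies(train.drop(columns=["Id", "SalePrice"])).fillna(0)
y = np.log(train["SalePrice"])  # Kaggle scores RMSE on the log of the sale price

# Standardize inside the pipeline so scaling is re-fit on each CV fold.
pipeline = make_pipeline(StandardScaler(), Lasso(max_iter=10000))

param_grid = {"lasso__alpha": np.logspace(-4, -1, 20)}
search = GridSearchCV(pipeline, param_grid, cv=5, scoring="neg_root_mean_squared_error")
search.fit(X, y)
print(search.best_params_, -search.best_score_)
```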
Since the two best models perform feature selection, we can look at the most important variables for each one. There is an overlap of variables that are key to both models, even though the datasets used were different. We observe that the living area, lot area, year built, and overall condition and quality have a great influence on the models' results.
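Continuing the sketch above (reusing X, y, and the fitted search), one way to compare what each model relies on is to inspect the nonzero Lasso coefficients and the GBM feature importances; the untuned GradientBoostingRegressor here is purely for illustration:

```python
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor

# Fit an illustrative GBM on the same dummified columns used for the Lasso.
gbm = GradientBoostingRegressor(random_state=0).fit(X, y)

# Nonzero Lasso coefficients, ranked by magnitude.
lasso = search.best_estimator_.named_steps["lasso"]
lasso_importance = (
    pd.Series(lasso.coef_, index=X.columns)
    .loc[lambda s: s != 0]
    .abs()
    .sort_values(ascending=False)
)

# GBM feature importances, ranked.
gbm_importance = pd.Series(gbm.feature_importances_, index=X.columns).sort_values(ascending=False)

# Variables that matter to both models (e.g., living area, lot area, year built, quality).
print(set(lasso_importance.head(20).index) & set(gbm_importance.head(20).index))
```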
Considering that some features are used in only one of the models, we realized that by ensembling the two we could get better performance. To do so, we used a weighted average of their predictions as our final result. The chosen weights were 55% for the GBM output and 45% for the Lasso.
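The final ensemble is then just a weighted average of the two models' predictions; a minimal sketch, assuming both models predict on the log sale price scale (consistent with the RMSE values above):

```python
import numpy as np

def ensemble_log_predictions(gbm_log_pred: np.ndarray, lasso_log_pred: np.ndarray) -> np.ndarray:
    """Blend the two models' log-price predictions (55% GBM, 45% Lasso) and back-transform."""
    return np.exp(0.55 * gbm_log_pred + 0.45 * lasso_log_pred)

# Example usage with the models sketched above, on a dummified test matrix X_test:
# final_prices = ensemble_log_predictions(gbm.predict(X_test), search.predict(X_test))
```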
Conclusions & Future Work
To conclude, we cover the following topics: feature engineering and feature importance; the iterative data process flow; K-means and categorical variables; and, finally, next steps.
On feature engineering and feature importance: manual feature engineering is good for building a better understanding of the dataset, but the original features can provide valid information as well. As a next step, exploring the original and engineered features all together and letting the model determine feature importance could further reduce the error.
The data process flow was strongly iterative. This iterative approach involves starting with the most basic aspects of the problem, such as reading the feature definitions, followed by feature engineering and then model training and error evaluation. The error evaluation often gave us ideas for modifying our feature engineering, and even for revisiting our understanding of the original features. We could then train the model again and evaluate the error anew; this cycle became the dynamic through which we found a better model.
To explore the categorical features, we had the idea of creating clusters with unsupervised techniques. Our first attempt used K-means, but there is a specific algorithm for clustering categorical features, called K-modes. K-modes gave us clusters, but they did not perform well in feature selection.
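A minimal sketch of the K-modes step, assuming the third-party kmodes package and clustering on the categorical columns of the training set (the number of clusters is illustrative):

```python
import pandas as pd
from kmodes.kmodes import KModes

train = pd.read_csv("train.csv")

# Cluster only the categorical features; K-modes uses modes rather than means.
categorical = train.select_dtypes(include="object").fillna("Missing")

km = KModes(n_clusters=5, init="Huang", n_init=5, random_state=0)
clusters = km.fit_predict(categorical)

# The cluster label can then be tried as an additional categorical feature.
train["CatCluster"] = clusters
```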
The next steps would involve a "hybrid approach": feeding the models both the engineered features and the original ones, and letting the model select the features that matter most for reducing the error. We could also try a different ensembling technique, such as stacking the 4-5 best-performing models, to see if the results improve. Since tree-based models have many hyperparameters, we could use Bayesian optimization instead of grid search to get a better tuning result. Finally, we would investigate whether external data sources (e.g., interest rates, employment data, income) can increase the explanatory power of the models.
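For the stacking idea, scikit-learn's StackingRegressor would be a natural starting point; the base learners and meta-learner below are assumptions for illustration, not a tested configuration:

```python
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor, StackingRegressor
from sklearn.linear_model import Lasso, Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Base learners roughly matching the models we tried; the meta-learner blends them
# using out-of-fold predictions instead of fixed weights.
stack = StackingRegressor(
    estimators=[
        ("lasso", make_pipeline(StandardScaler(), Lasso(alpha=0.001, max_iter=10000))),
        ("gbm", GradientBoostingRegressor(random_state=0)),
        ("rf", RandomForestRegressor(n_estimators=300, random_state=0)),
    ],
    final_estimator=Ridge(),
    cv=5,
)
# stack.fit(X, y); stack.predict(X_test)
```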