Using Data to Analyze Zillow Zestimate Competition
Data Science Introduction
For most people, buying a home is the single largest investment they will make during their life. Because such an investment affects every aspect of one’s life, it is critical for future homeowners to be fully informed, based on data, about the house they intend to buy. Luckily for all future homeowners, companies like Zillow have tried to address the problem of information asymmetry in the real estate industry (i.e. the discrepancy in knowledge about a product between the seller and the buyer) by making estimated house prices (‘Zestimates’) for more than 100 million US properties readily available on the internet.
However, not everything in the garden is rosy: as of today, Zillow’s Zestimate still has a median error rate of 4.3%. While this means that half of the homes are sold within 4.3% of Zillow’s estimated house price, it also means that the other half are sold for prices that can deviate greatly from these estimates. Inaccurate estimates can lead to high costs for both the buying and the selling side (as they result in either overpayment or undervaluation) and have even resulted in a class-action case against Zillow by homeowners who charged that the Zestimate sharply undervalued their properties.
While the lawsuit was dismissed in court, it still points out the urgency to continuously improve the accuracy of the Zestimate.
Luckily, Zillow is very eager to address this: in their recent Kaggle competition the company has put $1 million on the line for those who can improve the accuracy of their Zestimate feature, measured by the log-error between the Zestimate and the actual sale price.
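To make the target concrete: the competition defines the log-error as the natural log of the Zestimate minus the natural log of the sale price. The sketch below is only an illustration; the function name and the example prices are made up and do not come from the competition data.

```python
import math

def log_error(zestimate: float, sale_price: float) -> float:
    """logerror = log(Zestimate) - log(SalePrice).

    A positive value means the Zestimate was above the sale price,
    a negative value means it was below.
    """
    return math.log(zestimate) - math.log(sale_price)

# Hypothetical example: a home estimated at $520,000 that sold for $500,000.
example = log_error(520_000, 500_000)
print(round(example, 4))
```

Because the error is measured on a log scale, a 4% overestimate gives roughly the same logerror for a cheap house as for an expensive one.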
This blog post describes team Alpha’s journey on the prediction of the accuracy of the Zestimate, and on how we were able to beat 70% of the teams in the Zillow Kaggle Competition within a mere week.
To ensure a smooth process when attacking Zillow’s Kaggle Competition, we applied the Cross Industry Standard Process for Data Mining (CRISP-DM). In short, the CRISP-DM framework splits a project into six distinct phases (Business Understanding, Data Understanding, Data Preparation, Modeling, Evaluation, and Deployment). Our blog is structured around this framework.
Zillow provided us with two datasets. The first (‘properties’) contains information on all properties listed on Zillow in 2016 in three California counties (Orange, Los Angeles, and Ventura); it has roughly 3 million observations. The second (‘transactions’) contains all transactions in these three counties before October 15, 2016, plus some of the transactions after October 15, 2016.
As part of the data understanding phase, we performed extensive exploratory data analysis (EDA) on the datasets. The first step of the EDA was to obtain an understanding of the missingness (included in the graph below). Instead of simply discarding missing values, we used density plots for every variable in the dataset to evaluate whether the missingness in that variable contained information about the logerror (on the assumption that the logerror may be caused by the missingness). We used this information in our data cleaning process as discussed below.
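As a rough numeric counterpart to the density plots described above, one can compare the mean absolute logerror of rows where a variable is missing against rows where it is present. This is a minimal sketch in plain Python, not the code we used; the toy rows, the column names, and the helper functions are all hypothetical.

```python
from statistics import mean

# Hypothetical toy rows; None marks a missing value.
rows = [
    {"poolcnt": None, "garagecarcnt": 2, "logerror": 0.08},
    {"poolcnt": 1, "garagecarcnt": None, "logerror": 0.01},
    {"poolcnt": None, "garagecarcnt": 1, "logerror": -0.06},
    {"poolcnt": 1, "garagecarcnt": 2, "logerror": 0.02},
]

def missing_fraction(rows, column):
    """Share of rows in which `column` is missing."""
    return sum(r[column] is None for r in rows) / len(rows)

def mean_abs_logerror_by_missingness(rows, column):
    """Mean |logerror| for rows where `column` is missing vs. present."""
    missing = [abs(r["logerror"]) for r in rows if r[column] is None]
    present = [abs(r["logerror"]) for r in rows if r[column] is not None]
    return (mean(missing) if missing else None,
            mean(present) if present else None)

for col in ("poolcnt", "garagecarcnt"):
    print(col, missing_fraction(rows, col),
          mean_abs_logerror_by_missingness(rows, col))
```

A clear gap between the two means for a column would suggest that its missingness carries information about the logerror and should not simply be discarded.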
The EDA also pointed out that there was little to no (linear) correlation between the various independent variables and the dependent variable (logerror). This indicated that the logerror is not caused by one specific variable, but that we should rather look at parts of variables (e.g. during a specific timeframe) or for features that have not yet been included in the datasets that could potentially cause the logerror. An example of such a feature is the market sentiment. In 2016, real estate prices in California exceeded the pre-financial crisis levels.
High prices and low supply of real estate could result in a ‘heated’ market, where future homeowners are overbidding (potentially causing a discrepancy between the estimated value of a house and the sales price). When looking at time series of the logerror, there seem to be some indications for market sentiments, as the logerror increases over certain time periods. Given the limited timeframe of the data, it is not possible to generalize such a claim, but it is something we took into account in our feature engineering and data cleaning processes.
As mentioned before, Zillow provided two datasets (properties and transactions) for us to predict the log-error between the Zestimate and the actual sale price. Before building prediction models, we first needed clean, usable data to feed to the models for training. However, as the EDA showed, many columns contain a considerable amount of missingness. The first task in data cleaning was therefore to deal with those incomplete columns in the properties data.
Imputation would normally be the obvious way to deal with missing values. However, there is one tricky point here: we do not know how Zillow’s models produce the Zestimate. It is highly likely that the inaccuracy of the Zestimate is somewhat related to the missingness. In other words, the value we need to predict, the logerror, may itself be introduced by those missing values.
It would therefore be interesting to train the model while preserving the NA information in the data to some extent. However, we also needed an imputed version of the data, which is the more conventional input for model training and what is commonly done in Kaggle machine learning competitions.
So we took two approaches to cleaning the data: the first tries to preserve the missingness information, while the second gets rid of the NAs by imputation.
- First approach:
  - Preserve all columns and make some reasonable imputations, e.g. area / num_garage, area / num_pool, etc.;
  - Set the remaining NAs to zero;
  - Shrink the number of levels by separating categorical columns so that they fit different models, i.e. rpart and randomForest.
- Second approach:
  - Delete columns with more than 75% missingness;
  - Remove duplicated and highly correlated columns, which may cause collinearity;
  - Scale the geographical information;
  - Remove property observations that are entirely NA (11,437 out of 2,985,217 property observations).
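Two steps of the second approach, dropping sparse columns and dropping rows that carry no information, can be sketched as follows. This is an illustration in plain Python rather than our actual pipeline; the helper names, the toy rows, and the `parcelid` identifier column are assumptions.

```python
def drop_sparse_columns(rows, threshold=0.75):
    """Drop columns whose share of missing values (None) exceeds `threshold`."""
    columns = list(rows[0].keys())
    n = len(rows)
    keep = [c for c in columns
            if sum(r[c] is None for r in rows) / n <= threshold]
    return [{c: r[c] for c in keep} for r in rows]

def drop_empty_rows(rows, id_column="parcelid"):
    """Drop rows in which every column except the identifier is missing."""
    return [r for r in rows
            if any(v is not None for c, v in r.items() if c != id_column)]

# Hypothetical toy data: poolcnt is missing in 80% of the rows.
rows = [
    {"parcelid": 1, "area": 1500, "poolcnt": None},
    {"parcelid": 2, "area": 2100, "poolcnt": None},
    {"parcelid": 3, "area": None, "poolcnt": None},
    {"parcelid": 4, "area": 1800, "poolcnt": None},
    {"parcelid": 5, "area": 1200, "poolcnt": 1},
]
cleaned = drop_empty_rows(drop_sparse_columns(rows))
print(cleaned)
```

The order matters in this sketch: once the sparse column is gone, the third row has no remaining information and is removed as well.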
Once the business goal and the data were understood, we set up a quick framework for feature engineering. The process followed an agile, iterative approach so that we could quickly validate whether each feature engineering step was successful.
Our first step was to examine outliers using MAD (median absolute deviation) methods, percentile methods, and absolute-value methods. Removing outliers lets the remaining data show its underlying correlations more clearly. Removing logerror outliers successfully improved our leaderboard ranking, whereas removing outliers in the other features was less successful.
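One common form of the MAD method is the modified z-score filter sketched below. The cutoff of 3.5 and the 0.6745 scaling constant are conventional defaults chosen for illustration, not the values we tuned, and the function name and sample logerrors are hypothetical.

```python
from statistics import median

def mad_filter(values, cutoff=3.5):
    """Keep values whose modified z-score (based on the MAD) is within `cutoff`."""
    med = median(values)
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        return list(values)  # no spread to measure against, keep everything
    # 0.6745 rescales the MAD so it is comparable to a standard deviation
    # for normally distributed data.
    return [v for v in values if abs(0.6745 * (v - med) / mad) <= cutoff]

# Hypothetical logerrors with one extreme value.
logerrors = [0.01, -0.02, 0.03, 0.0, -0.01, 0.02, 2.5]
filtered = mad_filter(logerrors)
print(filtered)
```

Unlike a filter based on the mean and standard deviation, the median-based version is not pulled towards the very outliers it is trying to detect.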
The datasets provided by Zillow are relatively intuitive, so the second step was to shrink the number of features using human judgment. For instance, seven columns describe different types of room, and at the same time the data suffers from serious missingness. A weighted combination of such features handles missingness better than imputation: combining similar columns results in less missingness and more effective information for the models.
The third important step was to combine similar levels of categorical columns. For instance, to highlight the differences among the 3 major markets in the LA real estate market, the 3 levels were expanded into 3 separate features. If an important categorical column has too many levels, it helps to combine some of the similar ones.
With these 3 basic steps in place, an important way to evaluate the feature engineering is to set up cross-validation for multiple models, such as linear, random forest, and xgboost models. This lets you check how different models react to a change in the data and find the most efficient way to improve your leaderboard ranking. This is similar to the machine learning concept of greedy methods, which always take the step that yields the largest improvement. The drawback of this approach is that you might get trapped in a local optimum.
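The evaluation loop described above can be sketched as a small k-fold routine that scores several candidate models on the same data. The two "models" here are deliberately trivial stand-ins (they predict a constant), the folds are contiguous and unshuffled for simplicity, and all names and numbers are hypothetical.

```python
from statistics import mean, median

def k_fold_mae(fit, features, targets, k=5):
    """Average MAE of `fit` over k contiguous folds.

    `fit(train_X, train_y)` must return a function that maps one
    feature row to a prediction.
    """
    n = len(targets)
    fold_scores = []
    for fold in range(k):
        test_idx = set(range(fold * n // k, (fold + 1) * n // k))
        train_X = [x for i, x in enumerate(features) if i not in test_idx]
        train_y = [y for i, y in enumerate(targets) if i not in test_idx]
        predict = fit(train_X, train_y)
        errors = [abs(targets[i] - predict(features[i])) for i in sorted(test_idx)]
        fold_scores.append(mean(errors))
    return mean(fold_scores)

# Stand-in models: both ignore the features and predict a constant.
def fit_mean(train_X, train_y):
    constant = mean(train_y)
    return lambda row: constant

def fit_median(train_X, train_y):
    constant = median(train_y)
    return lambda row: constant

X = [[i] for i in range(10)]
y = [0.01, -0.02, 0.03, 0.0, -0.01, 0.02, 0.9, 0.01, -0.03, 0.02]

candidates = {"mean": fit_mean, "median": fit_median}
scores = {name: k_fold_mae(fit, X, y, k=5) for name, fit in candidates.items()}
best = min(scores, key=scores.get)  # greedy choice: keep whatever scores best now
print(scores, best)
```

Rerunning the same loop after each feature engineering change gives a like-for-like comparison before spending a leaderboard submission.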
We have tested the following models:
- Ridge Regression
- Lasso Regression
- Decision tree (rpart)
- Random forest
- Gradient Boosting Machine (gbm)
- XGBoost
- ElasticNet
As a baseline for evaluating the different models, we used the MAE obtained when the mean of the logerror (over all observations) is predicted for every observation, i.e. the same value regardless of the features.
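The baseline described above amounts to a constant predictor. A minimal sketch, with hypothetical function names and made-up logerrors:

```python
from statistics import mean

def mae(actual, predicted):
    """Mean absolute error between two equally long sequences."""
    return mean([abs(a - p) for a, p in zip(actual, predicted)])

def baseline_mae(train_logerrors, test_logerrors):
    """MAE when the training mean is predicted for every test observation."""
    constant = mean(train_logerrors)
    return mae(test_logerrors, [constant] * len(test_logerrors))

# Hypothetical values for illustration only.
train = [0.02, -0.01, 0.05, 0.0]
test = [0.03, -0.02]
score = baseline_mae(train, test)
print(score)
```

Any model whose MAE does not fall below this number has learned nothing beyond the average logerror.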
The results of Ridge and Lasso regression were not promising. This was not surprising, because the EDA indicated that the logerror has no linear dependency on any feature. Perhaps a better result would have been possible with more feature engineering, but we decided to move on to other models.
For the decision trees, we applied the R package rpart and obtained no useful result, i.e. rpart was not able to find any structure in the data that could predict the logerror. The best fit yielded a tree with a single node that returned the mean logerror. This showed us how hard the task is.
The random forest was the first step forward, giving us a model with some predictive power, although the result was not overwhelming. It gave us a Kaggle rank of 1581 (at the time of submission).
We got our best results with the Gradient Boosting Machine and XGBoost; these approaches are described in more detail below.
ElasticNet has been a successful model for the Kaggle competition. ElasticNet is a linear regression model that combines L1 and L2 regularization. This combination allows for learning a sparse model in which few of the weights are non-zero, like Lasso, while still maintaining the regularization properties of Ridge. ElasticNet is useful when multiple features are correlated with one another.
Gradient boosting is a machine learning technique for regression and classification problems, which produces a prediction model in the form of an ensemble of weak prediction models, typically decision trees. It builds the model in a stage-wise fashion like other boosting methods do, and it generalizes them by allowing optimization of an arbitrary differentiable loss function.
The reason we wanted to try xgboost in this project is that it is fast, which makes it much easier than with other tree-based methods to run cross-validation within a few hours rather than a few days. However, before training the model, there is one thing that can be done to potentially enhance the prediction performance.
As shown in the figure below, the logerror in the original transaction data is tightly concentrated around 0, which is very reasonable. However, this extreme concentration makes it hard for the training algorithm to detect differences in logerror between properties within such a limited range.
Many readers may be familiar with the kernel trick in the support vector machine (SVM). With the kernel trick, one can find a hyperplane in a higher-dimensional space for a regression or classification problem by extending the distances between data points through projection into that space (the figure after). For example, in the Gaussian kernel:
$$K(x, x') = \exp\left(-\frac{\lVert x - x' \rVert^2}{2\sigma^2}\right)$$
σ, the kernel width, is a parameter that controls the sparsity between the mapped points in the higher-dimensional space. Generally, a smaller σ indicates a bigger distance, while a bigger σ indicates a smaller distance between the projected data points.
In this case, however, we are not applying a kernel trick to the features. Instead, we stretch the logerrors (the labels) with a sigmoid function to move them further apart from each other, which could make way for better prediction.
We passed the logerrors through a sigmoid function, which stretches the data around 0 while at the same time saturating the influence of logerrors that are far from 0. The figure before shows a comparison of different sigmoid transformations of the logerrors. A bigger p indicates a wider stretched range of the logerrors, which makes training easier for the algorithm.
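The exact transformation is not spelled out above, so the sketch below assumes a plain logistic curve with steepness p, together with the inverse needed to map predictions back to the logerror scale. The function names and the default value of p are hypothetical.

```python
import math

def stretch(logerror, p=10.0):
    """Map a logerror into (0, 1) with a logistic curve of steepness p.

    Assumed form: 1 / (1 + exp(-p * logerror)). Values near 0 are spread
    apart, values far from 0 saturate towards 0 or 1.
    """
    return 1.0 / (1.0 + math.exp(-p * logerror))

def unstretch(value, p=10.0):
    """Inverse of `stretch`: recover the logerror from a value in (0, 1)."""
    return math.log(value / (1.0 - value)) / p

# Two nearby logerrors are pushed further apart by a larger p.
gap_small_p = stretch(0.02, p=5.0) - stretch(0.01, p=5.0)
gap_large_p = stretch(0.02, p=20.0) - stretch(0.01, p=20.0)
print(gap_small_p, gap_large_p)
```

A model trained on the stretched labels predicts on the stretched scale, so its output has to be passed through the inverse before computing the MAE or writing a submission.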
In the xgboost training process, we used 10-fold cross-validation with 100 search iterations to find the best xgboost parameters in each testing fold. With the trained model, we made predictions for all 9 million properties.
Data Analysis of the Prediction Error
While trying many models with different settings, we started to look at the distribution of the prediction error, and it turned out that all attempts showed similarities:
Placing the logerror on the x-axis and the absolute prediction error on the y-axis gives a V-shaped diagram. This indicates that the models tend to predict values around the mean logerror, which is the best they can find to minimize the MAE. The figure above shows the result from a random forest, but the other models look similar, mostly with an even clearer V-shape.
Plotting the same information in another way yields the following figure:
The y-axis shows the absolute prediction error and the x-axis shows the logerror rank, sorted by absolute value. On the left side are the logerrors with the highest absolute values. This diagram shows that the observations with the highest logerror produce the highest prediction errors. This could be interpreted as normal, because the high logerrors can be viewed as outliers, but on the other hand, these observations have a high impact on the final MAE. The red vertical line separates the highest 300 absolute logerrors from the rest.
The overall MAE, in this case, was 0.06694 (on a test set with 18055 observations). Without the highest 300 logerrors, the MAE would have been 0.05428, which is a huge difference in the Zillow case.
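The comparison above can be reproduced in outline by recomputing the MAE after dropping the k observations with the largest absolute logerror. The sketch uses a handful of made-up values and a hypothetical helper name; it does not reproduce the figures quoted above.

```python
from statistics import mean

def mae_excluding_largest(actual, predicted, k):
    """MAE after dropping the k observations with the largest |actual| value."""
    order = sorted(range(len(actual)), key=lambda i: abs(actual[i]), reverse=True)
    kept = order[k:]
    return mean([abs(actual[i] - predicted[i]) for i in kept])

# Hypothetical logerrors with two extreme cases and a near-constant prediction.
actual = [0.01, -0.02, 1.5, 0.03, -2.0, 0.0]
predicted = [0.01] * len(actual)

full = mae_excluding_largest(actual, predicted, k=0)
trimmed = mae_excluding_largest(actual, predicted, k=2)
print(full, trimmed)
```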
So it can be concluded that observations with a high logerror in particular deserve more attention. If you succeed in predicting them properly, your overall MAE will decrease significantly.
After the project deadline, we investigated the data a little more and found an interesting characteristic by counting the number of missing values (NAs) per row:
The mean absolute logerror increases if the number of missing values in an observation exceeds 28. This may serve as a delimiter for a separate treatment of the high-logerror cases.
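Counting NAs per row and grouping the mean absolute logerror by that count can be sketched as follows. The rows, column names, and function name are hypothetical, and the toy data is far too small to show the threshold of 28 mentioned above.

```python
from collections import defaultdict
from statistics import mean

def mean_abs_logerror_by_na_count(rows, target="logerror"):
    """Group rows by their number of missing features and average |logerror|."""
    groups = defaultdict(list)
    for row in rows:
        na_count = sum(v is None for c, v in row.items() if c != target)
        groups[na_count].append(abs(row[target]))
    return {count: mean(vals) for count, vals in sorted(groups.items())}

# Hypothetical toy rows; None marks a missing value.
rows = [
    {"a": 1, "b": None, "c": None, "logerror": 0.10},
    {"a": 1, "b": 2, "c": None, "logerror": -0.02},
    {"a": None, "b": None, "c": 1, "logerror": 0.06},
    {"a": 1, "b": 2, "c": 3, "logerror": 0.01},
]
summary = mean_abs_logerror_by_na_count(rows)
print(summary)
```

The NA count per row can also be added as a feature in its own right, which keeps the missingness information even after imputation.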
Evaluating Zillow’s Kaggle Competition, our main conclusion is that predicting an already accurate predictor is quite a challenging job. In order to make a real difference in the accuracy of the predictions, we established that for all our models, feature engineering was key. And while GBM gave us the best results, we noted that XGBoost was generally the fastest and most practical model to use in terms of dealing with outliers and cross-validation.
As our analyses consistently showed that only a very small portion of the data is responsible for the largest logerrors, our main takeaway for future work is to focus on these observations to further boost our results.
The skills I demoed here can be learned through taking Data Science with Machine Learning bootcamp with NYC Data Science Academy.
Team members: Nathalie Cohen, Stefan Hainzer, Summer Sai Sun, Yiming Wu