Kaggle Competition: Predicting Realty Price Fluctuations in Russia’s Volatile Economy
The goal of the Sberbank Russian Housing Market Kaggle competition was to predict realty price fluctuations in Russia for Sberbank, Russia’s oldest and largest bank. Sberbank provided Kagglers with a rich dataset that included housing data and macroeconomic patterns (a total of 200 variables and 30,000 observations).
The Exploratory Data Analysis (EDA) for this project was not as extensive as for our other projects: many Kagglers shared their exploratory analyses on the platform, sparing us the need to do an extensive EDA of our own.
Quick overview of the dataset
Aside from the high number of variables, the main issue with the dataset was the missingness.
The matrix below helps visualize patterns in data completion. The columns represent the variables in the dataset, while the rows represent the observations.
Except for full square footage (the first column in the matrix), none of the variables were fully populated, as denoted by the white spaces. It is also hard to find observations that have values for each variable. In fact, less than 20% of the observations were completely populated.
The dendrogram below gives a fuller picture of variable completion: it groups variables by their nullity correlation, i.e. how strongly the presence or absence of one variable predicts that of another.
Cluster leaves that are grouped together at a distance of zero fully predict one another's presence. One variable might always be empty when another is filled, or they might always both be filled or both empty, and so on.
Cluster leaves which split close to zero, but not exactly at zero, predict one another very well, though still imperfectly.
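For readers who want to reproduce these plots, here is a minimal sketch using the missingno library (we assume the figures were generated with a similar tool and that the training file is named train.csv):

```python
import pandas as pd
import missingno as msno

# Training data from the competition (file name assumed)
train = pd.read_csv("train.csv")

# Nullity matrix: one column per variable, white gaps mark missing values
msno.matrix(train)

# Nullity dendrogram: clusters variables by how well their
# missingness patterns predict one another
msno.dendrogram(train)

# Share of observations with no missing values at all
print(train.notnull().all(axis=1).mean())
```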
For our dataset, we see that the nullity of the build_count variables is correlated, which makes sense but does not provide much insight. More interestingly, the variables related to house characteristics (e.g. square footage of the kitchen, number of rooms, material, and max floor) predict one another’s presence. We learned later in the process that these variables were important for predicting prices accurately. However, as the dendrogram shows, if an observation is missing one of these variables, it is probably missing the others, making that observation of little use.
This is something to keep in mind when imputing. If an observation is missing the number of rooms, and probably several other house-characteristic variables along with it, is it better to impute the missing values or to drop the observation? The latter approach seemed more accurate, which is why, in some situations, we worked only with the complete observations.
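As a rough sketch of the two options (column names such as kitch_sq, num_room, material, and max_floor are the ones used in the Sberbank training set; this is only an illustration of the trade-off, not our exact pipeline):

```python
# House-characteristic variables that tend to be missing together
house_cols = ["kitch_sq", "num_room", "material", "max_floor"]

# Option 1: impute, e.g. each column with its median
imputed = train.fillna(train[house_cols].median())

# Option 2: drop observations missing any of these variables,
# since missing one usually means missing the others
complete = train.dropna(subset=house_cols)
```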
Our priority for this project was to uncover as many insights as possible. This is why we decided to stick to models that are easily interpretable and avoid models such as XGBoost which may help get a high score on Kaggle but are still essentially black boxes.
In this blog post, we'll highlight our top 3 insights from this Kaggle competition.
The fewer features, the better
The first step was to identify the important variables and dismiss variables that were merely noise. We started with a basic model that would serve as a baseline for our predictions.
Rather than start with imputation (a daunting task given the missingness within the dataset), we first investigated a reduced dataset consisting of complete observations to try to identify important features. We proceeded with caution, bearing in mind that this reduced dataset was likely biased and potentially not a good representation of the original dataset.
Below is a diagram that represents our feature selection workflow. The main goals were to:
- Yield a small number of predictors to then be used for basic Linear Regression/Tree Models with high interpretability.
- Involve a variety of feature selection methods, to familiarize ourselves with the different ways to select features and better understand the strengths and weaknesses of each method.
It should be noted that, in general, one shouldn’t expect that combining “important” features from two different types of models will yield a set of features that is “important” in any single model.
A side note on variable importance with Lasso
To select a list of “important” features with this method, we slowly increased a penalty parameter λ to “shrink” the parameters {βj} in the objective function to be minimized:

$$\min_{\beta_0,\,\beta}\;\sum_{i=1}^{n}\Big(y_i-\beta_0-\sum_{j=1}^{p}\beta_j x_{ij}\Big)^2+\lambda\sum_{j=1}^{p}\lvert\beta_j\rvert$$
Due to the ℓ1 norm used in the Lasso penalty, and the way it shapes the constraint region of the parameter space determined by λ, many of the coefficients shrink to exactly zero as λ increases (in contrast with Ridge Regression). To select features, we sought a value of λ that produced a model with approximately 20 non-zero coefficients. The following plot shows how the coefficients vary (in general, shrink toward zero) as the penalty λ increases:
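In code, the search for such a λ might look like the sketch below (assuming a standardized feature DataFrame X and a log-price target y have already been prepared; scikit-learn calls the penalty alpha):

```python
import numpy as np
from sklearn.linear_model import Lasso

# Increase the penalty until roughly 20 coefficients remain non-zero
for alpha in np.logspace(-4, 1, 100):
    lasso = Lasso(alpha=alpha, max_iter=10000).fit(X, y)
    n_nonzero = int(np.sum(lasso.coef_ != 0))
    if n_nonzero <= 20:
        selected = X.columns[lasso.coef_ != 0]
        print(f"alpha={alpha:.4f} keeps {n_nonzero} features:", list(selected))
        break
```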
We trimmed the list down further from 20 to 11 features using “forward selection”, and again from 11 to 4. The main goal was to get a good sense of our dataset by playing with it a little before refining our model. Our results are summarized in the table below:
| Model Type | Training Error (RMSE) | Kaggle Error (RMSE) |
|---|---|---|
| Linear Regression: 11 Features | .4700 | .358 |
| Lasso Regression: 11 Features | .4703 | .355 |
| Lasso Regression: 4 Features | .4721 | .361 |
It is interesting to note that neither the Training Error nor the Kaggle Error suffers much from a heavily reduced model (in terms of number of features). It is also noteworthy, and surprising, that the Kaggle Error (the error on the test set) is considerably lower than the error on our training set. We suspected that the housing prices in the test set were distributed quite differently from those in the training set. For the sake of experimentation, we then removed “timestamp” from the model to see how this would change our results.
| Model Type | Training Error (RMSE) | Kaggle Error (RMSE) |
|---|---|---|
| Linear Regression: 3 Features (No Time) | .4768 | .347 |
Our Kaggle Error improved considerably, which led us to infer that housing prices likely dropped substantially (on average) from the training set to the Kaggle test set: in the training set, housing prices trended upward with time, so eliminating time yields lower price predictions. To investigate this more closely, we turned to the macro data to see if it could be used to understand and predict housing market changes effectively.
The macro data defies prediction
Our approach to the macro data was to try to find a relationship between the Russian economy and the average Russian housing price level.
If the macro data could predict the average price level to some extent, the goal was to combine the macro features with the house-specific data via model stacking to come up with a final prediction.
First, we needed to create a response variable (as the original macro dataset contained no price information), so we looked at the daily average price. However, this time series was very noisy, since it depended on the particulars of the houses sold on a given day. This is why we took a 30-day rolling average to smooth it out (the blue line in the graph below) and looked at price per square meter to mitigate the effect of size on price (the green line).
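A minimal sketch of how this response variable can be built with pandas (timestamp, price_doc, and full_sq are columns of the housing training set; the details of our exact preprocessing are omitted):

```python
# Average sale price and price per square meter, by day
daily = (
    train.assign(price_per_sq=train["price_doc"] / train["full_sq"])
         .groupby("timestamp")[["price_doc", "price_per_sq"]]
         .mean()
)

# 30-day rolling average to smooth out day-to-day noise
smoothed = daily.rolling(window=30).mean()
```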
The two series were highly correlated (as depicted in the graph below), meaning that we could use them interchangeably with similar results.
The next step was to identify macro-economic factors that best predicted the price and could be used in our final model. The macro-economic variables were highly correlated, which narrowed down the selection significantly.
In reality, we already had an idea of what we wanted to see in the data. We had run some trials on Kaggle in which we submitted the same price for every house in order to determine the average price level. We realized that, even though the price level was mostly increasing during the training period, it was decreasing during the test period. This is why we were looking for a negative signal in the macro data, something that could predict a decrease. Unfortunately, there was none. What we saw in the macro data instead was a paradigm shift, in which an exogenous event such as an action by the central bank shook up the relationship between the variables, most likely right at the border between the training and test sets.
The silver lining is that the average price level in the training set and the average price level in the test set ended up being about the same, so if we ignored time and any macro data, we got better results.
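As a side note, the constant-price trials mentioned above take only a few lines (assuming the competition's sample submission file with id and price_doc columns; the constant itself is illustrative):

```python
import pandas as pd

# Submit the same price for every house to probe the test-set average price level
submission = pd.read_csv("sample_submission.csv")
submission["price_doc"] = 7_000_000  # constant guess in rubles (illustrative value)
submission.to_csv("constant_submission.csv", index=False)
```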
Multiple linear model outperforms the fancy models
With these insights in mind (i.e. fewer features is better and time is misleading), we decided to run additional models on our main dataset that would exclude any notion of time and use a limited number of variables.
We started with a multiple linear regression. Aside from its high interpretability, a linear regression is more capable when it comes to extrapolation, which matters when the test-set values of a predictor fall outside the range seen in the training set used to build the model. This scenario was relevant to our dataset: for example, some sub-areas were present in the test dataset but not in the training dataset, and vice versa.
The variance inflation factors (VIF) revealed some multicollinearity among our initial features. Our final selection (after accounting for multicollinearity) included 7 variables, none of them time-related. Our score on Kaggle placed us in the top 50%.
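For reference, variance inflation factors can be computed with statsmodels as in the sketch below (X is the candidate feature matrix, assumed numeric and free of missing values):

```python
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

X_const = add_constant(X)  # include an intercept before computing VIFs
vif = pd.Series(
    [variance_inflation_factor(X_const.values, i) for i in range(X_const.shape[1])],
    index=X_const.columns,
)
print(vif.drop("const"))  # values well above 5-10 signal multicollinearity
```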
The next step was to improve the accuracy of our model. There are multiple ways to do so: feature engineering, deciding how to treat missing data and outliers, refining our selection of features, tuning our algorithm, using ensemble methods, etc. We decided on two strategies:
- Refining the selection of our features
- Changing our model and tuning our algorithm
We also tried different ways of treating missing data and outliers; however, the score did not change much. Feature engineering, i.e. creating new variables from existing ones, did not help much either.
We’ll go into more detail on the two main strategies we ended up choosing:
1- Refining the selection of our features
Principal Component Analysis (PCA) is most useful on datasets with many dimensions, where interpreting a large number of raw variables becomes difficult. This made it a natural fit for our dataset.
We applied PCA to our numeric variables and ended up with 10 principal components, which we then used in our linear model. This strategy did not improve the accuracy of our model, and the biplot below explains why.
The points in the graph represent the scores of the observations on the principal components. The points are close together, which means that the observations have similar scores on the components displayed in the plot, an indication that the components did not differentiate the observations as well as we had expected. The variance explained by each principal component was in fact quite low (as shown in the table below), and the components were loaded roughly equally on the variables. This is why the components were not necessarily better predictors than the raw variables.
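For completeness, the PCA step can be sketched with scikit-learn as follows (X_numeric stands for our numeric columns; PCA is scale-sensitive, so we standardize first):

```python
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Standardize, then keep the first 10 principal components
pca_pipeline = make_pipeline(StandardScaler(), PCA(n_components=10))
X_pca = pca_pipeline.fit_transform(X_numeric)

# Share of variance explained by each component
print(pca_pipeline.named_steps["pca"].explained_variance_ratio_)
```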
2- Selecting a different model and tuning our algorithm
The last strategy to improve the accuracy of our model was to run a random forest. We tuned our algorithm by playing with the mtry hyper-parameter.
A word of explanation on the mtry hyper-parameter: when forming each split in a tree, the algorithm randomly selects mtry variables from the set of available predictors; the best split point is then chosen within this random subset, and a new random subset is drawn for each split.
We used an mtry of 53 (equal to the number of features divided by 3, usually best for regression) and an mtry of 12 (equal to the square root of the total number of features). The root mean squared error on the training set was very low (0.07 for the random forest with mtry of 53 and 0.15 for the one with mtry of 12), indicating overfitting (confirmed in the graph below). Our score on Kaggle was indeed slightly worse than that of the multiple linear regression (0.35346 vs. 0.35198).
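mtry is the name of this parameter in R's randomForest package; in scikit-learn the analogous setting is max_features. A sketch of the two settings, assuming training data X_train and y_train (the number of trees here is arbitrary):

```python
from sklearn.ensemble import RandomForestRegressor

# mtry = p/3 (the usual default for regression) vs. mtry = sqrt(p)
rf_third = RandomForestRegressor(n_estimators=500, max_features=53, random_state=0)
rf_sqrt = RandomForestRegressor(n_estimators=500, max_features=12, random_state=0)

rf_third.fit(X_train, y_train)
rf_sqrt.fit(X_train, y_train)
```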
There are a few reasons why the random forest did not outperform the initial linear model.
- First, a random forest averages the predictions of a large number of trees. When a single point lies far from the others, many of the trees will never see it, so a random forest performs poorly at the extremes.
- Second, a random forest does not predict beyond the range of the training data (see the short sketch after this list). Some sub-areas were in the training data but not in the test data, and vice versa. In the context of a time series, a random forest also fails to extrapolate an out-of-sample trend.
- Third, a tree needs a lot of branches to approximate a linear function well. Although two of our seven variables were dummified, the number of branches in our trees was limited, preventing the random forest from doing what it does best, namely feature selection.
- Finally, regression performs well with continuous variables, while random forests do better with discrete variables. The number of discrete variables in our case was limited.
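The extrapolation point in particular is easy to demonstrate with a toy example: a random forest trained on a linear trend plateaus outside the range of its training data, while a linear regression keeps following the trend.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X_train = np.linspace(0, 10, 200).reshape(-1, 1)
y_train = 3 * X_train.ravel() + rng.normal(scale=1.0, size=200)

# Points beyond the training range [0, 10]
X_new = np.array([[12.0], [15.0], [20.0]])

rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_train, y_train)
lm = LinearRegression().fit(X_train, y_train)

print(rf.predict(X_new))  # plateaus near the largest target seen in training (~30)
print(lm.predict(X_new))  # keeps extrapolating the linear trend (~36, 45, 60)
```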
Conclusion
There is no doubt that using XGBoost would have helped boost our score. However, the main goal of the project for our team was to stick to simpler, highly interpretable models rather than rely too much on black-box models (granted, a random forest arguably belongs in that category). Although using a random forest did not work for us, it did work for other teams that used different parameters. However, as mentioned above, tree methods are poor at extrapolating, and the competition hid this weakness of tree methods, since housing prices fell in the test set (but stayed within the same range as the prices in the training set).