In April 2017, Sberbank, Russia’s oldest and largest bank, created a Kaggle competition with the goal of predicting realty prices in Moscow. Given Russia’s volatile economy, this was a unique challenge. Sberbank provided Kagglers with a rich dataset of over 200 variables, depicting both housing features and macroeconomic information, to address this challenge.

## Workflow

With only two weeks to generate results from this daunting dataset, our team had to quickly explore the data and implement machine learning techniques to best predict housing prices. After a couple of days of individual research into Russian real estate and investigation of Kagglers’ kernels, our team set out to manage our project as depicted in the workflow below.

The process was fast-paced and iterative as each model provided new insights and questions into the data. Below, we describe the exploratory data analysis, cleaning and imputation, feature engineering, and model selection and performance. The primary models we used for this task were:

- Lasso Regularization
- Multiple Linear Regression
- Random Forest
- Boosted Random Forest (XGBoost)

Exploratory Data Analysis and Data Cleaning

**Data Statistics:**

To have an overview of the percentages of missing values in each of the dataset, we constructed the bar-plots below.

In the train dataset, out of the 291 features 51 have missing values. Out of the 290 features, 48 have missing values in the test dataset whereas for the macro dataset, 91 features have missing values out 100.

With reference to the extensive EDA prepared by Troy Walters (another fellow Kaggler), we identified the train/test features which have strong correlation with price_doc (target variable) and started exploring each feature to get a better grasp of the nature and distribution of the data.

A histogram plot of the “price_doc” indicates that the distribution of price is right-skewed as shown below. We log-transformed the “price_doc” to obtain a normal distribution.

Exploring further and analyzing each feature revealed substantial amount of data quality issues. The below boxplots for “material” and “state” columns against the “price_doc”, show unusual distribution for material value 3.0 and state value 33.0 indicating potential issue with data. Further inspection showed the presence of only one observation for material = 3.0, which made sense to assign it to NA. “state” = 33.0 has been most likely wrongly recorded and is corrected to 3.0.

The “built_year” column contained erroneous entries such as “20052009”, “4965”, “71”, which we corrected as “2009”, “1965”, “1971” respectively using best logical judgment. Also, there few observations where the “built_year” values were entered under “kitch_sq” column and few where the values were being recorded with single digit number. We placed the values under the correct column and assigned NAs for those records with single digit built year value.

For records with “life_sq” greater than “full_sq” and “full_sq” greater than 1000, we reduced the “life_sq” and “full_sq” values proportionately by a factor of 10 based on the distribution of the data values. For records with “max_floor” > 60, the “max_floor” values were assigned to NAs as they are likely potential outliers.

## Missing Imputation

For NA or 0 “life_sq” values, we imputed the missing values by the mean of the “life_sq” column values if the mean is less than the corresponding “full_sq” value of each record. Otherwise, the values are imputed with (mean-10). Missing “max_floor” values were imputed with the median value of the “max_floor” column. For “material” and “num_room” we used mode of the respective column values to impute the missing data.

“built_year” and “state” had substantial number of records ( 13605 and 13559 respectively in the train dataset; 1049 and 694 respectively in the test dataset) with missing values as shown in the missingness matrix below. We did not think it was a good idea to just consider complete cases, as there might be potential information loss given the high number of missing records for these two features. We used other features with available data as predictors and set these two features as “target” variable (one at a time) and generated predicted values using XGBoost.

## Feature Engineering and Selection

Since the dataset had quite a few categorical features, we used Label encoding to convert the categorical values into numeric values. Along with the original features, we combined features that were showing up as important and created new features. To calculate the importance of the variables in predicting the housing prices, we used Ridge and LASSO regularization as generalized linear models.

## Ridge Regression

The model was validated using 100 values from 0.01 to 100 for the shrinkage penalty lambda. Before this, the features with skewed values were normalized so that they follow a normal distribution. Since the magnitude of coefficients distort the output of the model, we standardized the values using a standard scalar. The coefficient plot of the ridge regression shows the features in the descending order in which they converge towards zero. However, the L2 regularization does not select a subset of important features and hence we proceeded with the LASSO.

## Lasso Regression

We fed the model a range of lambda values similar to the ridge regression. The benefits of using the L1 norm over the L2 norm is that after a certain value of the shrinkage parameter, the coefficients drop to zero. The LASSO model eliminated 363 features and selected 95. The rmse values of the Ridge and the LASSO regression occurred very close to each other.

## Multiple Linear Regression

For an initial linear regression, we took Lasso’s top 30 important features and only the complete cases within the dataset so as to not deal with any imputation. This amounted to about 14,000 observations to train the model. Using R’s lm function, we trained and tested the linear model on an 80-20 split of the training data.

Unfortunately, further investigation proved that the data violated the assumptions of linearity. The Q-Q plot (the first graph below) is in violation of the normality assumption of the residuals. Additionally, we found that a couple of the observations were outside of Cook’s distance, showing they are highly influential outliers.

Taking these results into consideration, we eliminated those two outlying observations from the dataset and ran a backwards stepwise function measuring BIC to narrow down the number of features to 20. Next, we tested the VIF (variance inflation factor) of the features as a way to target multicollinearity and eliminated three more predictors. With 17 features, the model had an R squared of 0.3366 and our predictions had an RMSLE of 0.46986.

We then ran a second multiple linear regression in python using the cleaned, imputed data and 46 predictors. Some of these predictors came from the macroeconomic dataset as well.

Dropping 17 variables after running a VIF test left us with 29 predictors and an R squared value of 0.3631. This model improved our RMSLE score significantly to a 0.39653.

## Random Forest

We also employed a random forest model on our training set with the hopes of yielding more accurate predictions. Random forests are powerful models because they theoretically can’t overfit the data, but we had to be wary of parameter tuning and predictor variables included in the model. We used scikit-learn’s RandomForestRegressor module and experimented with different minimum sample leaf sizes, number of estimators, and model features.

An important aspect when improving the random forest models was deciding which variables to include in the model. Since the data was a time series, we wanted to make sure that we had variables which captured the change over time for select variables. In our case, we looked at the data in the macro dataset to get a high level picture of how the Russian economy might affect the housing prices. From the macro dataset, we included a one month, three month, six month, and one year time lag to show the rate of change for oil prices, US dollar to Russian ruble exchange rate, and inflation rate. We then included these features as predictors in our random forest regression model.

In the end, our best model gave us a root mean squared error of about 0.326. Unfortunately, the feature engineered variables, which were mostly variables included from the macro dataset, did not actually improve our model in any way, with our most important features consistently being variables that described housing attributes and location attributes. Although a standard random forest model won’t out-perform an XGBoost model, it is easier to understand which features might be important to the model, and with more time, I would attempt to make the feature engineered variables improve our model.

## XGBoost

Having obtained a score of 0.326 with a Random Forest model, we then began training a model using XGBoost. XGBoost, short for extreme gradient boosting, is a powerful variant of *Friedman’s* Stochastic Gradient Boosting algorithm and can be consistently found in Kaggle-winning models. In short, Random Forest models use random subsets of your data at each iteration whereas XGBoost iterates depending on each previous iterations outcome.

As is often the case with training supervised machine learning models, we optimize an objective function (consisting of a loss component and a regularization component) to determine the parameters that will give us the strongest model:

Obj(Θ) = L(θ) + Ω(Θ)

In XGBoost, the parameters tune the regularization component of the objective function, with each playing some role in the complexity of the model. With our parameters tuned, we were able to obtain a score of 0.3131, outperforming the rest of our models (details on what changing each parameter does can be found here).

## Conclusions & Future Work

After two weeks and forty submissions, our team landed in the top 6% of 2,000 Kaggle teams with an RMSLE score of 0.3131. Although less interpretable, the black-box XGBoost model performed the best. Surprisingly, imputing with the median (rather than any sort of regression or nearest-neighbors algorithm) and eliminating outliers before training the models gave better results as well.

As further investigation into our results, we graphed the density plots of some of our predictions for the test set.

As shown in the chart above, our best score seems to be the most normally-distributed about an average price around 7,400,000 rubles.

We believe that running a principal component analysis for dimensionality reduction and testing out further feature engineering ideas may lead to better results. Additionally, practicing Bayesian Optimization for finding the best parameters would be much more time-efficient than a grid search.

Now that we have seen how the models work in practice and learned how to best communicate our results with one another, we can better plan how to connect our code and possibly put together an ensemble or stacked model. Additionally, given the time-constraint of the test data, implementing time series analysis would lead to a better interpretation of the problem.