From Khrushchev to Kaggle: The Russian Real Estate Market

In April 2017, Sberbank, Russia’s oldest and largest bank, created a Kaggle competition with the goal of predicting realty prices in Moscow. Given Russia’s volatile economy, this was a unique challenge. Sberbank provided Kagglers with a rich dataset of over 200 variables, depicting both housing features and macroeconomic information, to address this challenge.


With only two weeks to generate results from this daunting dataset, our team had to quickly explore the data and implement machine learning techniques to best predict housing prices. After a couple of days of individual research into Russian real estate and investigation of Kagglers’ kernels, our team set out to manage our project as depicted in the workflow below.

The process was fast-paced and iterative as each model provided new insights and questions into the data. Below, we describe the exploratory data analysis, cleaning and imputation, feature engineering, and model selection and performance. The primary models we used for this task were:

  • Lasso Regularization
  • Multiple Linear Regression
  • Random Forest
  • Boosted Random Forest (XGBoost)

Exploratory Data Analysis and Data Cleaning

Data Statistics:

To have an overview of the percentages of missing values in each of the dataset, we constructed the bar-plots below.

In the train dataset, out of the 291 features 51 have missing values. Out of the 290 features, 48 have missing values in the test dataset whereas for the macro dataset, 91 features have missing values out 100.

With reference to the extensive EDA prepared by Troy Walters (another fellow Kaggler), we identified the train/test features which have strong correlation with price_doc (target variable) and started exploring each feature to get a better grasp of the nature and distribution of the data.

A histogram plot of the “price_doc” indicates that the distribution of price is right-skewed as shown below. We log-transformed the “price_doc” to obtain a normal distribution.

Exploring further and analyzing each feature revealed substantial amount of data quality issues. The below boxplots for “material” and “state” columns against the “price_doc”, show unusual distribution for material value 3.0  and state value 33.0 indicating potential issue with data. Further inspection showed the presence of only one observation for material = 3.0, which made sense to assign it to NA. “state” = 33.0 has been most likely wrongly recorded and is corrected to 3.0.

The “built_year” column contained erroneous entries such as “20052009”, “4965”, “71”, which we corrected as “2009”, “1965”, “1971” respectively using best logical judgment. Also, there few observations where the “built_year” values were entered under “kitch_sq” column and few where the values were being recorded with single digit number. We placed the values under the correct column and  assigned NAs for those records with single digit built year value.

For records with “life_sq” greater than “full_sq” and “full_sq” greater than 1000, we reduced the “life_sq” and “full_sq” values proportionately by a factor of 10 based on the distribution of the data values. For records with “max_floor” > 60, the “max_floor” values were assigned to NAs as they are likely potential outliers.

Missing Imputation

For NA or 0 “life_sq” values, we imputed the missing values by the mean of the “life_sq” column values if the mean is less than the corresponding “full_sq” value of each record. Otherwise, the values are imputed with (mean-10). Missing “max_floor” values were imputed with the median value of the “max_floor” column. For “material” and “num_room” we used mode of the respective column values to impute the missing data.

“built_year” and “state” had substantial number of records ( 13605 and 13559 respectively in the train dataset; 1049 and 694 respectively in the test dataset) with missing values as shown in the missingness matrix below. We did not think it was a good idea to just consider complete cases, as there might be potential information loss given the high number of missing records for these two features. We used other features with available data as predictors and set these two features as “target” variable (one at a time) and generated predicted values using XGBoost.

Visualizing Missingness

Feature Engineering and Selection

Since the dataset had quite a few categorical features, we used Label encoding to convert the categorical values into numeric values. Along with the original features, we combined features that were showing up as important and created new features. To calculate the importance of the variables in predicting the housing prices, we used Ridge and LASSO regularization as generalized linear models.

Ridge Regression

The model was validated using 100 values from 0.01 to 100 for the shrinkage penalty lambda. Before this, the features with skewed values were normalized so that they follow a normal distribution. Since the magnitude of coefficients distort the output of the model, we standardized the values using a standard scalar. The coefficient plot of the ridge regression shows the features in the descending order in which they converge towards zero. However, the L2 regularization does not select a subset of important features and hence we proceeded with the LASSO.

Lasso Regression

We fed the model a range of lambda values similar to the ridge regression. The benefits of using the L1 norm over the L2 norm is that after a certain value of the shrinkage parameter, the coefficients drop to zero. The LASSO model eliminated 363 features and selected 95. The rmse values of the Ridge and the LASSO regression occurred very close to each other.

Multiple Linear Regression

For an initial linear regression, we took Lasso’s top 30 important features and only the complete cases within the dataset so as to not deal with any imputation. This amounted to about 14,000 observations to train the model. Using R’s lm function, we trained and tested the linear model on an 80-20 split of the training data.

Unfortunately, further investigation proved that the data violated the assumptions of linearity. The Q-Q plot (the first graph below) is in violation of the normality assumption of the residuals. Additionally, we found that a couple of the observations were outside of Cook’s distance, showing they are highly influential outliers.

Taking these results into consideration, we eliminated those two outlying observations from the dataset and ran a backwards stepwise function measuring BIC to narrow down the number of features to 20. Next, we tested the VIF (variance inflation factor) of the features as a way to target multicollinearity and eliminated three more predictors. With 17 features, the model had an R squared of 0.3366 and our predictions had an RMSLE of 0.46986.

We then ran a second multiple linear regression in python using the cleaned, imputed data and 46 predictors. Some of these predictors came from the macroeconomic dataset as well.

Dropping 17 variables after running a VIF test left us with 29 predictors and an R squared value of 0.3631. This model improved our RMSLE score significantly to a 0.39653.

Random Forest

We also employed a random forest model on our training set with the hopes of yielding more accurate predictions. Random forests are powerful models because they theoretically can’t overfit the data, but we had to be wary of parameter tuning and predictor variables included in the model. We used scikit-learn’s RandomForestRegressor module and experimented with different minimum sample leaf sizes, number of estimators, and model features.

An important aspect when improving the random forest models was deciding which variables to include in the model. Since the data was a time series, we wanted to make sure that we had variables which captured the change over time for select variables. In our case, we looked at the data in the macro dataset to get a high level picture of how the Russian economy might affect the housing prices. From the macro dataset, we included a one month, three month, six month, and one year time lag to show the rate of change for oil prices, US dollar to Russian ruble exchange rate, and inflation rate. We then included these features as predictors in our random forest regression model.

In the end, our best model gave us a root mean squared error of about 0.326. Unfortunately, the feature engineered variables, which were mostly variables included from the macro dataset, did not actually improve our model in any way, with our most important features consistently being variables that described housing attributes and location attributes. Although a standard random forest model won’t out-perform an XGBoost model, it is easier to understand which features might be important to the model, and with more time, I would attempt to make the feature engineered variables improve our model.


Having obtained a score of 0.326 with a Random Forest model, we then began training a model using XGBoost. XGBoost, short for extreme gradient boosting, is a powerful variant of Friedman’s Stochastic Gradient Boosting algorithm and can be consistently found in Kaggle-winning models. In short, Random Forest models use random subsets of your data at each iteration whereas XGBoost iterates depending on each previous iterations outcome.

As is often the case with training supervised machine learning models, we optimize an objective function (consisting of a loss component and a regularization component) to determine the parameters that will give us the strongest model:

Obj(Θ) = L(θ) + Ω(Θ)

In XGBoost, the parameters tune the regularization component of the objective function, with each playing some role in the complexity of the model. With our parameters tuned, we were able to obtain a score of 0.3131, outperforming the rest of our models (details on what changing each parameter does can be found here).

Conclusions & Future Work

After two weeks and forty submissions, our team landed in the top 6% of 2,000 Kaggle teams with an RMSLE score of 0.3131. Although less interpretable, the black-box XGBoost model performed the best. Surprisingly, imputing with the median (rather than any sort of regression or nearest-neighbors algorithm) and eliminating outliers before training the models gave better results as well.

As further investigation into our results, we graphed the density plots of some of our predictions for the test set.

As shown in the chart above, our best score seems to be the most normally-distributed about an average price around 7,400,000 rubles.  

We believe that running a principal component analysis for dimensionality reduction and testing out further feature engineering ideas may lead to better results. Additionally, practicing Bayesian Optimization for finding the best parameters would be much more time-efficient than a grid search.

Now that we have seen how the models work in practice and learned how to best communicate our results with one another, we can better plan how to connect our code and possibly put together an ensemble or stacked model. Additionally, given the time-constraint of the test data, implementing time series analysis would lead to a better interpretation of the problem.

About Authors

Sean Kickham

Sean migrated from the Midwest to New York City after graduating with a BS in Mathematics from the University of Notre Dame. He taught middle school math for five years in city schools. Equipped with a Masters in...
View all posts by Sean Kickham >

Reza Rad

Recognizing the increasingly powerful role that data plays in society, Reza enrolled in the NYC Data Science Academy to be better equipped in solving tomorrow's big questions. With a B.S. in Biomedical Engineering from George Washington University, he's...
View all posts by Reza Rad >

Andrew Rubino

Andrew graduated from UC Santa Barbara with a degree in English and Statistics. He previously worked at an adtech company where he learned the ins and outs of cleaning, transforming, and reporting on big data using a variety...
View all posts by Andrew Rubino >

Related Articles

Leave a Comment

Buggy Funbunny July 14, 2017
it's always been my view that quant analysis is appropriate to processes which follow God's rules, but not so much for those which follow humans'. some humans get to change the rules as they see fit, always with the intent of favoring themselves (they sometimes get it wrong, of course). prediction in the realm of human doings is best done by connecting the dots among discrete decisions/events. know that result B follows event A, then decide the likelihood of event A. and so on. after all, as the saying goes, economists have predicted 10 of the last 5 recessions (or there abouts). in this specific case, or any real estate analysis generally, have you done a meta-analysis of the accuracy of such "qualitative" analyses vis-a-vis any quant approach? and which approach won?
From Khrushchev to Kaggle: The Russian Real Estate Market – Mubashir Qasim July 13, 2017
[…] article was first published on R – NYC Data Science Academy Blog, and kindly contributed to […]

View Posts by Categories

Our Recent Popular Posts

View Posts by Tags

#python #trainwithnycdsa 2019 2020 Revenue 3-points agriculture air quality airbnb airline alcohol Alex Baransky algorithm alumni Alumni Interview Alumni Reviews Alumni Spotlight alumni story Alumnus ames dataset ames housing dataset apartment rent API Application artist aws bank loans beautiful soup Best Bootcamp Best Data Science 2019 Best Data Science Bootcamp Best Data Science Bootcamp 2020 Best Ranked Big Data Book Launch Book-Signing bootcamp Bootcamp Alumni Bootcamp Prep boston safety Bundles cake recipe California Cancer Research capstone car price Career Career Day citibike classic cars classpass clustering Coding Course Demo Course Report covid 19 credit credit card crime frequency crops D3.js data data analysis Data Analyst data analytics data for tripadvisor reviews data science Data Science Academy Data Science Bootcamp Data science jobs Data Science Reviews Data Scientist Data Scientist Jobs data visualization database Deep Learning Demo Day Discount disney dplyr drug data e-commerce economy employee employee burnout employer networking environment feature engineering Finance Financial Data Science fitness studio Flask flight delay gbm Get Hired ggplot2 googleVis H20 Hadoop hallmark holiday movie happiness healthcare frauds higgs boson Hiring hiring partner events Hiring Partners hotels housing housing data housing predictions housing price hy-vee Income Industry Experts Injuries Instructor Blog Instructor Interview insurance italki Job Job Placement Jobs Jon Krohn JP Morgan Chase Kaggle Kickstarter las vegas airport lasso regression Lead Data Scienctist Lead Data Scientist leaflet league linear regression Logistic Regression machine learning Maps market matplotlib Medical Research Meet the team meetup methal health miami beach movie music Napoli NBA netflix Networking neural network Neural networks New Courses NHL nlp NYC NYC Data Science nyc data science academy NYC Open Data nyc property NYCDSA NYCDSA Alumni Online Online Bootcamp Online Training Open Data painter pandas Part-time performance phoenix pollutants Portfolio Development precision measurement prediction Prework Programming public safety PwC python Python Data Analysis python machine learning python scrapy python web scraping python webscraping Python Workshop R R Data Analysis R language R Programming R Shiny r studio R Visualization R Workshop R-bloggers random forest Ranking recommendation recommendation system regression Remote remote data science bootcamp Scrapy scrapy visualization seaborn seafood type Selenium sentiment analysis sentiment classification Shiny Shiny Dashboard Spark Special Special Summer Sports statistics streaming Student Interview Student Showcase SVM Switchup Tableau teachers team team performance TensorFlow Testimonial tf-idf Top Data Science Bootcamp Top manufacturing companies Transfers tweets twitter videos visualization wallstreet wallstreetbets web scraping Weekend Course What to expect whiskey whiskeyadvocate wildfire word cloud word2vec XGBoost yelp youtube trending ZORI