Kaggle competition (top 3%): Optimizing Russian housing price prediction through a deep dive into model selection and feature engineering
Russian Housing Market
The goal of this Kaggle competition is to predict Moscow housing prices, using data provided by Sberbank, through machine learning models and feature engineering.
We achieved a satisfactory Kaggle score of 0.314 (RMSLE) through a deep dive into machine learning model selection and feature engineering, placing in the top 3% of 3,274 teams on the final leaderboard.
Major predictors used across models
Feature Selection with Random Forest & Lasso
We divided the features into 16 subgroups (e.g., demographics).
We ran random forests and Lasso on each subgroup.
Interpretation: in most groups, LASSO indicated that all features in the group were significant at the lambda that minimizes MSE.
To select features, we therefore chose the coefficients that shrink to zero slowest as lambda increases (sketched below).
35 features came out to be ideal in this case, with an MSE of 0.31 (RMSE 0.56) on the training set, an RMSE of 0.46 on Kaggle's test set, and a Kaggle score of 0.35.
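A minimal sketch of this selection rule, assuming standardized numeric features in a matrix X and log prices in y (the function and variable names here are illustrative, not from our competition code):

```python
import numpy as np
from sklearn.linear_model import lasso_path
from sklearn.preprocessing import StandardScaler

def rank_by_shrinkage(X, y):
    """Rank features by how slowly their Lasso coefficients shrink to zero.

    lasso_path returns alphas in decreasing order, so the first alpha at
    which a coefficient becomes nonzero marks how long that feature
    'survives' as the penalty grows.
    """
    X_std = StandardScaler().fit_transform(X)         # standardize so the penalty treats features comparably
    alphas, coefs, _ = lasso_path(X_std, y)           # coefs: (n_features, n_alphas)
    nonzero = np.abs(coefs) > 1e-10
    first_active = nonzero.argmax(axis=1)             # index of largest alpha with coef != 0
    first_active[~nonzero.any(axis=1)] = len(alphas)  # never-active features go last
    return np.argsort(first_active)                   # slowest-to-zero features first
```

Running a rule like this per subgroup and pooling the top-ranked features is one way the 35-feature set above could be assembled.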
Features Selected
- Apartment characteristics
- Distance to transportation, services, and lifestyle needs
- Demographics
- Neighborhood Characteristics
- Raion Characteristics
Multiple Linear Regression
We also tried multiple linear regression using a handful of features that intuitively should matter for housing prices.
For the simplest model, we used 'full_sq' (size of the unit), 'ttk_km' (distance to the Third Ring Road), and 'public_transport_station_min_walk' (walking minutes to the nearest public transport station).
The result was that this simple model outperformed the 35-feature LASSO model, with an RMSE of 0.499 on the training set and a Kaggle score of 0.37535.
Using 15 features, we lowered the training RMSE a bit further, to 0.466, with a Kaggle score of 0.35189. A sketch of the three-feature model follows.
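Here is what the simplest model could look like, assuming the training data is in train.csv with a price_doc target column (as in the Sberbank dataset) and fitting on log prices so the training RMSE is comparable to the RMSLE leaderboard metric:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# The three intuitive predictors from the competition data.
features = ['full_sq', 'ttk_km', 'public_transport_station_min_walk']

train = pd.read_csv('train.csv')
X = train[features].fillna(train[features].median())  # simple median imputation
y = np.log1p(train['price_doc'])                      # log target, so RMSE tracks RMSLE

model = LinearRegression().fit(X, y)
rmse = np.sqrt(mean_squared_error(y, model.predict(X)))
print(f'Training RMSE (log scale): {rmse:.3f}')
```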
Macro data may not be as helpful, since it is time-series data; if year and month are included as independent variables, they already incorporate the time element.
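A minimal sketch of how those time variables can be derived, assuming a pandas DataFrame df with a date column (the same df used in the snippet below):

```python
import pandas as pd

# Assumes df has a 'date' column holding the transaction date.
df['date'] = pd.to_datetime(df['date'])
df['year'] = df.date.dt.year    # captures the macro-level trend
df['month'] = df.date.dt.month  # captures seasonality
```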
XGBoost
Feature selection: 11 main features + 28 selected features + macro features
- Macro features: CPI, PPI, GDP deflator, etc.
- Feature engineering (a code sketch follows this list):
  - Density = raion population / area size
  - Monthly/weekly transaction volume count
  - df['dow'] = df.date.dt.dayofweek (day of week, via pandas' datetime accessor)
  - Relative floor = floor / max number of floors
  - Average room size = living area / number of rooms
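A hedged sketch putting the engineered features and an XGBoost fit together; column names such as raion_popul, area_m, life_sq, and num_room follow the Sberbank data dictionary, and the hyperparameters are illustrative rather than the tuned values we used:

```python
import numpy as np
import pandas as pd
import xgboost as xgb

df = pd.read_csv('train.csv', parse_dates=['date'])  # 'date' as in the snippet above

# Engineered features from the list above.
df['density'] = df['raion_popul'] / df['area_m']                       # raion population / area size
df['dow'] = df.date.dt.dayofweek                                       # day of week
df['month_volume'] = df.groupby(df.date.dt.to_period('M'))['date'].transform('count')
df['rel_floor'] = df['floor'] / df['max_floor'].replace(0, np.nan)     # relative floor
df['avg_room_sq'] = df['life_sq'] / df['num_room'].replace(0, np.nan)  # average room size

features = ['full_sq', 'density', 'dow', 'month_volume', 'rel_floor', 'avg_room_sq']
dtrain = xgb.DMatrix(df[features], label=np.log1p(df['price_doc']))    # XGBoost handles NaNs natively

params = {'objective': 'reg:squarederror', 'eta': 0.05,
          'max_depth': 5, 'subsample': 0.7, 'colsample_bytree': 0.7,
          'eval_metric': 'rmse'}
model = xgb.train(params, dtrain, num_boost_round=400)
```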
Summary of Results
Conclusion: What to focus on in the real world
- XGBoost gave the best RMSE
- LASSO and random forest underperformed multiple linear regression with just 3 key features
- Common sense vs. cumbersome models
- Efficient and reasonable study design
- Focus on a few key features
- When starting a project, begin with the simplest, most intuitive route
- Kaggle is really addictive