Kaggle competition (top 3%): Optimizing Russian housing price prediction through a deep dive into model selection and feature engineering

Russian Housing Market 

The goal of this Kaggle competition is to predict Moscow housing prices, using data provided by Sberbank, with machine learning models and feature engineering.

We achieved a satisfactory Kaggle score of 0.314 (RMSLE) through a deep dive into machine learning model selection and feature engineering, finishing in the top 3% of 3,274 teams on the final leaderboard.

Major predictors used across models

Feature Selection with Random Forest & Lasso

We divided the features into 16 subgroups (e.g., demographics).

We ran random forests and Lasso on each subgroup.
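
As a minimal sketch of that per-subgroup screening, assuming the training file, the target column 'price_doc', and log-transformed prices to match the RMSLE metric; the subgroup contents shown here are placeholders, not our exact 16 groups:

    import numpy as np
    import pandas as pd
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.linear_model import LassoCV

    train = pd.read_csv("train.csv")
    y = np.log1p(train["price_doc"])   # log target, matching the RMSLE metric

    # Placeholder subgroup definition; in practice there were 16 such groups
    subgroups = {"demographics": ["raion_popul", "work_all", "young_all"]}

    for name, cols in subgroups.items():
        X = train[cols].fillna(train[cols].median())
        rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)
        lasso = LassoCV(cv=5).fit(X, y)    # lambda chosen at the minimum-MSE level
        print(name)
        print("  RF importances:", dict(zip(cols, rf.feature_importances_.round(3))))
        print("  Lasso coefs:   ", dict(zip(cols, lasso.coef_.round(3))))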

Interpretation: in most groups, LASSO found all of the group's features significant at the lambda value that minimizes MSE.

To select features, we chose the coefficients that went to zero slowest as lambda increased.
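
Concretely, one way to implement that ordering is to trace the full regularization path and rank features by the largest lambda at which their coefficient is still nonzero; a sketch using scikit-learn's lasso_path, with X (a DataFrame of candidate features) and y as in the sketch above:

    import numpy as np
    from sklearn.linear_model import lasso_path
    from sklearn.preprocessing import scale

    # Standardize so coefficients are comparable across features;
    # lasso_path returns the alphas (lambdas) in decreasing order
    alphas, coefs, _ = lasso_path(scale(X), y)

    # For each feature, the largest lambda at which its coefficient is nonzero:
    # features that survive to larger lambdas shrink to zero slowest
    last_alive = np.array([alphas[np.nonzero(row)[0]].max() if row.any() else 0.0
                           for row in coefs])
    ranking = np.argsort(-last_alive)
    top_35 = [X.columns[i] for i in ranking[:35]]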

Thirty-five features turned out to be ideal in this case, with a training MSE of 0.31 (RMSE 0.56), a test RMSE of 0.46, and a Kaggle score of 0.35.

Features Selected

  • Apartment characteristics
  • Distance to transportation, services, and lifestyle needs
  • Demographics
  • Neighborhood characteristics
  • Raion (district) characteristics

Multiple Linear Regression

We also tried multiple linear regression, using a handful of features that intuitively matter for housing price prediction.

For the simplest model, we used ‘full_sq’ (size of the unit), ‘ttk_km’ (distance to the Third Ring Road), and ‘public_transport_station_min_walk’ (walking time in minutes to the nearest public transportation station).
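
A minimal sketch of that three-feature fit, again assuming the target 'price_doc' is log-transformed so the training RMSE is comparable to the RMSLE leaderboard metric:

    import numpy as np
    import pandas as pd
    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import mean_squared_error

    train = pd.read_csv("train.csv")
    cols = ["full_sq", "ttk_km", "public_transport_station_min_walk"]

    X = train[cols].fillna(train[cols].median())
    y = np.log1p(train["price_doc"])

    mlr = LinearRegression().fit(X, y)
    rmse = np.sqrt(mean_squared_error(y, mlr.predict(X)))
    print(f"training RMSE (log scale): {rmse:.3f}")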

This simple model gave a result superior to the 35-feature LASSO model: an RMSE of 0.499 on the training set and a Kaggle score of 0.37535.

With 15 features, we lowered the training RMSE a bit further to 0.466, for a Kaggle score of 0.35189.

Macro data may not be as helpful: it is time-series data, and if year/month are included as independent variables, they already incorporate the time element.

XGBoost

Feature selection: 11 main features + 28 selected features + macro features

  • Macro features: CPI, PPI, gdp_deflator, etc.
  • Feature engineering (a pandas sketch follows this list):
    • Density = raion population / area size
    • Monthly/weekly transaction volume count
    • Day of week: df['dow'] = df['date'].dt.dayofweek (via the pandas datetime accessor, not the datetime library)
    • Relative floor = floor / max. number of floors
    • Avg. room size = living area / number of rooms
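
Put together, a pandas sketch of those engineered features followed by a basic XGBoost fit; the raw column names (timestamp, raion_popul, area_m, max_floor, life_sq, num_room) are as we recall them from the competition data, and the hyperparameters are illustrative rather than the tuned values behind the final score:

    import numpy as np
    import pandas as pd
    import xgboost as xgb

    df = pd.read_csv("train.csv")
    df["date"] = pd.to_datetime(df["timestamp"])

    # Day of week via the pandas datetime accessor
    df["dow"] = df["date"].dt.dayofweek

    # Density = raion population / area size
    df["density"] = df["raion_popul"] / df["area_m"]

    # Monthly transaction volume: number of sales sharing the same year-month
    df["month_volume"] = df.groupby(df["date"].dt.to_period("M"))["date"].transform("count")

    # Relative floor = floor / max. number of floors
    df["rel_floor"] = df["floor"] / df["max_floor"]

    # Avg. room size = living area / number of rooms
    df["avg_room_sq"] = df["life_sq"] / df["num_room"]

    features = ["full_sq", "dow", "density", "month_volume", "rel_floor", "avg_room_sq"]
    X, y = df[features], np.log1p(df["price_doc"])

    # XGBoost handles the remaining NaNs natively
    model = xgb.XGBRegressor(n_estimators=500, max_depth=5, learning_rate=0.05,
                             subsample=0.8, colsample_bytree=0.8)
    model.fit(X, y)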

Summary of Results
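
Collecting the scores reported above in one place:

  • LASSO (35 features): training RMSE 0.56, Kaggle score 0.35
  • Multiple linear regression (3 features): training RMSE 0.499, Kaggle score 0.37535
  • Multiple linear regression (15 features): training RMSE 0.466, Kaggle score 0.35189
  • XGBoost (final submission): Kaggle score 0.314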

Conclusion: What to focus on in the real world

  • XGBoost gave the best RMSE
  • LASSO and random forest results were worse than multiple linear regression with 3 key features
  • Common sense vs. cumbersome models
  • Efficient and reasonable study design
  • Focus on a few key features
  • When starting a project, begin with the most intuitive and simplest route
  • Kaggle is really addictive

About Authors


Daniel Rim

Daniel Rim has been working as a quant analyst analyzing undervalued equity investments in emerging markets. His educational background spans mathematics, physics, and statistics. He has been immersing himself in the innovation and...
Choutine Zhou

Choutine Zhou is an innovative thinker and problem solver who delivers actionable insights using data science tools. She focuses on translating business cases into analytical processes, from data exploration to decision making. Choutine holds a bachelor's degree in...
Jade Ngoc Le-Cascarino

Jade graduated magna cum laude from Columbia University and Sciences Po Paris, where she obtained two BAs in Political Science and the Social Sciences. She is particularly interested in the intersection of politics and data science, fiercely advocating...
