Kaggle competition (top 3%): Optimizing Russian housing price prediction through a deep dive into model selection and feature engineering

Russian Housing Market 

The goal of this Kaggle competition is to predict Moscow housing prices, using data provided by Sberbank, with machine learning models and feature engineering.

We achieved a Kaggle score of 0.314 (RMSLE) through a deep dive into model selection and feature engineering, placing in the top 3% of 3,274 teams on the final leaderboard.
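For readers unfamiliar with the metric, here is a minimal sketch of how RMSLE (root mean squared logarithmic error) can be computed. The function name `rmsle` and the sample prices are illustrative and do not come from the competition data.

```python
import math


def rmsle(y_true, y_pred):
    """Root mean squared logarithmic error between two equal-length sequences."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    # log1p(x) = log(1 + x), which keeps the metric defined for zero values.
    squared_log_errors = [
        (math.log1p(p) - math.log1p(t)) ** 2
        for t, p in zip(y_true, y_pred)
    ]
    return math.sqrt(sum(squared_log_errors) / len(squared_log_errors))


# Made-up prices in rubles, for illustration only.
actual = [5_000_000, 7_500_000, 12_000_000]
predicted = [5_500_000, 7_000_000, 10_000_000]
print(round(rmsle(actual, predicted), 4))
```

Because the error is measured on the log scale, a prediction that is 10% too high costs about the same for a cheap flat as for an expensive one.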

Major predictors used across models

Feature Selection with Random Forest & Lasso

We divided the features into 16 subgroups (e.g., demographics).

We ran random forests and Lasso on each subgroup.
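As a rough sketch of what running a random forest per subgroup might look like, the snippet below fits one forest per group and ranks the features inside each group by importance. The data is synthetic, and the group names, most column names, and the helper `rank_within_groups` are hypothetical; this is not the exact procedure we used.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n = 300

# Synthetic stand-in data; column names and groups are illustrative only.
df = pd.DataFrame({
    "full_sq": rng.uniform(20, 120, n),
    "num_room": rng.integers(1, 5, n).astype(float),
    "metro_km": rng.uniform(0.1, 10, n),
    "park_km": rng.uniform(0.1, 10, n),
})
log_price = 0.02 * df["full_sq"] - 0.05 * df["metro_km"] + rng.normal(0, 0.1, n)

subgroups = {
    "apartment": ["full_sq", "num_room"],
    "distance": ["metro_km", "park_km"],
}


def rank_within_groups(data, target, groups):
    """Fit one forest per subgroup and return importances sorted high to low."""
    ranking = {}
    for name, cols in groups.items():
        forest = RandomForestRegressor(n_estimators=100, random_state=0)
        forest.fit(data[cols], target)
        importances = pd.Series(forest.feature_importances_, index=cols)
        ranking[name] = importances.sort_values(ascending=False)
    return ranking


group_ranking = rank_within_groups(df, log_price, subgroups)
for name, importances in group_ranking.items():
    print(name, importances.round(3).to_dict())
```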

Interpretation: In most groups, LASSO retained all the features in the group (nonzero coefficients) at the lambda value that gives the minimum MSE.

To select features, we chose those whose coefficients shrink to zero most slowly as lambda increases.
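One way to express "goes to zero slowest" is to trace the LASSO regularization path and record, for each feature, the largest penalty at which its coefficient is still nonzero. The sketch below does this with scikit-learn's `lasso_path` (where lambda is called alpha) on synthetic data. The feature names and the helper `survival_alpha` are hypothetical, and this illustrates the idea rather than reproducing our exact workflow.

```python
import numpy as np
from sklearn.linear_model import lasso_path
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 400
feature_names = ["strong", "medium", "weak", "noise"]

# Synthetic data: the first three features matter, with decreasing strength.
X = rng.normal(size=(n, 4))
y = 3.0 * X[:, 0] + 1.0 * X[:, 1] + 0.3 * X[:, 2] + rng.normal(0, 0.5, n)

# lasso_path fits no intercept, so standardize X and center y first.
X_std = StandardScaler().fit_transform(X)
y_centered = y - y.mean()

# coefs has shape (n_features, n_alphas), one column per penalty value.
alphas, coefs, _ = lasso_path(X_std, y_centered)


def survival_alpha(coef_row, alphas):
    """Largest penalty at which this feature's coefficient is still nonzero."""
    nonzero = np.abs(coef_row) > 1e-10
    return float(alphas[nonzero].max()) if nonzero.any() else 0.0


survival = {
    name: survival_alpha(coefs[i], alphas)
    for i, name in enumerate(feature_names)
}
# Features that survive to the largest penalties come first.
ranked = sorted(survival, key=survival.get, reverse=True)
print(ranked)
```

Keeping the top of this ranking gives a smaller feature set even when the minimum-MSE fit keeps every coefficient nonzero.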

Thirty-five features turned out to be ideal in this case, with an MSE of 0.31 (RMSE of 0.56) on the training data set and 0.46 on Kaggle’s testing set (score 0.35).

Features Selected

  • Apartment characteristics
  • Distance to transportation, services, and lifestyle needs
  • Demographics
  • Neighborhood Characteristics 
  • Raion Characteristics

Multiple Linear Regression

We also tried multiple linear regression, using a few features that intuitively make sense for predicting housing prices.

For the simplest model, we used ‘full_sq’ (size of the unit), ‘ttk_km’ (distance to the Third Ring), and ‘public_transport_station_min_walk’ (minutes to walk to a public transportation station).

The result: this simple model gave a better result than the LASSO model with 35 features, with an RMSE of 0.499 on the training set and a Kaggle score of 0.37535.

Using 15 features, we lowered the RMSE a bit further, to 0.466 on the training set, with a Kaggle score of 0.35189.
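The three-feature model can be sketched as follows. The data here is synthetic (generated from a made-up price process so that the example runs end to end), the target column name `price_doc` is an assumption about the data layout, and fitting on the log of the price is our choice for this illustration so that the validation RMSE lines up with the RMSLE metric.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n = 500

# Synthetic stand-in for the training data; values are not real.
df = pd.DataFrame({
    "full_sq": rng.uniform(25, 150, n),
    "ttk_km": rng.uniform(0, 30, n),
    "public_transport_station_min_walk": rng.uniform(1, 40, n),
})
# Made-up price process, used only so the example has a target to fit.
log_price = (
    14.5
    + 0.012 * df["full_sq"]
    - 0.02 * df["ttk_km"]
    - 0.003 * df["public_transport_station_min_walk"]
    + rng.normal(0, 0.3, n)
)
df["price_doc"] = np.expm1(log_price)  # assumed target column name

features = ["full_sq", "ttk_km", "public_transport_station_min_walk"]
X_train, X_val, y_train, y_val = train_test_split(
    df[features], np.log1p(df["price_doc"]), test_size=0.2, random_state=0
)

model = LinearRegression().fit(X_train, y_train)

# RMSE on log1p(price) is the RMSLE of the implied price predictions.
val_rmsle = float(np.sqrt(np.mean((model.predict(X_val) - y_val) ** 2)))
print(dict(zip(features, model.coef_.round(4))))
print(round(val_rmsle, 3))
```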

Macro data may not be as helpful because it is time series data; if year and month are included as independent variables, they already incorporate the time element.


Feature Selection: 11 main features + 28 selected features + macro features

  • Macros: CPI, PPI, gdp_deflator, etc.
  • Feature Engineering
    • Density = Raion Population / Area Size
    • Monthly/Weekly Transaction Volume Count
    • df['dow'] = df.date.dt.dayofweek (using the pandas .dt datetime accessor)
    • Relative Floor = Floor / Max. No. of Floors
    • Avg. Room Size = Living Area / No. of Rooms
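The engineered features listed above can be sketched in pandas as below. The sample rows are made up, and the column names (`raion_popul`, `area_m`, `floor`, `max_floor`, `life_sq`, `num_room`) and the helper `add_engineered_features` are assumptions about the data layout rather than a copy of our code. Zero denominators are treated as missing so that bad records do not produce infinite values.

```python
import pandas as pd

# Tiny made-up sample; column names are assumptions about the data layout.
df = pd.DataFrame({
    "date": pd.to_datetime(
        ["2014-01-06", "2014-01-20", "2014-02-03", "2014-02-04"]
    ),
    "raion_popul": [100_000, 100_000, 50_000, 80_000],
    "area_m": [10_000_000.0, 10_000_000.0, 4_000_000.0, 8_000_000.0],
    "floor": [3, 9, 1, 5],
    "max_floor": [9, 9, 5, 0],  # 0 mimics a bad record
    "life_sq": [30.0, 45.0, 18.0, 60.0],
    "num_room": [1, 2, 1, 3],
})


def add_engineered_features(frame):
    """Return a copy of frame with the engineered columns added."""
    out = frame.copy()

    # Density = raion population / area size
    out["density"] = out["raion_popul"] / out["area_m"]

    # Monthly and weekly transaction volume counts
    out["year_month"] = out["date"].dt.year * 100 + out["date"].dt.month
    iso = out["date"].dt.isocalendar()
    out["year_week"] = (iso["year"] * 100 + iso["week"]).astype(int)
    out["month_count"] = out.groupby("year_month")["date"].transform("count")
    out["week_count"] = out.groupby("year_week")["date"].transform("count")

    # Day of week (Monday = 0)
    out["dow"] = out["date"].dt.dayofweek

    # Relative floor and average room size; zero denominators become NaN
    out["rel_floor"] = out["floor"] / out["max_floor"].where(out["max_floor"] > 0)
    out["avg_room_size"] = out["life_sq"] / out["num_room"].where(out["num_room"] > 0)
    return out


features = add_engineered_features(df)
print(features[["density", "month_count", "week_count", "dow",
                "rel_floor", "avg_room_size"]])
```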

Summary of Results

Conclusion: What to focus on in the real world

  • XGBoost gave the best RMSE result
  • LASSO and Random Forest performed worse than Multiple Linear Regression with 3 key features
  • Common sense vs. cumbersome models
  • Efficient and reasonable study design
  • Focus on a few key features
  • When starting a project, start from the most intuitive and simplest route
  • Kaggle is really addictive

About Authors

Daniel Rim

Daniel Rim has been working as a Quant Analyst, analyzing undervalued equity investments in Emerging Markets. His educational background is in fields such as mathematics, physics, and statistics. He has been immersing himself into the innovation and...

Choutine Zhou

Choutine Zhou is an innovative thinker and problem solver who delivers actionable insights using data science tools. She focuses on translating business cases into analytical processes, from data exploration to decision making. Choutine holds a bachelor's degree in...

Jade Ngoc Le-Cascarino

Jade graduated magna cum laude from Columbia University and Sciences Po Paris where she obtained two BA’s in Political Science and the Social Sciences. She is particularly interested in the intersection between politics and data science, fiercely advocating...