Kaggle competition (top 3%): Optimizing Russian housing price prediction through a deep dive into model selection and feature engineering
Russian Housing Market
The goal of this Kaggle competition is to predict Moscow housing prices, using data provided by Sberbank, through machine learning models and feature engineering.
We achieved a satisfactory Kaggle score of 0.314 (RMSLE) through a deep dive into machine learning model selection and feature engineering, placing in the top 3% of 3,274 teams on the final leaderboard.
Major predictors used across models
Feature Selection with Random Forest & Lasso
We divided the features into 16 subgroups (e.g., demographics).
We ran random forests and Lasso on each subgroup.
Interpretation: in most groups, LASSO indicated that all features in the group were significant at the lambda that minimizes MSE.
To select features, we therefore chose the coefficients that shrink to zero slowest as lambda increases (sketched below).
35 features came out to be ideal in this case, with an MSE of 0.31 (RMSE 0.56) on the training set, an RMSE of 0.46 on Kaggle's test set, and a Kaggle score of 0.35.
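A minimal sketch of this selection rule, assuming standardized numeric features in a matrix X and log prices in y (the function and variable names here are illustrative, not from our competition code):

```python
import numpy as np
from sklearn.linear_model import lasso_path
from sklearn.preprocessing import StandardScaler

def rank_by_shrinkage(X, y):
    """Rank features by how slowly their Lasso coefficients shrink to zero.

    lasso_path returns alphas in decreasing order, so the first alpha at
    which a coefficient becomes nonzero marks how long that feature
    'survives' as the penalty grows.
    """
    X_std = StandardScaler().fit_transform(X)         # standardize so the penalty treats features comparably
    alphas, coefs, _ = lasso_path(X_std, y)           # coefs: (n_features, n_alphas)
    nonzero = np.abs(coefs) > 1e-10
    first_active = nonzero.argmax(axis=1)             # index of largest alpha with coef != 0
    first_active[~nonzero.any(axis=1)] = len(alphas)  # never-active features go last
    return np.argsort(first_active)                   # slowest-to-zero features first
```

Running a rule like this per subgroup and pooling the top-ranked features is one way the 35-feature set above could be assembled.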
Features Selected
- Apartment characteristics
- Distance to transportation, services, and lifestyle needs
- Demographics
- Neighborhood Characteristics
- Raion Characteristics
Multiple Linear Regression
We also tried multiple linear regression using a handful of features that intuitively should matter for housing prices.
For the simplest model, we used 'full_sq' (size of the unit), 'ttk_km' (distance to the Third Ring Road), and 'public_transport_station_min_walk' (walking minutes to the nearest public transport station).
The result was that this simple model outperformed the 35-feature LASSO model, with an RMSE of 0.499 on the training set and a Kaggle score of 0.37535.
Using 15 features, we lowered the training RMSE a bit further, to 0.466, with a Kaggle score of 0.35189. A sketch of the three-feature model follows.
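Here is what the simplest model could look like, assuming the training data is in train.csv with a price_doc target column (as in the Sberbank dataset) and fitting on log prices so the training RMSE is comparable to the RMSLE leaderboard metric:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# The three intuitive predictors from the competition data.
features = ['full_sq', 'ttk_km', 'public_transport_station_min_walk']

train = pd.read_csv('train.csv')
X = train[features].fillna(train[features].median())  # simple median imputation
y = np.log1p(train['price_doc'])                      # log target, so RMSE tracks RMSLE

model = LinearRegression().fit(X, y)
rmse = np.sqrt(mean_squared_error(y, model.predict(X)))
print(f'Training RMSE (log scale): {rmse:.3f}')
```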
Macro data may not be as helpful, since it is time-series data; if year and month are included as independent variables, they already incorporate the time element.
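A minimal sketch of how those time variables can be derived, assuming a pandas DataFrame df with a date column (the same df used in the snippet below):

```python
import pandas as pd

# Assumes df has a 'date' column holding the transaction date.
df['date'] = pd.to_datetime(df['date'])
df['year'] = df.date.dt.year    # captures the macro-level trend
df['month'] = df.date.dt.month  # captures seasonality
```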
XGBoost
Feature selection: 11 main features + 28 selected features + macro features
- Macro features: CPI, PPI, GDP deflator, etc.
- Feature engineering (a code sketch follows this list):
  - Density = raion population / area size
  - Monthly/weekly transaction volume count
  - df['dow'] = df.date.dt.dayofweek (day of week, via pandas' datetime accessor)
  - Relative floor = floor / max number of floors
  - Average room size = living area / number of rooms
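A hedged sketch putting the engineered features and an XGBoost fit together; column names such as raion_popul, area_m, life_sq, and num_room follow the Sberbank data dictionary, and the hyperparameters are illustrative rather than the tuned values we used:

```python
import numpy as np
import pandas as pd
import xgboost as xgb

df = pd.read_csv('train.csv', parse_dates=['date'])  # 'date' as in the snippet above

# Engineered features from the list above.
df['density'] = df['raion_popul'] / df['area_m']                       # raion population / area size
df['dow'] = df.date.dt.dayofweek                                       # day of week
df['month_volume'] = df.groupby(df.date.dt.to_period('M'))['date'].transform('count')
df['rel_floor'] = df['floor'] / df['max_floor'].replace(0, np.nan)     # relative floor
df['avg_room_sq'] = df['life_sq'] / df['num_room'].replace(0, np.nan)  # average room size

features = ['full_sq', 'density', 'dow', 'month_volume', 'rel_floor', 'avg_room_sq']
dtrain = xgb.DMatrix(df[features], label=np.log1p(df['price_doc']))    # XGBoost handles NaNs natively

params = {'objective': 'reg:squarederror', 'eta': 0.05,
          'max_depth': 5, 'subsample': 0.7, 'colsample_bytree': 0.7,
          'eval_metric': 'rmse'}
model = xgb.train(params, dtrain, num_boost_round=400)
```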
Summary of Results
Conclusion: What to focus on in the real world
- XGBoost gave the best RMSE
- LASSO and random forest underperformed multiple linear regression with just 3 key features
- Common sense vs. cumbersome models
- Efficient and reasonable study design
- Focus on a few key features
- When starting a project, begin with the simplest, most intuitive route
- Kaggle is really addictive