Using Data to Predict Housing Prices in Russia
Predicting housing prices in Russia
Sberbank, Russia's largest bank, published housing data on Kaggle in early 2017. The bank's goal was to better understand which aspects (such as the size of one's home or proximity to a train station) affected individual housing prices.
Our team, consisting of Chris Behrens, Fu-Yuan Cheng, Jason Chiu, and Fouad Yared, cleaned the data sets, performed exploratory data analysis, and built linear regression, lasso regression, random forest, and xgboost models to address the task at hand.
The team used R for data cleaning, exploratory data analysis, principal components analysis, linear regression, and lasso regression. Python was used for k-means cluster analysis, random forest and xgboost.
Before beginning our analysis and model creation, we wanted to understand the variables. Sberbank's housing data was split into two groups: a training set for training our machine learning models and a test set whose prices we would predict.
Cleaning the data set
Our data cleaning process consisted of two steps: 1) finding which features contained extreme or erroneous values and 2) identifying and filling in missing data.
Since data cleaning was handled in R, we used the summary() and table() functions to find features that required further inspection and possible action. Extreme or erroneous values in those features were set to missing (NA).
In assessing missingness, the team found 51 of the 292 variables in the training set had missing values. Variables with missing values that were highly correlated with others were removed from our analysis. A variable that was missing more than 40% of its values (the count of hospital beds) was removed from our analysis.
Figure 0: The different types of variables found in Sberbank's Russian housing data set. The middle variable, property price, is the variable we are predicting. The other variables are used in creating our prediction.
Missing data was most often filled in with the median value of a given feature within its sub-area. K-Nearest Neighbors (KNN) was also used to impute missing values. It differed from the group-median approach in that it considered many variables when filling in missing data.
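The two imputation strategies can be illustrated with a minimal Python sketch. This is not the team's code (their cleaning was done in R); the toy values are made up, `full_sq` is an assumed column name for total area, and `n_neighbors=2` is an arbitrary choice for the example.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Toy data with made-up values; column names are assumptions for illustration.
df = pd.DataFrame({
    "sub_area": ["A", "A", "A", "B", "B", "B"],
    "full_sq": [40.0, 55.0, 60.0, 80.0, 95.0, 100.0],
    "kitch_sq": [6.0, np.nan, 8.0, 10.0, 12.0, np.nan],
})

# Option 1: fill each gap with the median of the same feature within its sub-area.
median_filled = df.copy()
median_filled["kitch_sq"] = df.groupby("sub_area")["kitch_sq"].transform(
    lambda s: s.fillna(s.median())
)

# Option 2: KNN imputation, which uses the other numeric features to find
# similar rows and averages their values for the missing feature.
numeric_cols = ["full_sq", "kitch_sq"]
knn_filled = df.copy()
knn_filled[numeric_cols] = KNNImputer(n_neighbors=2).fit_transform(df[numeric_cols])

print(median_filled)
print(knn_filled)
```

The group median only looks at one feature within one sub-area, while KNN draws on every column passed to the imputer, which is the difference described above.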
By inspecting the data set, our team found missingness was related to time. For instance, if we look at the eight building characteristics (including the year a home was built and the total square meters of a home), we see that 6 of these 8 variables were missing 100% of their values from Q3 2011 to Q1 2013. (Dates for the training data set ranged from Q3 2011 to Q2 2015.)
Figure 1: Missingness by quarter
Each panel represents one of eight building characteristics and the percentage of missing values for each quarter of the year, starting with Q3 2011 and ending with Q2 2015. In the top left we see build_year was missing 100% of its values from Q3 2011 to Q1 2013. The same pattern of missingness is shared by five other variables: kitch_sq, material, max_floor, num_room, and state.
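A table like the one behind Figure 1 could be computed along these lines. This is a hedged sketch in Python with made-up rows; `timestamp` is an assumed name for the transaction date column, and only two of the eight building characteristics are included.

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration; 'timestamp' is an assumed column name.
df = pd.DataFrame({
    "timestamp": pd.to_datetime(
        ["2011-08-20", "2011-09-01", "2012-11-15",
         "2013-05-02", "2013-06-10", "2014-01-07"]
    ),
    "build_year": [np.nan, np.nan, np.nan, 1985.0, np.nan, 2006.0],
    "num_room": [np.nan, np.nan, np.nan, 2.0, 3.0, 1.0],
})

# Group rows by calendar quarter and compute the share of missing values.
quarter = df["timestamp"].dt.to_period("Q")
pct_missing = df[["build_year", "num_room"]].isna().groupby(quarter).mean() * 100

print(pct_missing)
```

Each row of `pct_missing` is one quarter and each column is one feature, so a quarter where a feature is entirely absent shows up as 100.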
Also, missingness was often shared among variables. When one variable was missing a set number of values, a related variable would often have the same number of missing values as well.
Understanding our data through exploratory data analysis
To decide which variables to include in our models, our team looked at the one hundred variables most highly correlated with the price of housing. When a group of variables was highly correlated with one another, only one variable from the group was retained for our models.
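The selection rule described above can be sketched as follows. This is an illustration under stated assumptions, not the team's procedure: the helper `select_features`, the 0.9 redundancy threshold, and the synthetic columns are all hypothetical.

```python
import numpy as np
import pandas as pd

def select_features(df, target, top_n=100, redundancy_threshold=0.9):
    """Rank numeric features by |correlation| with the target, then keep
    only one feature from each highly inter-correlated group."""
    corr = df.select_dtypes("number").corr()
    ranked = corr[target].drop(target).abs().sort_values(ascending=False).head(top_n)
    kept = []
    for name in ranked.index:
        # Keep a feature only if it is not redundant with one already kept.
        if all(abs(corr.loc[name, k]) < redundancy_threshold for k in kept):
            kept.append(name)
    return kept

# Synthetic data: life_sq is nearly a rescaled copy of full_sq.
rng = np.random.default_rng(0)
full_sq = rng.uniform(30, 120, 200)
toy = pd.DataFrame({
    "full_sq": full_sq,
    "life_sq": 0.6 * full_sq + rng.normal(0, 1, 200),
    "cafe_count": rng.poisson(5, 200).astype(float),
})
toy["price"] = (
    toy["full_sq"] * 100_000 + toy["cafe_count"] * 20_000
    + rng.normal(0, 500_000, 200)
)

kept = select_features(toy, target="price")
print(kept)
```

Because the loop walks the features in order of correlation with price, the member of each correlated group that is kept is the one most strongly related to the target.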
The three variables most highly correlated with a home's price were the total area in square meters, the living area in square meters, and the number of rooms. An underlying issue is that none of these three variables follows a normal distribution. In a linear model, these skewed variables will need to be transformed, since they make it harder to satisfy the normality assumption (which applies to the residuals) of multiple linear regression and lasso regression.
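One common transformation for right-skewed variables is the logarithm. The sketch below uses synthetic log-normal data as a stand-in for a skewed area variable; the article does not state which transformation the team applied, so the use of `log1p` here is an assumption.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed variable standing in for an area feature.
rng = np.random.default_rng(0)
full_sq = pd.Series(rng.lognormal(mean=4.0, sigma=0.5, size=1000))

# log1p computes log(1 + x), which is safe when a value is zero.
full_sq_log = np.log1p(full_sq)

skew_before = full_sq.skew()
skew_after = full_sq_log.skew()
print(f"skewness before: {skew_before:.2f}, after: {skew_after:.2f}")
```

A skewness closer to zero after the transform indicates a more symmetric distribution.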
Figure 2: Exploratory data analysis showing a unit's total square meters against its price (top-left), the number of rooms against price (top-right), the livable square meters of a unit against its price (bottom-left), and the size of the kitchen against the price of a unit (bottom-right). On the right-hand side, a correlation matrix shows how strongly the building characteristics are related to one another.
Other building, location, lifestyle (the number of cafes, sport facilities, and/or shopping malls), education, and cultural characteristics were not as strongly correlated with price individually.
Figure 3: The 23 variables most highly correlated with a house's price, ranked by absolute value.
While correlations are useful in understanding the relationship of two variables, they do not take into account the effects of other variables, which will be considered through our multivariate models.
After selecting the most correlated and most unique features, we built multiple linear regression and lasso regression models to 1) examine the relationship between the remaining variables and a house's price and 2) further reduce the number of variables used in interpreting our model.
Building our linear machine learning models
Figure 4: Multiple Linear Regression Model Building Process
The team prioritized interpretability and story-driven model building and therefore adopted a bottom-up approach to feature selection and engineering (Figure 4).
From the very beginning, we classified features into subgroups, and unsupervised learning techniques were used to engineer additional features.
For example, distance to public transportation included walking and driving distances to metro stops, train stations, bus terminals, and railroad stations. Many transportation features were highly correlated, and a principal component analysis (PCA) was used to remove information redundancy and to reduce the number of dimensions.
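To illustrate how PCA collapses correlated distance features, here is a minimal Python sketch (the team ran PCA in R). The four synthetic columns and the 95% variance threshold are assumptions chosen for the example.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic distances (km) that all depend on one latent "remoteness" value,
# mimicking highly correlated transportation features.
rng = np.random.default_rng(0)
base = rng.uniform(0.5, 10.0, size=300)
X = np.column_stack([
    base + rng.normal(0, 0.2, 300),        # walking distance to metro
    base * 1.1 + rng.normal(0, 0.2, 300),  # driving distance to metro
    base + rng.normal(0, 0.5, 300),        # distance to railroad station
    base + rng.normal(0, 0.5, 300),        # distance to bus terminal
])

# Standardize first so that no feature dominates because of its scale.
X_scaled = StandardScaler().fit_transform(X)

# A float n_components keeps the fewest components that explain
# at least that fraction of the variance.
pca = PCA(n_components=0.95)
components = pca.fit_transform(X_scaled)

print(components.shape, pca.explained_variance_ratio_)
```

The resulting component scores can then replace the original distance columns as model inputs.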
We also used k-means clustering to examine the underlying patterns of environmental conditions in neighborhoods. We found that neighborhoods could be classified overall into three subgroups (Table 1):
- safe neighborhoods with no exposure to environmental toxins,
- neighborhoods that are exposed to a nuclear reactor and/or radioactive waste, and
- toxic industrial neighborhoods that are exposed to multiple sources of toxins, including a chemical industry, radioactive waste, and thermal power plants.
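A k-means grouping of this kind can be sketched in Python as follows. The two exposure scores and the three synthetic groups are hypothetical stand-ins for the environmental features; only the choice of three clusters comes from the text.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical exposure scores: column 0 = radiation, column 1 = industrial.
# Three synthetic groups: no exposure, radiation only, both.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [1.0, 0.0], [1.0, 1.0]])
X = np.vstack([c + rng.normal(0, 0.05, size=(50, 2)) for c in centers])

# Fit k-means with three clusters; n_init restarts guard against a poor start.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
labels = kmeans.labels_

print(np.bincount(labels))
print(kmeans.cluster_centers_.round(2))
```

The cluster label for each neighborhood can be added to the data as a new categorical feature.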
The team examined one subgroup of features at a time, such as education and lifestyle, and new features were then included in the model building.
Due to the high number of features, a combination of forward multiple linear regression (MLR) and lasso regression was used to identify the most important variables. Model fit was assessed using the training data R2 (coefficient of determination), AIC (Akaike information criterion), and root mean squared logarithmic error (RMSLE), as well as the Kaggle RMSLE (testing data).
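For reference, RMSLE is the square root of the mean squared difference between log(1 + predicted) and log(1 + actual). A small Python helper, written here for illustration, makes the definition concrete:

```python
import numpy as np

def rmsle(y_true, y_pred):
    """Root mean squared logarithmic error for non-negative values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # log1p(x) = log(1 + x), so a value of zero is handled without error.
    return float(np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2)))

# Made-up prices in rubles, for illustration only.
print(rmsle([2_000_000, 5_000_000], [2_200_000, 4_500_000]))
```

Because the error is measured on a log scale, it penalizes relative rather than absolute differences, which suits prices that span a wide range.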
We chose the forward selection method in multiple regression to retain only the variables that contributed significantly to the price. Lasso regression also selects a subset of the variables we originally pass in, since the coefficients of variables that are not useful predictors of price shrink to zero.
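The way lasso zeroes out unhelpful coefficients can be seen in a minimal Python sketch (the team fit lasso in R). The synthetic data and the penalty `alpha=0.1` are assumptions for the example; in practice the penalty would be tuned, for instance by cross-validation.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

# Synthetic data: only the first two of five features drive the target.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(0, 0.1, 200)

# Standardize so the penalty treats every feature on the same scale.
X_scaled = StandardScaler().fit_transform(X)

model = Lasso(alpha=0.1).fit(X_scaled, y)

# Features whose coefficients were not shrunk to exactly zero.
selected = np.flatnonzero(model.coef_)
print(model.coef_.round(3), selected)
```

The indices in `selected` are the variables lasso retains; the rest have coefficients of exactly zero and drop out of the model.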
Evaluating our linear machine learning models
While the training data R2, AIC, and RMSLE continued to improve, we soon ran into overfitting: our training error was much lower than our test error.
We examined the basic assumptions of MLR and found that the current model violated several of them, such as residual normality (Figure 5). Most importantly, non-linear patterns were observed in the plot of residuals vs. predicted log prices.
Figure 5. Residual Q-Q plot
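A Q-Q check of residual normality can be sketched in Python with SciPy (the team's diagnostics were done in R). The two residual samples below are synthetic: one drawn from a normal distribution and one deliberately skewed for contrast.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
normal_resid = rng.normal(0, 1, 500)            # well-behaved residuals
skewed_resid = rng.exponential(1.0, 500) - 1.0  # skewed residuals

# probplot returns the Q-Q points and a least-squares line through them;
# r is the correlation between sample and theoretical quantiles.
(_, _), (_, _, r_normal) = stats.probplot(normal_resid, dist="norm")
(_, _), (_, _, r_skewed) = stats.probplot(skewed_resid, dist="norm")

print(f"r (normal): {r_normal:.4f}, r (skewed): {r_skewed:.4f}")
```

An r value close to 1 means the points lie near a straight line, as expected for normal residuals; visible curvature in the plot lowers it.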
We further investigated the lines in the residual plot and found that many listings with different characteristics were sold at the same price. For example, around 800 properties were sold at 2 million rubles. With MLR, we saw a linear relationship between the total area (in square meters) of a property and its price, yet many properties with different characteristics were sold at the same price. This revealed a violation of linearity.
After careful consideration, the team decided that MLR would not fit the current data well. Instead, non-linear tree-based models, namely random forest and xgboost, were used to better understand the relationship between our predictors and our target variable.
Our error terms should be fairly consistent over time, yet the multiple lines in the residuals show that the linear models we used are not enough for accurately predicting housing prices.
We chose a random forest model for two reasons. First, a random forest can capture non-linear relationships between our predictors and target variable. Second, its feature importance output would help identify the most important features.
We started with a random forest model with the building characteristics, such as total square meters, the build year, and whether it was owner-occupied or investor-owned. Additional features were added to enhance our model fit.
We used 10-fold cross validation and grid search to find the best hyper-parameters, which included the number of estimators and the max features to be selected in each tree. We also used one-hot encoding in Python to convert our sub-areas and administrative regions into dummy variables, which we tested in the model.
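The tuning setup described above might look like the following sketch. The data is synthetic, the column names are assumed, and the grid is kept deliberately small so the example runs quickly; it is not the grid the team searched. Scoring RMSE on log-transformed prices is used here as a stand-in for RMSLE.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic listings with assumed column names.
rng = np.random.default_rng(0)
n = 120
df = pd.DataFrame({
    "full_sq": rng.uniform(30, 120, n),
    "num_room": rng.integers(1, 5, n),
    "sub_area": rng.choice(["A", "B", "C"], n),
})
area_factor = df["sub_area"].map({"A": 1.0, "B": 1.3, "C": 0.8})
price = df["full_sq"] * 100_000 * area_factor

# One-hot encode the categorical sub-area into dummy variables.
X = pd.get_dummies(df, columns=["sub_area"])
y = np.log1p(price)  # RMSE on log1p(price) corresponds to RMSLE on price

# Small illustrative grid over the two hyper-parameters named in the text.
param_grid = {"n_estimators": [20, 50], "max_features": [2, 4]}
search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid,
    cv=10,  # 10-fold cross validation
    scoring="neg_root_mean_squared_error",
)
search.fit(X, y)

print(search.best_params_, -search.best_score_)
```

`best_score_` is negated because scikit-learn maximizes scores, so errors are reported as negative values.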
Our models were evaluated with the root mean squared logarithmic error (RMSLE) and Kaggle RMSLE (testing data).
We determined that the best hyper-parameters for our models that included sub-areas and administrative regions were 400 trees with 20 max features and 400 trees with 12 max features. Random forest did not show the overfitting problems we saw with linear and lasso regression. Most importantly, it decreased the RMSLE (error) significantly.
We further investigated feature importance. The total square meters of a unit and the number of rooms turned out to be important features in our random forest model (which included sub-areas and administrative regions) and in our linear models. Figures 7 and 8 show the most important variables for our sub-area random forest model and for our administrative region random forest model, respectively.
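Feature importances can be read off a fitted scikit-learn random forest as shown in this sketch. The data is synthetic and the column names are assumed; the `noise` column is a deliberately irrelevant feature added for contrast.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic data in which price depends mostly on total area.
rng = np.random.default_rng(0)
n = 300
X = pd.DataFrame({
    "full_sq": rng.uniform(30, 120, n),
    "num_room": rng.integers(1, 5, n).astype(float),
    "noise": rng.normal(size=n),  # irrelevant feature for contrast
})
y = np.log1p(
    X["full_sq"] * 100_000 + X["num_room"] * 200_000
    + rng.normal(0, 100_000, n)
)

rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# Impurity-based importances, one per feature, summing to 1.
importances = pd.Series(rf.feature_importances_, index=X.columns)
importances = importances.sort_values(ascending=False)
print(importances)
```

These importances measure how much each feature reduces error across the trees' splits, so a feature can rank low here and still improve the overall fit.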
Interestingly, although administrative regions had a low feature importance, the RMSLE decreased when we took regions into consideration.