Predicting Ames Home Prices With Machine Learning Techniques
Predicting home prices has long been a challenge and in many ways an art, involving a plethora of variables and moving parts, creative thinking, and a vast amount of unpredictability. Kaggle, an online data science platform, recently featured a home price prediction competition with data from Ames, Iowa. Iowa is mainly known for its corn production; Ames, on the other hand, is known for Iowa State University and for having the largest garden gnome in the United States.
While the Kaggle competition does not count toward ranking points because the data are public, teams could still formally submit their predictions and showcase their machine learning and regression capabilities. The official Kaggle rules do not allow teams to use private additional data unless it has been shared with the other teams; however, since this was a class project, teams were not required to share the external data included in their models.
The Kaggle Ames dataset originally served as an alternative to the classic “Boston Housing” dataset available in most statistical platforms. The Ames dataset consists of about 80 variables describing thousands of home sales from 2006 to 2010, with a mix of categorical and numerical features and missing values scattered throughout.
In addition to the Ames dataset, we gathered data on corn prices, Ames unemployment, the Ames labor force, schools, mortgage rates from Fannie Mae, and the Dow Jones Real Estate Index (DJREI). It made sense to include corn prices because Iowa is known for its corn production. We included the labor market and rates variables (unemployment, labor force, and mortgage rates) because they are factors people might consider when buying a home. Different areas have different unemployment and labor force values, which tend to correlate with income. Mortgage rates matter because individuals generally have to apply for a mortgage to buy a house. The DJREI provides an overview of how the real estate industry is doing as a whole, so individuals might have stronger incentives to invest in real estate depending on where the index stands. Lastly, schools, particularly public schools, seemed important for home prices because public schools are assigned to certain districts or neighborhoods. By this notion, it can be inferred that certain neighborhoods and homes might be more valuable if the associated school is considered rigorous and prestigious.
Preliminary Data Cleaning
We first combined the training and test datasets so that we could add the outside data and impute missing values across the whole dataset. After merging the training, test, and outside datasets, we imputed the missing values according to the information given in the dataset description on Kaggle. For variables like ‘Alley’ and the basement variables (‘Bsmt…’), the description notes that a missing value means there is no alley or basement. Hence, missing ‘Alley’ and ‘Bsmt…’ values were filled with ‘None’ for categorical variables and 0 for numerical variables. We did the same for variables such as ‘Fence’ and ‘FireplaceQu’. For categorical variables not explained in the data description, we grouped the dataset by ‘Neighborhood’ and replaced each NA with the most common value for that neighborhood; for numerical variables, we instead used the mean of the values in that neighborhood.
Linear regression models require every column to be numerical, so we converted the categorical variables into dummy (one-hot) variables.
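The imputation and dummification steps above can be sketched with pandas. This is a minimal, hypothetical reconstruction, not the team's actual code: column names such as ‘Alley’, ‘Bsmt…’, and ‘Neighborhood’ follow the Kaggle data dictionary, while the helper function itself is an assumption.

```python
import pandas as pd

def clean_ames(df):
    """Sketch of the imputation and dummification steps described above."""
    df = df.copy()
    # Missing 'Alley', 'Bsmt...', 'Fence', and 'FireplaceQu' values mean the
    # feature is absent: fill categoricals with 'None' and numerics with 0.
    absent = [c for c in df.columns
              if c.startswith(('Alley', 'Bsmt', 'Fence', 'FireplaceQu'))]
    for col in absent:
        fill = 0 if pd.api.types.is_numeric_dtype(df[col]) else 'None'
        df[col] = df[col].fillna(fill)
    # Remaining gaps: most common value (categorical) or mean (numerical)
    # within the observation's neighborhood.
    for col in df.columns.drop('Neighborhood'):
        if df[col].isna().any():
            grp = df.groupby('Neighborhood')[col]
            if pd.api.types.is_numeric_dtype(df[col]):
                df[col] = df[col].fillna(grp.transform('mean'))
            else:
                df[col] = df[col].fillna(
                    grp.transform(lambda s: s.mode().iloc[0]))
    # One-hot encode the categorical columns for the linear models.
    return pd.get_dummies(df)
```

Note that the neighborhood-mode fill assumes each neighborhood has at least one observed value for the column being imputed.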
We know that we want to predict the house prices given the features in the dataset. Multiple linear regression assumes normality (formally of the model errors, and in practice a heavily skewed response tends to violate it). When we look at the histogram of the sale price variable, we notice that the distribution is right-skewed. Hence, we decided to use the log of the sale price as our response variable, which gives us a response that is approximately normally distributed.
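The effect of the log transformation can be illustrated on synthetic right-skewed prices (a stand-in assumption; the real data are the Ames sale prices):

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(0)
# Hypothetical right-skewed prices standing in for the Ames sale prices.
prices = rng.lognormal(mean=12, sigma=0.4, size=1000)

# log1p maps price -> log(1 + price), compressing the right tail;
# np.expm1 inverts the transform after prediction.
log_prices = np.log1p(prices)
```

After the transform, the sample skew of `log_prices` is close to zero, while the raw prices remain strongly right-skewed.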
Next, we removed outliers. To do this, we looked at a scatterplot of the log of the sale price versus the Living Area. We noticed four observations that didn’t seem to follow the trend of the majority of the data, so we removed these four observations.
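A hypothetical reconstruction of that filter is shown below; the threshold of 4000 square feet and the column name `GrLivArea` are assumptions for illustration, not values stated in the post.

```python
import pandas as pd

# Toy data: two typical homes and two large homes with unusually low prices.
df = pd.DataFrame({'GrLivArea': [1500, 1700, 4700, 5600],
                   'LogPrice':  [12.0, 12.2, 11.2, 11.3]})

# Drop points with a large living area but a low log sale price,
# since they do not follow the trend of the rest of the data.
mask = ~((df['GrLivArea'] > 4000) & (df['LogPrice'] < 12.0))
df = df[mask]
```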
We then looked at the distribution of the numerical features. We calculated the skew of each feature, and if a feature's skew exceeded 0.75, we applied a log transformation to it. The motivation was the same as for the sale price: taking the log of a skewed variable yields an approximately normal distribution.
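This skew-threshold transform can be sketched as a small helper; the function name is hypothetical, but the 0.75 cutoff is the one described above.

```python
import numpy as np
import pandas as pd
from scipy.stats import skew

def log_skewed(df, threshold=0.75):
    """Log-transform numeric columns whose skew exceeds the threshold.

    log1p is used so that zero values (e.g. homes with no basement
    area) remain valid inputs.
    """
    df = df.copy()
    num_cols = df.select_dtypes(include=np.number).columns
    skewed = [c for c in num_cols if skew(df[c].dropna()) > threshold]
    df[skewed] = np.log1p(df[skewed])
    return df
```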
Once we finished the data cleaning and feature engineering, we divided the dataset back into the training and test datasets.
Before modeling, we combined the training and test datasets so that we could add the extra data and impute missing values on the full dataset. As mentioned in the ‘Data’ section above, we gathered data on corn prices, Ames unemployment, the Ames labor force, schools, mortgage rates from Fannie Mae, and the Dow Jones Real Estate Index (DJREI), and merged this outside data with the full dataset. The next steps were imputing the missing values and taking the log of the skewed variables and the response variable; these steps and their motivation are discussed in the ‘Preliminary Data Cleaning’ and ‘Feature Engineering’ sections above.
The first model we tested was the random forest regressor. We searched for the parameters giving the highest mean cross-validation score, tuning the minimum number of samples required to split a node and the minimum number of samples required at a leaf node. The best random forest required at least 10 samples to split a node and at least 2 samples at a leaf node. We then calculated the RMSLE of the model. Additionally, we collected the features that had a positive importance in the model and saved them for later use in the multiple linear regression models.
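The random forest search above can be sketched with scikit-learn's `GridSearchCV`. The synthetic data is a stand-in for the prepared Ames matrix; since the response is already log-transformed, ordinary RMSE on it corresponds to the competition's RMSLE.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Toy stand-in for the cleaned, dummified Ames feature matrix.
X, y = make_regression(n_samples=200, n_features=10, noise=0.1,
                       random_state=0)

# Cross-validated search over the two tuned parameters.
grid = GridSearchCV(
    RandomForestRegressor(n_estimators=50, random_state=0),
    param_grid={'min_samples_split': [2, 5, 10],
                'min_samples_leaf': [1, 2, 4]},
    scoring='neg_root_mean_squared_error',
    cv=5)
grid.fit(X, y)
best_rf = grid.best_estimator_

# Keep only features with positive importance for the later linear models.
selected = np.where(best_rf.feature_importances_ > 0)[0]
```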
The second tree model we tested was the gradient boosting model. We plotted the cross-validation scores versus the number of estimators and chose the number of estimators by inspecting the graph, picking a point where the test and training curves only slightly differed. We would not want to choose a value where the two curves had diverged widely, as the model would be overfitted there.
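One way to obtain those train/test error curves without refitting the model for every estimator count is `staged_predict`, which yields predictions after each boosting round. This is a sketch on synthetic data, not the post's actual procedure:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Toy stand-in for the prepared Ames matrix.
X, y = make_regression(n_samples=300, n_features=10, noise=5.0,
                       random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

gbm = GradientBoostingRegressor(n_estimators=500, random_state=0)
gbm.fit(X_tr, y_tr)

# Error after each boosting round, for the train and test curves.
test_err = [mean_squared_error(y_te, p) for p in gbm.staged_predict(X_te)]
train_err = [mean_squared_error(y_tr, p) for p in gbm.staged_predict(X_tr)]

# One reasonable stopping choice: the round with the lowest test error.
best_n = int(np.argmin(test_err)) + 1
```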
We then tested four separate multiple linear regression models. The first was a ridge regression on all the features of the dataset. We computed the cross-validated RMSE over a range of alpha values, depicted in a graph in the ‘Results’ section below, and chose the alpha where the RMSE was lowest. We then fit the ridge regression with this alpha and found the RMSLE for this model.
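The alpha sweep can be sketched as follows; the alpha grid here is an assumption, and the synthetic data again stands in for the Ames matrix.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Toy stand-in for the prepared Ames matrix.
X, y = make_regression(n_samples=200, n_features=50, noise=10.0,
                       random_state=0)

# Cross-validated RMSE for each candidate alpha; pick the minimizer.
alphas = np.logspace(-2, 3, 30)
cv_rmse = [np.sqrt(-cross_val_score(Ridge(alpha=a), X, y,
                                    scoring='neg_mean_squared_error',
                                    cv=5).mean())
           for a in alphas]
best_alpha = float(alphas[int(np.argmin(cv_rmse))])
```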
The next model was a lasso regression on all the features of the dataset. The lasso regression model allows us to remove variables that are multicollinear or that have no effect on the model. Through cross-validation, we found the alpha that gave the highest accuracy on the training dataset, then found the RMSLE for this model.
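scikit-learn's `LassoCV` bundles the cross-validated alpha search; the zeroed coefficients are the dropped variables. A sketch on synthetic data:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV

# Toy stand-in: only 10 of the 50 features are actually informative.
X, y = make_regression(n_samples=200, n_features=50, n_informative=10,
                       noise=10.0, random_state=0)

# LassoCV picks alpha by cross-validation; coefficients shrunk exactly
# to zero correspond to features removed from the model.
lasso = LassoCV(cv=5, random_state=0).fit(X, y)
n_dropped = int(np.sum(lasso.coef_ == 0))
```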
The next two multiple linear regression models were ridge and lasso regressions restricted to the features that had a positive importance in the random forest regressor, repeating the steps described above on this reduced feature set. Finally, we stacked all six models, weighting each model by the inverse of its error, and found the RMSLE for the stacked model so we could compare it to the RMSLEs of the other models.
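The inverse-error weighting can be written in a few lines; the function name is hypothetical, but the scheme matches the description above.

```python
import numpy as np

def stack_predictions(preds, errors):
    """Inverse-error weighted average of model predictions: models with
    lower error receive proportionally larger weights."""
    preds = np.asarray(preds, dtype=float)     # shape (n_models, n_samples)
    w = 1.0 / np.asarray(errors, dtype=float)  # inverse of each model's error
    w /= w.sum()                               # normalize weights to sum to 1
    return w @ preds

# e.g. two models predicting log prices for the same two houses:
# stack_predictions([[12.0, 12.4], [12.2, 12.6]], errors=[0.10, 0.30])
```

With errors 0.10 and 0.30, the first model receives weight 0.75 and the second 0.25, so the more accurate model dominates the ensemble.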
Random Forest Regressor
Ridge Regression on All Features
Lasso Regression on All Features
Ridge Regression on Selected Features
Lasso Regression on Selected Features
Final Root Mean-Squared Logarithmic Error for Each Model
When building a machine learning model on a set of data, it is best to test different models to find the best model or group of models to use. Because the response variable was numerical, we used various multiple regression models. We also decided to use two tree-based machine learning models due to their usefulness in data exploration.
It was important to use penalized regression models (ridge and lasso) when modeling our data to reduce the effect of multicollinearity on the model. The dataset included 80+ variables, so we were sure that multicollinearity was present in our data. By using penalized regression models, we can reduce the number of features in our model and improve its accuracy. The tree-based models allowed us to do a sort of ‘feature selection’ by finding the features in the dataset that had a positive importance for the house prices. We then tested two more penalized regression models on this set of important features. This grouping of models reduced the number of predictor variables even further but did not yield a lower RMSLE.
Finally, we stacked all six models. Each of the models is overfitted to some degree, which is why stacking them would make the new predictions more accurate and decrease the variance. To stack the models, we weighted each model by the inverse of its error, so that models with higher accuracy carried greater weight in the final ensemble than models with lower accuracy. We then found the RMSLE for this ensemble and compared it to the RMSLEs of the other models. Just as we expected, the stacked model achieved the lowest root mean squared logarithmic error of the seven.
The ridge regression model worked better than the lasso regression model both when using all the features and when using the features selected by the tree-based models. The tree-based models were not as accurate at predicting the house prices as the penalized regression models.
The stacked model has the lowest root mean squared logarithmic error, and thus, we used this model to predict the house prices for the houses in the test dataset.
Additionally, when we predicted the house prices using the stacked model, we rounded the predicted prices of the houses up to the nearest thousand. Because our response variable is the log of the house price, it is better to round up, as the penalty for overshooting is less than the penalty for undershooting on a logarithmic scale.