Kaggle Competition: House Pricing in Ames, Iowa
Machine learning has seen rapid and successful deployment across many fields in recent years, and one of its applications is home valuation. For this machine-learning-centric project, we used data from a Kaggle competition to predict house prices in Ames, Iowa. The rich dataset of 2919 homes (1460 in the training set) evaluated across 80 features provided excellent material for practicing our Exploratory Data Analysis, Imputation, Feature Engineering, Linear-Based Modeling, Tree-Based Modeling and Ensembling skills. Although we wanted to do well in the competition, our main goal was to thoroughly understand how to use supervised machine learning techniques.
Workflow
We went through a full development cycle divided into five main stages: Exploratory Data Analysis, Imputation, Feature Engineering, Modeling, and Hyperparameter Tuning.
The machine learning models we developed were:
- Linear-Based Models
  - Simple Multi-Linear
  - Ridge Regression
  - Lasso Regression
- Tree-Based Models
  - Decision Trees
  - Random Forests
  - Gradient Boosting
- Ensemble Models
  - Ridge Regression + Lasso Regression + Random Forests
Exploratory Data Analysis (EDA)
To gain a sense of the relationship of the features with each other and with home sales prices, we employed a diverse set of data visualization tools, including the following: density plots, scatterplots, boxplots, and correlation plots.
The first EDA we performed was to examine the distribution of the home sale prices. The histogram of home sale prices appeared to be right-skewed. We therefore performed a log transformation of the home sales prices to make the distribution more Gaussian. The following are histograms of home sales prices before and after the log-transformation:
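As a minimal sketch of the transformation described above, assuming the target lives in a pandas Series named `SalePrice` (the prices below are made up for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical sale prices with a long right tail
prices = pd.Series([105000, 129000, 143000, 160000, 181000, 214000, 755000],
                   name="SalePrice")

log_prices = np.log(prices)        # np.exp reverses the transformation

skew_before = prices.skew()        # sample skewness of the raw prices
skew_after = log_prices.skew()     # sample skewness after the log transform
```

Predictions made on the log scale need to be passed through `np.exp` before they are read as dollar amounts.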
The second EDA we performed was to create a matrix of Table Plots of the features (x-axis) against the target variable (home sales price). To interpret these density plots, in general we looked for two things: 1) a linear relationship between the feature and the target variable and 2) variation in the density of each feature value versus home sales price. The following is an example of some of the density plots for features we found to have a strong relationship with home sale price:
Other plots we used to explore the relationship between features and home sales prices included scatterplots and boxplots. Examples include the following:
Lastly, we also explored the correlation between the various features using correlation matrices. The following correlation matrix shows the correlations between some of these features, with darker colors indicating higher correlations.
Based primarily on these various EDA results, we narrowed the list of features to around twenty which we considered in our base multiple linear regression model. This will be discussed later in the blog post.
Data Cleaning
In order for machine learning algorithms to work properly, we need to feed them clean data. For our dataset, this means we need to change some feature types and also deal with missing values.
Type Conversion
Features | MSSubClass, OverallCond, OverallQual, GarageCars, YrSold, MoSold |
The features listed above are all categorical, even though the values are entered as numbers. The different values represent different types, not different amounts, of something. This is easiest to see in the case of MSSubClass, where the numbers are codes for different categories. But even in the case of a feature like GarageCars, where the number does actually count something (cars), we treat it as a categorical feature because it differentiates types of garages (1-car, 2-car, etc.), not amounts of garages.
To make sure our machine learning algorithms treat these features as categorical, we change their types to string.
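A short sketch of this conversion in pandas, using a toy DataFrame (the rows are invented; only the column names come from the dataset):

```python
import pandas as pd

# Toy rows for illustration
df = pd.DataFrame({
    "MSSubClass": [20, 60, 120],
    "GarageCars": [1, 2, 2],
    "GrLivArea": [1200, 1710, 1500],
})

categorical_as_numbers = ["MSSubClass", "GarageCars"]
df[categorical_as_numbers] = df[categorical_as_numbers].astype(str)

# Each string level now becomes its own indicator column
dummies = pd.get_dummies(df, columns=categorical_as_numbers)
```

After the conversion, `get_dummies` produces columns such as `GarageCars_2` instead of treating the value as a quantity.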
Missing Values
We used the following strategies for dealing with missing values:
Flagging as 'None'
Features | Alley, BsmtCond, BsmtQual, BsmtExposure, BsmtFinType1, BsmtFinType2, Fence, Functional, FireplaceQu, GarageCond, GarageFinish, GarageQual, GarageType, MasVnrType, MiscFeature, MSSubClass, PoolQC |
In many cases, we want the model to treat observations with missing values as a separate category. For example, we know from the data description that a missing value for ‘PoolQC’ means that the house does not have a pool. It is important to let the algorithm know that some homes do not have pools, because this may affect their value, so we flag the missing values as ‘none’.
This same rationale applies to all but one of these features: the house in question does not have the attribute being measured, so we enter the value as 'none'. The only exception is 'Functional'. We still want to flag the missing values for this feature, but we assign the value ‘typ’ instead of 'none' because the data description says that missing values here mean ‘typical functionality’.
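In pandas, this kind of flagging could look like the sketch below. The toy rows and the exact label strings (`"None"`, `"Typ"`) are assumptions for illustration:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "PoolQC": ["Gd", np.nan, np.nan],
    "Fence": [np.nan, "MnPrv", np.nan],
    "Functional": ["Typ", np.nan, "Min1"],
})

# Missing means the house lacks the attribute, so make that its own category
none_cols = ["PoolQC", "Fence"]
df[none_cols] = df[none_cols].fillna("None")

# Functional is the exception: missing means typical functionality
df["Functional"] = df["Functional"].fillna("Typ")
```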
Impute Zero
Features | BsmtFinSF1, BsmtFinSF2, BsmtUnfSF, BsmtFullBath, BsmtHalfBath, GarageArea, GarageCars, TotalBsmtSF |
For numeric features, when the house does not have the attribute being measured, it usually works to impute zero. It makes sense, for example, that the area of a missing garage is zero square feet, and that a missing basement has zero bathrooms.
Impute the Mode
Features | Exterior1st, Exterior2nd, KitchenQual, Electrical, MSZoning |
For these categorical features, we knew the house had the attribute being measured, so we could not impute 'none'. In every case, one value dominated the data, so we imputed the mode. Assuming the data are missing completely at random, the missing observations most likely share the typical value of the rest.
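A minimal sketch of mode imputation, again on invented rows:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "KitchenQual": ["TA", "TA", "Gd", np.nan, "TA"],
    "Electrical": ["SBrkr", np.nan, "SBrkr", "FuseA", "SBrkr"],
})

for col in ["KitchenQual", "Electrical"]:
    # mode() skips NaN and returns a Series, so take its first entry
    most_common = df[col].mode()[0]
    df[col] = df[col].fillna(most_common)
```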
Impute the Median by Neighborhood
Features | LotFrontage |
For LotFrontage, we needed to impute a value because it does not make sense that a house is actually missing the attribute. Instead of imputing the median value for the entire dataset, we decided to impute the median for the neighborhood the house is located in to give us a more accurate estimate.
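One way to express the neighborhood-level median in pandas is a grouped `transform`, sketched here on toy data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Neighborhood": ["NAmes", "NAmes", "NAmes", "Edwards", "Edwards", "Edwards"],
    "LotFrontage": [70.0, 80.0, np.nan, 50.0, np.nan, 60.0],
})

# Median per neighborhood, aligned row by row with the original frame
neighborhood_median = df.groupby("Neighborhood")["LotFrontage"].transform("median")
df["LotFrontage"] = df["LotFrontage"].fillna(neighborhood_median)
```

Because `transform` returns a Series with the same index as `df`, each missing row is filled from its own neighborhood rather than from the overall median.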
Feature Engineering
It's well known that better quality features produce better quality models, and one way to improve the quality and variety of features is to strategically create new ones by combining existing ones. However, just adding more features isn’t necessarily helpful because we will then need to deal with multicollinearity, the ‘curse of dimensionality’, and increased processing time. Since there is a cost to adding features, we tried to be strategic about which ones to add.
Derived Features
Features | TotalSF = TotalBsmtSF + GrLivArea |
 | HighQualFinishedSF = TotalSF - LowQualFinSF |
 | TotalBaths = FullBath + BsmtFullBath + 0.5 * (HalfBath + BsmtHalfBath) |
Derived features are obtained by performing arithmetic on two or more similar features to produce another one. For example, we added several similar features describing the number of baths to obtain the overall number for the house, and we obtained total square footage and total high-quality square footage by performing arithmetic on features that measured square footage.
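The three formulas in the table translate directly into column arithmetic. The two rows below are invented to show the calculation:

```python
import pandas as pd

df = pd.DataFrame({
    "TotalBsmtSF": [800, 0],
    "GrLivArea": [1500, 1100],
    "LowQualFinSF": [0, 100],
    "FullBath": [2, 1],
    "HalfBath": [1, 0],
    "BsmtFullBath": [1, 0],
    "BsmtHalfBath": [0, 1],
})

df["TotalSF"] = df["TotalBsmtSF"] + df["GrLivArea"]
df["HighQualFinishedSF"] = df["TotalSF"] - df["LowQualFinSF"]
# Half baths count for half a bathroom each
df["TotalBaths"] = (df["FullBath"] + df["BsmtFullBath"]
                    + 0.5 * (df["HalfBath"] + df["BsmtHalfBath"]))
```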
Feature Interactions
Features | OverallQuality*OverallCond | ExteriorQual*ExteriorCond |
 | BsmtQual*BsmtCond | GarageQual*GarageCond |
 | HeatingQual*HeatingCond | SaleType*SaleCond |
 | Neighborhood*BldgType | |
Features in a dataset sometimes interact with each other, which happens when a change in one feature increases or diminishes the effect of another. For example, if you have two homes with a '5' for ExteriorCond but different scores (a '2' and a '10') for ExteriorQual, the '5' should be weighted differently based on the quality score. The same condition score means something different when a house is low quality vs high quality, and vice versa (the same quality means something different when it is in low condition vs high condition). The top three rows in the table above list features that, like ExteriorQual*ExteriorCond, capture interactions between two different measures of the same thing.
We included the feature in the last row because we think there may also be an interaction between type of building and neighborhood. Maybe townhouses are predictably cheap in one neighborhood and predictably expensive in another. And, vice versa, maybe a particular neighborhood has predictably expensive townhouses, and predictably cheap single family homes.
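The post does not spell out how the interaction columns were built, so the sketch below rests on two assumptions: numeric scores are multiplied, and categorical pairs are joined into a single combined label. The output column names are hypothetical:

```python
import pandas as pd

df = pd.DataFrame({
    "OverallQual": [2, 10],
    "OverallCond": [5, 5],
    "Neighborhood": ["Edwards", "NridgHt"],
    "BldgType": ["Twnhs", "1Fam"],
})

# Numeric interaction: the same condition score is scaled by quality
df["OverallQual_x_OverallCond"] = df["OverallQual"] * df["OverallCond"]

# Categorical interaction: one level per (neighborhood, building type) pair
df["Neighborhood_x_BldgType"] = df["Neighborhood"] + "_" + df["BldgType"]
```

With the combined label, a model can assign townhouses in one neighborhood a different effect from townhouses in another, which a separate Neighborhood term and BldgType term cannot do on their own.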
Multiple Linear Regression
For our base model, we used a multiple linear regression model. We selected 20 features that we believed were the most promising based on the EDA we performed as described previously. We further narrowed the list of 20 features by organizing them into five main categories based on our understanding of how a typical home buyer or investor would assess a home. The table below summarizes our analysis:
Location | Style | Condition | Size | Other |
Neighborhood | HouseStyle | OverallQual | GrLivArea | SaleCondition |
MSZoning | Foundation | OverallCond | 1stFlrSF, 2ndFlrSF | SaleType |
 | GarageFinish | YrBuilt | FullBath | |
 | Paved Drive | ExterQual | TotRms | |
 | | BsmtQual | GarageCars | |
 | | | GarageArea | |
The features highlighted in red were the ones we ultimately selected to run in our initial linear regression model. With the exception of GrLivArea, all the features are categorical features. Before dummifying these categorical features, we further grouped the values of these categorical features into quantiles based on the relationship of each feature with the target variable. One reason is to limit the number of explanatory variables in our model after dummification so as to lower the possibility of multicollinearity issues. We also performed a variance inflation factor (“VIF”) analysis to gain further comfort. In general, a VIF score above 5 indicates that multi-collinearity might be an issue. After dummification, all the explanatory variables we chose had a VIF score below 5.
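To make the VIF check concrete, here is a NumPy-only sketch. It relies on the identity that the VIF of each predictor equals the corresponding diagonal entry of the inverse of the predictors' correlation matrix. The function name `vif_scores` and the synthetic data are illustrative, not part of our original analysis:

```python
import numpy as np

def vif_scores(X):
    """VIF for each column of X: diagonal of the inverse correlation matrix."""
    corr = np.corrcoef(X, rowvar=False)
    return np.diag(np.linalg.inv(corr))

rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)                 # independent of a
c = a + 0.1 * rng.normal(size=500)       # nearly a copy of a

vifs = vif_scores(np.column_stack([a, b, c]))
```

In this synthetic example the two nearly identical columns land far above the threshold of 5, while the independent column stays close to 1.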
Our model selection process was the following:
- We split the training set 80%/20% into sub-training and test sets, respectively.
- We fit the model against the 80% sub-training set and then tested this against the remaining 20%.
- After testing, we then fit the model to the entire training set.
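The three steps above can be sketched with scikit-learn as follows. The synthetic `X` and `y` stand in for the dummified features and the log sale price, and the `random_state` value is arbitrary:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))                      # stand-in features
y = X @ np.array([0.3, -0.2, 0.1]) + 12 + 0.05 * rng.normal(size=200)

# 80% sub-training set, 20% hold-out test set
X_sub, X_test, y_sub, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

model = LinearRegression().fit(X_sub, y_sub)
r2_test = model.score(X_test, y_test)              # R-squared on the hold-out

# After testing, refit on the entire training set
final_model = LinearRegression().fit(X, y)
```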
The R-squared values across the various models we trained and tested ranged from approximately 80 percent to 81 percent. As part of our residuals analysis after we fit the model against the entire training set, we examined whether there were any influential points that may have had an outsized influence on the regression. As the following graph shows, there are two observations (#523 and #1298) which stand out based on their influence as represented graphically by the sizes of their circles.
Based on further review, we noted that the sale prices for these two homes were very low relative to their living areas, even in comparison with other homes in the Edwards neighborhood, where they are located. Because we couldn’t detect any patterns or features that might explain this, we made the decision to exclude these two data points from our analysis. After re-running the regression without these two data points, we arrived at an R-squared value of around 81 percent for the entire training set. We further note that the regression is significant at the 5 percent significance level and that the features with the highest absolute beta coefficient values were the top quantiles in terms of quality of the home and the home sales price.
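One common way to quantify this kind of influence is Cook's distance, which combines the size of a residual with the leverage of the observation. The sketch below is an assumption about how such a check could be done, not a record of the exact diagnostic behind our plot, and the planted outlier is invented:

```python
import numpy as np

def cooks_distance(X, y):
    """Cook's distance for each observation of an OLS fit with intercept."""
    X1 = np.column_stack([np.ones(len(X)), X])   # add intercept column
    H = X1 @ np.linalg.pinv(X1)                  # hat matrix
    h = np.diag(H)                               # leverages
    resid = y - H @ y
    n, p = X1.shape
    s2 = resid @ resid / (n - p)                 # residual variance estimate
    return resid**2 / (p * s2) * h / (1 - h) ** 2

rng = np.random.default_rng(0)
area = rng.uniform(1000, 3000, size=50)
price = 100 * area + rng.normal(scale=10000, size=50)

# Plant one very large home sold far below the trend (hypothetical values)
area = np.append(area, 5600.0)
price = np.append(price, 160000.0)

d = cooks_distance(area.reshape(-1, 1), price)
most_influential = int(np.argmax(d))
```

The planted home has both high leverage (unusual living area) and a large residual (low price), which is the combination that made the two homes in our data stand out.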
Ridge and Lasso Linear Regressions
The process we used to train and test the Ridge and Lasso linear regression models was similar to the one we used for the multiple linear regression model. The major difference was the added step of tuning the hyperparameter (alpha) that controls the strength of the penalty term: L2 for Ridge and L1 for Lasso. The following was the process we used:
- We split the training set 80%/20% into sub-training and test sets, respectively.
- We used a grid search in combination with a 5-fold cross-validation process to select our hyperparameter.
- We fit the model against the 80% sub-training set and then tested this against the remaining 20%.
- After testing, we then fit the model to the entire training set.
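A sketch of this process for Ridge using scikit-learn's `GridSearchCV` is shown below. The alpha grid and the synthetic data are assumptions for illustration. Swapping `Ridge` for `Lasso` gives the L1 version:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = X[:, 0] - 0.5 * X[:, 1] + 0.1 * rng.normal(size=200)

X_sub, X_test, y_sub, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

search = GridSearchCV(
    Ridge(),
    param_grid={"alpha": np.logspace(-3, 2, 20)},  # hypothetical grid
    scoring="neg_root_mean_squared_error",         # lower RMSE is better
    cv=5,                                          # 5-fold cross-validation
)
search.fit(X_sub, y_sub)

best_alpha = search.best_params_["alpha"]
# best_estimator_ is already refit on the whole sub-training set
r2_test = search.best_estimator_.score(X_test, y_test)

# After testing, refit on the entire training set with the chosen alpha
final_model = Ridge(alpha=best_alpha).fit(X, y)
```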
Ridge Linear Regression:
For the Ridge linear regression model, we expanded our features list to include many of the interactions described previously. Given the presence of the L2 regularization term, we felt reasonably comfortable with expanding our features list. The following graph illustrates the result of our grid-search analysis used to tune the hyperparameter. We note that the optimal hyperparameter, in which the root mean squared error is at the minimum, is an alpha of 18.7.
Using an alpha of 18.7, we note that the R-squared values across the various models we trained and tested ranged from approximately 91 percent to 94 percent. We also note that the R-squared value is lower for the test set than the R-squared value for the training set, which indicates that there may be some overfitting. Consistent with the results from the baseline multiple linear regression model, the features with the highest absolute beta coefficient values were those related to the quality and condition of the homes, the neighborhood, and the interaction between them.
Lasso Linear Regression:
For the Lasso linear regression model, we regressed the same set of initial features from the Ridge model against home sales prices. We also employed the same hyperparameter tuning process but interestingly, the optimal hyperparameter for the L1 regularization term was much smaller at 0.01. The following line chart provides a graphical representation of our grid search results.
Using an alpha of 0.01, we note that the Lasso performed worse than the Ridge model in terms of predictive accuracy. The R-squared value for the Lasso models we trained and tested hovered around 84 percent. However, consistent with the results from the previous linear regression models, the features with the highest absolute beta coefficient values were those related to the quality and condition of the homes and neighborhood. It is also interesting to note that the combination of home sale type and sale condition had the highest absolute beta coefficient value.
Tree Based Models
Tree-based methods can deliver high accuracy and stability while remaining relatively easy to interpret. Unlike linear models, they capture non-linear relationships well, and they can be applied to both classification and regression problems.
We developed three classes of tree-based models: Decision Trees, Random Forests and Gradient Boosting. It was interesting to evaluate and compare the distinctive characteristics and performance of each of the three.
In order to get a grasp on the underlying behaviors, we produced Plots of RMSLE against a range of Hyperparameters. These plots were instrumental in three ways:
- Visualizing the underlying trends as hyperparameters are varied
- Identifying the regions of Bias vs Variance trade-offs
- Acquiring an estimated range of values to feed into the Cross Validation Grid Search
Subsequently, Cross Validation Grid Search was performed on each of the Tree based Models with their respective Hyperparameters. This resulted in selection of the “best” Hyperparameters based upon the Accuracy against the Test Set.
It was interesting to note the distinction between each of the Tree based Models by examining their Prediction Profiles. The Decision Tree exhibits clear Discrete Steps in the Prediction Plots owing to its simplistic model. The Random Forests and Gradient Boosted models achieved a better fit to the actual Prediction Profile from the Training Set. This is attributed to Bagging and Boosting in Random Forests and Gradient Boosting respectively.
Besides that, the Variable Importance Plots gave valuable, under the hood insights on the various Tree based models. They uncover which Variables were ultimately utilized by the models and their relative importance. Some interesting takeaways:
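A sweep of this kind can be sketched for a single hyperparameter, here `max_depth` of a decision tree. The data is synthetic, and because the stand-in target plays the role of the log sale price, the cross-validated RMSE on it corresponds to RMSLE on the original price scale:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(300, 2))
y = np.sin(X[:, 0]) + 0.1 * X[:, 1] + 0.1 * rng.normal(size=300)

depths = range(1, 11)
cv_rmse = []
for depth in depths:
    tree = DecisionTreeRegressor(max_depth=depth, random_state=0)
    scores = cross_val_score(tree, X, y, cv=5,
                             scoring="neg_root_mean_squared_error")
    cv_rmse.append(-scores.mean())     # flip sign back to a positive error

best_depth = depths[int(np.argmin(cv_rmse))]
```

Plotting `cv_rmse` against `depths` shows the error falling as the tree gains flexibility and then flattening or rising once it starts to overfit, which suggests the range worth passing to a finer grid search.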
- Decision Trees and Random Forests converged on the same number of Important Variables
- Random Forests resulted in a different order of Variable Importance. This is attributed to its Random Feature Selection at each split.
- Gradient Boosting incorporated the largest number of Important Variables. The Boosting process enables “weaker” Features to be included alongside the “stronger” Features
Ultimately, boosting combines weak learners to form a more Accurate and Robust Rule.
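Variable importances for the three model types can be pulled from scikit-learn's `feature_importances_` attribute, as in the sketch below. The synthetic data and the hyperparameter values are assumptions chosen for illustration:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 5))
# Feature 0 is strong, feature 1 moderate, feature 2 weak, the rest are noise
y = 3 * X[:, 0] + X[:, 1] + 0.2 * X[:, 2] + 0.1 * rng.normal(size=400)

models = {
    "tree": DecisionTreeRegressor(max_depth=3, random_state=0),
    "forest": RandomForestRegressor(n_estimators=100, random_state=0),
    "boosting": GradientBoostingRegressor(random_state=0),
}

# Impurity-based importances, normalized to sum to 1 within each model
importances = {name: m.fit(X, y).feature_importances_
               for name, m in models.items()}
```

Comparing the three arrays side by side shows how much weight each model places on the weaker features.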
Ensemble Model
After creating the linear and tree-based models above, we combined them in an ensemble to boost prediction accuracy and improve overall confidence in the predictions. Because the different models capture different trends in the dataset, the ensemble is more robust to each model family's weaknesses: outliers make predictions from linear models unstable, while tree models produce discrete predictions that cannot extrapolate beyond the range of target values seen in training.
A number of transformations and imputations were made to the dataset in this stage in addition to those made in the earlier stages before running the models in the ensemble. These included un-skewing all features with a skewness greater than 0.75 and removing outliers. These outliers were detected visually from the plots and confirmed using the Bonferroni outlier test.
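The un-skewing step could look like the sketch below. The 0.75 threshold is the one we used, while the choice of `np.log1p` as the transformation and the synthetic columns are assumptions:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "LotArea": rng.lognormal(mean=9, sigma=0.5, size=500),          # right-skewed
    "YearBuilt": rng.integers(1900, 2010, size=500).astype(float),  # roughly symmetric
})

skewness = df.skew()
skewed_cols = skewness[skewness.abs() > 0.75].index

# log1p handles zeros safely, which matters for area features that can be 0
df[skewed_cols] = np.log1p(df[skewed_cols])
```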
We used the StackingRegressor from the mlxtend package. It took the lasso, ridge and random forest models as first-level inputs and used lasso regression as the second-level model (meta-regressor). This is illustrated in the image below.
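The same stacking layout can be sketched with scikit-learn's `sklearn.ensemble.StackingRegressor`, which is swapped in here for the mlxtend class we actually used. The first-level alphas are the tuned values reported above. The meta-regressor's alpha, the forest size and the synthetic data are assumptions:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor, StackingRegressor
from sklearn.linear_model import Lasso, Ridge
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
# A linear part plus a non-linear part, so both model families contribute
y = X[:, 0] + 0.5 * X[:, 1] ** 2 + 0.1 * rng.normal(size=300)

X_sub, X_test, y_sub, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

stack = StackingRegressor(
    estimators=[
        ("ridge", Ridge(alpha=18.7)),
        ("lasso", Lasso(alpha=0.01)),
        ("forest", RandomForestRegressor(n_estimators=100, random_state=0)),
    ],
    final_estimator=Lasso(alpha=0.01),   # second-level model (meta-regressor)
    cv=5,                                # out-of-fold predictions feed the meta-regressor
)
stack.fit(X_sub, y_sub)
r2_test = stack.score(X_test, y_test)
```

The meta-regressor is trained on cross-validated predictions from the first-level models, so it learns how much to trust each one without seeing predictions made on their own training rows.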
Results
Category | Model | Test RMSLE | Kaggle RMSLE |
Linear Based | Multi-Linear | 0.1641 | 0.1788 |
Linear Based | Ridge Regression | 0.1110 | 0.1290 |
Linear Based | Lasso Regression | 0.1370 | 0.1414 |
Tree Based | Decision Trees | 0.1910 | 0.1882 |
Tree Based | Random Forests | 0.1358 | 0.1465 |
Tree Based | Gradient Boosting | 0.1184 | 0.1245 |
Ensemble | Ridge, Lasso, RF | 0.0949 | 0.1251 |
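For reference, RMSLE can be computed as in the sketch below. It uses the common `log(1 + x)` form of the metric, which is an assumption about the exact definition. For values as large as house prices the added 1 makes no practical difference:

```python
import numpy as np

def rmsle(y_true, y_pred):
    """Root mean squared logarithmic error between actual and predicted prices."""
    log_diff = np.log1p(y_pred) - np.log1p(y_true)
    return float(np.sqrt(np.mean(log_diff ** 2)))

# Hypothetical actual and predicted sale prices
actual = np.array([200000.0, 150000.0, 320000.0])
predicted = np.array([210000.0, 140000.0, 300000.0])
score = rmsle(actual, predicted)
```

Because the error is measured on the log scale, a 10 percent miss on a cheap house counts about the same as a 10 percent miss on an expensive one.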
Conclusion
Based on our experience with this project, we were able to gain valuable insights on the application of machine learning models. For example, we discovered that feature engineering and hyper-parameter tuning proved to be vital steps and can have a big impact on the end performance of the machine learning models. Also, Ridge regression and Lasso regression within the ensembling process generally outperformed the tree-based models. This is most likely due to the size and nature of the dataset, which appears to lend itself more to the application of regression models. The table in the Results section above compares the results of the various models that were employed in this project.