Studying Data to Predict Iowa Housing Prices
Team Members: Rajesh Arasada, Yung Cho, Nilesh Patel, Pankaj Sharma, Tim Waterman
As part of our curriculum at the NYC Data Science Academy 12-week bootcamp, our team entered the House Prices: Advanced Regression Techniques challenge on Kaggle. The dataset contains information on 1460 houses sold in Ames, Iowa between 2006 and 2010. The challenge is a supervised machine learning regression problem: our team was asked to learn the mapping function from the 80 input features to the output (‘SalePrice’), which is a real value.
Our goal was to develop a model that is both accurate, in that it predicts ‘SalePrice’ close to the true value, and interpretable, in that it helps buyers and sellers make informed decisions.
In this blog, our team will share the steps followed to build a predictive model in Python and R. Described below are some illustrative examples of the transformations applied.
Understanding the data
Since it is very cumbersome to explore all the features at once, our team broke up the task by dividing the data into smaller sub-sections of features for the purposes of examination. Our team paid particular attention to any feature that:
- contained redundant information because it is highly correlated with another feature
- had a high percentage of missing values
- had only one value, or a value that occurs with insignificant frequency in the dataset
Our team combined the train and test datasets for exploratory data analysis, data cleaning and feature engineering. Some of our basic takeaways are as follows:
Our team found that two sets of features, one describing the basement and the other the living area, carry redundant information:
'TotalBsmtSF' = 'BsmtFinSF1' + 'BsmtFinSF2' + 'BsmtUnfSF'
'GrLivArea' = '1stFlrSF' + '2ndFlrSF' + 'LowQualFinSF'
Hence, our team only retained 'TotalBsmtSF' and 'GrLivArea'.
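A minimal sketch of how this redundancy can be checked before dropping the component columns. The column names come from the Ames data, but the rows are made-up toy values, and this is an illustration of the idea rather than the code our team ran.

```python
import pandas as pd

# Toy rows using the Ames column names; the values are illustrative only.
df = pd.DataFrame({
    "BsmtFinSF1": [706, 978, 0],
    "BsmtFinSF2": [0, 0, 0],
    "BsmtUnfSF": [150, 284, 953],
    "TotalBsmtSF": [856, 1262, 953],
    "1stFlrSF": [856, 1262, 953],
    "2ndFlrSF": [854, 0, 694],
    "LowQualFinSF": [0, 0, 0],
    "GrLivArea": [1710, 1262, 1647],
})

bsmt_parts = ["BsmtFinSF1", "BsmtFinSF2", "BsmtUnfSF"]
living_parts = ["1stFlrSF", "2ndFlrSF", "LowQualFinSF"]

# Check that each total equals the row-wise sum of its components.
bsmt_is_sum = (df[bsmt_parts].sum(axis=1) == df["TotalBsmtSF"]).all()
living_is_sum = (df[living_parts].sum(axis=1) == df["GrLivArea"]).all()

# Only drop the components when the totals fully explain them.
if bsmt_is_sum and living_is_sum:
    df = df.drop(columns=bsmt_parts + living_parts)

print(df.columns.tolist())
```

Running the check first guards against rows where the identity does not hold, for example because of data-entry errors.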
Missing Data and Imputation
One of the challenges of this dataset is the missing data (Figure 1). More than 80% of the values are missing for variables like ‘Alley’, ‘Fence’, and ‘PoolQC’. In these columns, a missing value indicates that the house does not have the feature, so our team imputed the missing values with ‘None’.
All other features except ‘LotFrontage’ were imputed with the feature’s most frequently occurring value, or with a hardcoded value chosen according to what made the most sense in each situation.
To impute the 485 missing values in the ‘LotFrontage’ column, our team relied on the fancyimpute library and built a KNN model using all other predictors.
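The three imputation strategies above can be sketched as follows. Note the assumptions: the rows are toy values, ‘Electrical’ stands in for any mode-imputed column, and scikit-learn's `KNNImputer` is swapped in for the fancyimpute KNN model our team used, with only a few numeric predictors instead of all of them.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Toy rows with Ames column names; values are illustrative only.
df = pd.DataFrame({
    "Alley": [np.nan, "Grvl", np.nan, np.nan, "Pave", np.nan],
    "Electrical": ["SBrkr", "SBrkr", np.nan, "FuseA", "SBrkr", "SBrkr"],
    "LotArea": [8450.0, 9600.0, 11250.0, 9550.0, 14260.0, 14115.0],
    "GrLivArea": [1710.0, 1262.0, 1786.0, 1717.0, 2198.0, 1362.0],
    "LotFrontage": [65.0, 80.0, np.nan, 60.0, 84.0, np.nan],
})

# 1) Missing means "the house has no such feature": fill with the label 'None'.
df["Alley"] = df["Alley"].fillna("None")

# 2) Fill with the most frequent value (the mode) of the column.
df["Electrical"] = df["Electrical"].fillna(df["Electrical"].mode()[0])

# 3) KNN imputation: each missing LotFrontage becomes the average of the
#    values observed in the nearest rows (n_neighbors=2 is an arbitrary choice).
numeric_cols = ["LotArea", "GrLivArea", "LotFrontage"]
imputer = KNNImputer(n_neighbors=2)
df[numeric_cols] = imputer.fit_transform(df[numeric_cols])

print(df)
```

On real data it is usually worth scaling the predictors before the KNN step, since the distance is otherwise dominated by large-valued columns such as ‘LotArea’.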
Figure 1: Heatmap of the Ames housing features showing the missingness in the data (left) before and (right) after data cleaning. Yellow shading represents the missing data points. The yellow shade in the heatmap on the right highlights the missing SalePrice information in the test dataset.
To help increase the model’s performance, our team engineered the following two new features:
- ‘HouseAge’ = 2018 – YearBuilt
- ‘Bathrooms’ = BsmtFullBath + BsmtHalfBath * 0.5 + FullBath + HalfBath * 0.5
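As a short sketch, the two engineered features translate directly into pandas column arithmetic. The rows below are toy values, and the constant name `REFERENCE_YEAR` is introduced here only for illustration.

```python
import pandas as pd

REFERENCE_YEAR = 2018  # reference year used in the HouseAge formula above

# Toy rows with Ames column names; values are illustrative only.
df = pd.DataFrame({
    "YearBuilt": [2003, 1976],
    "BsmtFullBath": [1, 0],
    "BsmtHalfBath": [0, 1],
    "FullBath": [2, 2],
    "HalfBath": [1, 0],
})

# Age of the house relative to the reference year.
df["HouseAge"] = REFERENCE_YEAR - df["YearBuilt"]

# Total bathrooms, counting each half bath as 0.5.
df["Bathrooms"] = (
    df["BsmtFullBath"]
    + 0.5 * df["BsmtHalfBath"]
    + df["FullBath"]
    + 0.5 * df["HalfBath"]
)

print(df[["HouseAge", "Bathrooms"]])
```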
Treatment of categorical and ordinal variables
After all the features were created, our team used label encoding for the ordinal features and one-hot encoding for the remaining categorical features.
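A minimal sketch of the two encodings, assuming ‘ExterQual’ as an example ordinal feature and ‘MSZoning’ as an example nominal one. The explicit `quality_order` mapping is our illustration of label encoding that preserves the quality ranking; it is an assumption, not necessarily the exact encoder our team used.

```python
import pandas as pd

# Toy rows with Ames column names; values are illustrative only.
df = pd.DataFrame({
    "ExterQual": ["Gd", "TA", "Ex", "Fa"],
    "MSZoning": ["RL", "RM", "RL", "FV"],
})

# Ordinal feature: map each quality label to an integer that keeps the order
# Poor < Fair < Typical < Good < Excellent.
quality_order = {"Po": 0, "Fa": 1, "TA": 2, "Gd": 3, "Ex": 4}
df["ExterQual"] = df["ExterQual"].map(quality_order)

# Nominal feature: one indicator column per category (one-hot encoding).
df = pd.get_dummies(df, columns=["MSZoning"], dtype=int)

print(df)
```

An explicit mapping is used for the ordinal column because an alphabetical label encoder would not respect the quality ranking.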
Correlations between Features and Target
Our team removed highly collinear variables, because collinear features increase model complexity and decrease model generalization. To quantify the relationships between variables, our team plotted the correlation matrix of the dataset as cleaned so far. Some of the basic takeaways from our correlation matrix are as follows:
- ‘GarageCars’ and ‘GarageArea’ are highly correlated with each other and with 'SalePrice'. Since 'GarageCars' has a higher correlation with 'SalePrice' than 'GarageArea' (0.640409 vs 0.623431), our team dropped 'GarageArea'.
- Strong correlation is also seen between 'TotRmsAbvGrd' and 'GrLivArea'. 'GrLivArea' has a higher correlation with 'SalePrice' than 'TotRmsAbvGrd' (0.708624 vs 0.613581), hence our team dropped 'TotRmsAbvGrd'.
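The rule applied in both cases (within a highly correlated pair, keep the feature that correlates more strongly with the target) can be sketched as below. The helper name `drop_weaker_of_collinear_pairs`, the 0.8 threshold, and the toy values are assumptions made for illustration.

```python
import pandas as pd

def drop_weaker_of_collinear_pairs(df, target, threshold=0.8):
    """Return features to drop: for each feature pair whose absolute
    correlation exceeds `threshold`, pick the one less correlated with `target`."""
    corr = df.corr().abs()
    features = [c for c in df.columns if c != target]
    to_drop = set()
    for i, a in enumerate(features):
        for b in features[i + 1:]:
            if corr.loc[a, b] > threshold:
                weaker = a if corr.loc[a, target] < corr.loc[b, target] else b
                to_drop.add(weaker)
    return sorted(to_drop)

# Toy rows with Ames column names; values are illustrative only.
df = pd.DataFrame({
    "GarageCars": [1, 2, 2, 3, 1, 3],
    "GarageArea": [280, 480, 520, 840, 300, 800],
    "LotArea": [9000, 7000, 11000, 8000, 10000, 9500],
    "SalePrice": [125000, 190000, 190000, 255000, 125000, 255000],
})

dropped = drop_weaker_of_collinear_pairs(df, target="SalePrice")
df_reduced = df.drop(columns=dropped)
print(dropped)
```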
After narrowing down the variables most strongly correlated with the target variable ‘SalePrice’, our team removed extreme outliers from GrLivArea and TotalBsmtSF, since both are highly correlated with the target. In total, our team removed 7 observations (extreme outliers) whose values were
less than the first quartile - 3 * interquartile range, or greater than the third quartile + 3 * interquartile range
Figure 3: Scatter plots before and after the removal of extreme outliers in GrLivArea and TotalBsmtSF
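A minimal sketch of the 3 * IQR rule described above. The helper name `extreme_outlier_mask` and the toy values are assumptions for illustration; in the toy data only the last row is extreme.

```python
import pandas as pd

def extreme_outlier_mask(series, k=3.0):
    """True where a value lies below Q1 - k*IQR or above Q3 + k*IQR."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)

# Toy rows with Ames column names; values are illustrative only.
df = pd.DataFrame({
    "GrLivArea": [1200, 1400, 1500, 1600, 1700, 1800, 2000, 5642],
    "TotalBsmtSF": [800, 900, 1000, 1000, 1100, 1200, 1300, 6110],
})

# A row is removed if it is an extreme outlier in either variable.
mask = extreme_outlier_mask(df["GrLivArea"]) | extreme_outlier_mask(df["TotalBsmtSF"])
df_clean = df[~mask]

print(len(df), "->", len(df_clean))
```

Using a multiplier of 3 rather than the more common 1.5 keeps the filter conservative, so only the most extreme points are removed.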
Transformation of Target Variable
Our team plotted a histogram of the ‘SalePrice’ distribution and observed positive skewness, so we applied a log transformation to bring the values closer to a normal distribution.
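A short sketch of the effect of the transformation on skewness, using made-up prices with a long right tail. The natural log via `np.log` is an assumption here, since the post does not state which log variant was used.

```python
import numpy as np
import pandas as pd

# Toy right-skewed prices; values are illustrative only.
sale_price = pd.Series(
    [100000, 120000, 130000, 150000, 160000, 200000, 250000, 400000, 755000],
    name="SalePrice",
)

# Natural log compresses the long right tail.
log_price = np.log(sale_price)

print("skew before:", sale_price.skew())
print("skew after: ", log_price.skew())

# Predictions made on the log scale are mapped back with the exponential.
recovered = np.exp(log_price)
```

Because the model is then trained on the log scale, its predictions need to be exponentiated before they are read as dollar amounts.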
Figure 4: Density and probability plots of target variable before and after transformation
Machine Learning Data Set
Once all the features were created, our dataset had 356 features. Our team created multiple models to predict the sale prices of the houses in Iowa:
- Regularized Multiple Linear Regression
- Random Forest (GridSearch)
- Stochastic Gradient Boosting (GridSearch)
- XGBoost (Stepwise Tuning)
- LightGBM (Grid Search, Random Search & Bayesian Hyperparameter Optimization)
Our dataset was split randomly into an 80% train dataset and a 20% test dataset. Our team fit various models on the training dataset using 5-fold cross-validation to reduce selection bias and the variance in predictive power. We then used them to predict the outcomes of the held-out test dataset in order to assess the accuracy and variance of our different models.
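The evaluation protocol can be sketched with scikit-learn as follows. The synthetic data, the `Ridge` model, the RMSE metric, and the random seeds are placeholders chosen for illustration, not our actual configuration.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score, train_test_split

# Synthetic regression data standing in for the prepared housing features.
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 5))
y = X @ np.array([3.0, -2.0, 0.5, 0.0, 1.0]) + rng.normal(scale=0.1, size=200)

# Random 80% / 20% split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# 5-fold cross-validation on the training portion only.
cv = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(
    Ridge(alpha=1.0), X_train, y_train,
    cv=cv, scoring="neg_root_mean_squared_error",
)
cv_rmse = -scores.mean()

# Refit on all training data, then score once on the held-out 20%.
model = Ridge(alpha=1.0).fit(X_train, y_train)
test_rmse = float(np.sqrt(np.mean((model.predict(X_test) - y_test) ** 2)))

print(f"CV RMSE: {cv_rmse:.3f}  test RMSE: {test_rmse:.3f}")
```

Keeping the 20% test set out of cross-validation means the final score is computed on data that played no part in model selection.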
Multiple Linear Regression
We built a multivariate linear regression that included all the features in the dataset and used it as our baseline model. We also built three regularized linear regression models with alpha chosen by cross-validation. The Elastic Net model performed the best of all the models on the test data, as shown below. The figure below shows the top 15 features by importance in our LASSO model.
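A minimal sketch of choosing alpha by cross-validation and ranking features by the size of their coefficients, using scikit-learn's `LassoCV` and `ElasticNetCV`. The synthetic data, the hypothetical feature names `x0` to `x9`, and the `l1_ratio` candidates are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import ElasticNetCV, LassoCV

# Synthetic data: only the first three features truly affect the target.
rng = np.random.default_rng(0)
feature_names = [f"x{i}" for i in range(10)]  # hypothetical feature names
X = pd.DataFrame(rng.normal(size=(200, 10)), columns=feature_names)
true_coef = np.array([5.0, -3.0, 2.0] + [0.0] * 7)
y = X.to_numpy() @ true_coef + rng.normal(scale=0.5, size=200)

# Both estimators pick their regularization strength by 5-fold CV.
lasso = LassoCV(cv=5, random_state=0).fit(X, y)
enet = ElasticNetCV(l1_ratio=[0.1, 0.5, 0.9], cv=5, random_state=0).fit(X, y)

# Rank features by absolute LASSO coefficient (features share one scale here).
coef = pd.Series(lasso.coef_, index=feature_names)
top_features = coef.abs().sort_values(ascending=False).head(3)

print("LASSO alpha:", lasso.alpha_)
print("Elastic Net alpha, l1_ratio:", enet.alpha_, enet.l1_ratio_)
print(top_features)
```

Coefficient magnitudes are only comparable across features when the inputs are on a common scale, so real data would normally be standardized first.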
Our team chose Decision Trees as our base model and then employed some of the more popular machine learning algorithms such as Random Forest, Gradient Boosting Machines, XGBoost and LightGBM. These ensemble methods were chosen to compensate for the overfitting seen with single Decision Trees.
Our team optimized the hyperparameters using either grid search or Bayesian optimization.
Random Forest is an ensemble of Decision Trees, usually trained with the “bagging” method. The algorithm builds multiple decision trees, each on a bootstrap sample of the data with a random subset of features considered at each split, and averages their predictions to get a more accurate and stable result.
In Gradient Boosting Machines, new models are added sequentially to correct the errors made by the existing models until no further improvement can be made. A gradient descent procedure minimizes the loss as each new model is added. Both XGBoost and LightGBM are known for their execution speed (compared with conventional gradient boosting implementations) and their model performance.
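As a sketch of the grid search step, the snippet below tunes a Random Forest with scikit-learn's `GridSearchCV` and reads off the feature importances of the best model. The synthetic data and the small parameter grid are assumptions for illustration, not the grid our team searched.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic data: the first feature dominates the target.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = 4.0 * X[:, 0] + np.sin(X[:, 1]) + rng.normal(scale=0.1, size=200)

# Small illustrative grid; every combination is scored with 5-fold CV.
param_grid = {"n_estimators": [50, 100], "max_features": [0.5, 1.0]}
search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid,
    cv=5,
    scoring="neg_root_mean_squared_error",
)
search.fit(X, y)

# The best estimator is refit on all the data passed to fit().
best_forest = search.best_estimator_
importances = best_forest.feature_importances_

print("best params:", search.best_params_)
print("importances:", importances.round(3))
```

Grid search grows multiplicatively with each added parameter, which is one reason random search or Bayesian optimization becomes attractive for models with many hyperparameters.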
The figure below shows the feature importances from the Random Forest and LightGBM models. Overall, the feature importances are fairly similar between Random Forest and the Stochastic Gradient Boosting trees. LightGBM seems to be a slightly better predictor than Random Forest, and it gives more importance to the number of bedrooms than to 'OverallQual', which Random Forest ranks as the most important feature.
Finally, the table below summarizes the results from our different models:
|Multiple Linear Regression||0.156|
|Gradient Boosting Machine||0.128|
Please feel free to reach out to us if you have any questions or concerns. Thank you!