Studying Data to Predict Iowa Housing Prices

Rajesh Arasada, Yung Chou, Nilesh Patel, Pankaj Sharma and Tim Waterman

Posted on Jan 11, 2019

The skills the author demoed here can be learned through taking Data Science with Machine Learning bootcamp with NYC Data Science Academy.

Team Members: Rajesh Arasada, Yung Chou, Nilesh Patel, Pankaj Sharma, Tim Waterman

Problem Definition

As part of our curriculum at the NYC Data Science Academy 12-week bootcamp, our team entered the House Prices: Advanced Regression Techniques challenge on Kaggle. The dataset contained information on 1,460 houses sold in Ames, Iowa between 2006 and 2010. The challenge is a supervised machine learning regression problem: our team was asked to learn the mapping from the 80 input features to the output (‘SalePrice’), which is a real value.

Our goal was to develop a model that is both accurate, in that it predicts ‘SalePrice’ close to the true value, and interpretable, in that it helps buyers and sellers make informed decisions.

In this blog, our team will share the steps followed to build a predictive model in Python and R. Described below are some illustrative examples of the transformations applied.

Understanding the data

Since it is very cumbersome to explore all the features at once, our team broke up the task by dividing the data into smaller sub-sections of features for the purposes of examination. Our team paid particular attention to any feature that:

contained redundant information because it is highly correlated with other features
had a high percentage of missing values
had only one value, or had values that occur with insignificant frequency in the dataset

Our team combined the train and test datasets for exploratory data analysis, data cleaning and feature engineering. Some of our basic takeaways are as follows:

Redundant Features

Our team found that two sets of features, one describing the ‘basement’ and the other the ‘living area’, carry redundant information:

'TotalBsmtSF' = 'BsmtFinSF1' + 'BsmtFinSF2' + 'BsmtUnfSF'

'GrLivArea' = '1stFlrSF' + '2ndFlrSF' + 'LowQualFinSF'

Hence, our team only retained 'TotalBsmtSF' and 'GrLivArea'.
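A minimal sketch of how this redundancy could be checked before dropping the component columns, assuming the data sits in a pandas DataFrame. The column names come from the dataset; the three rows of values are made up for illustration.

```python
import pandas as pd

# Toy rows using the Ames column names; the values are invented.
df = pd.DataFrame({
    "BsmtFinSF1": [706, 978, 0],
    "BsmtFinSF2": [0, 0, 120],
    "BsmtUnfSF": [150, 284, 540],
    "TotalBsmtSF": [856, 1262, 660],
    "1stFlrSF": [856, 1262, 920],
    "2ndFlrSF": [854, 0, 866],
    "LowQualFinSF": [0, 0, 0],
    "GrLivArea": [1710, 1262, 1786],
})

bsmt_parts = ["BsmtFinSF1", "BsmtFinSF2", "BsmtUnfSF"]
living_parts = ["1stFlrSF", "2ndFlrSF", "LowQualFinSF"]

# Confirm that each total equals the sum of its components on every row.
bsmt_ok = (df[bsmt_parts].sum(axis=1) == df["TotalBsmtSF"]).all()
living_ok = (df[living_parts].sum(axis=1) == df["GrLivArea"]).all()

# Drop the components only when the totals fully account for them.
if bsmt_ok and living_ok:
    df = df.drop(columns=bsmt_parts + living_parts)

print(df.columns.tolist())
```

Running the check first guards against rows where the identity does not hold, which would make dropping the components lossy.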

Missing Data and Imputation

One of the challenges of this dataset is the missing data (Figure 1). Over 80% of the data is missing for variables like ‘Alley’, ‘Fence’, and ‘PoolQC’. For these variables, a missing value indicates that the feature is not present in the house, so our team imputed the missing values in these columns with ‘None’.

All other features except ‘LotFrontage’ were imputed either with the feature’s most frequently occurring value or with a hardcoded value, depending on what made the most sense in each situation.
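The two simple strategies above might look like the following in pandas. This is only a sketch: the toy values are invented, and treating ‘Electrical’ as a mode-imputed column is an assumption made for illustration, not something stated in this post.

```python
import numpy as np
import pandas as pd

# Toy frame; values are invented for illustration.
df = pd.DataFrame({
    "Alley": [np.nan, "Grvl", np.nan, "Pave"],
    "PoolQC": [np.nan, np.nan, "Gd", np.nan],
    "Electrical": ["SBrkr", np.nan, "SBrkr", "FuseA"],
})

# Missing means "feature absent" for these columns, so use an explicit label.
absent_cols = ["Alley", "PoolQC"]
df[absent_cols] = df[absent_cols].fillna("None")

# Assumed example of a column where the value is simply unknown:
# fall back to the most frequent value (the mode).
mode_cols = ["Electrical"]
for col in mode_cols:
    df[col] = df[col].fillna(df[col].mode()[0])

print(df)
```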

To impute the 485 missing values in the ‘LotFrontage’ column, our team relied on the fancyimpute library and built a KNN model using all other predictors.
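For readers who want to try KNN imputation, here is a rough sketch. Note that it swaps in scikit-learn's KNNImputer rather than fancyimpute, uses only two predictors instead of all of them, and runs on invented values. The choice of n_neighbors=2 is also an assumption.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Toy frame with one missing LotFrontage value; values are invented.
df = pd.DataFrame({
    "LotArea": [8450.0, 9600.0, 11250.0, 9550.0, 14260.0],
    "GrLivArea": [1710.0, 1262.0, 1786.0, 1717.0, 2198.0],
    "LotFrontage": [65.0, 80.0, np.nan, 60.0, 84.0],
})

# Each missing entry is replaced by the mean of that column over the
# nearest rows, with distances computed on the observed columns.
imputer = KNNImputer(n_neighbors=2)
imputed = pd.DataFrame(
    imputer.fit_transform(df), columns=df.columns, index=df.index
)

print(imputed["LotFrontage"])
```

In practice the predictors would usually be scaled first, since otherwise large-valued columns such as LotArea dominate the distance calculation.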

Figure 1: Heatmap of the Ames housing features showing the missingness in the data (left) before and (right) after data cleaning. Yellow shading represents the missing data points. The yellow shade in the heatmap on the right highlights the missing SalePrice information in the test dataset.

Feature Transformations

Our team engineered the following two new features to help improve the model’s performance:

‘HouseAge’ = 2018 - YearBuilt
‘Bathrooms’ = BsmtFullBath + BsmtHalfBath * 0.5 + FullBath + HalfBath * 0.5
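The two formulas translate directly into pandas. The sketch below uses three invented rows; only the column names and the formulas come from this post.

```python
import pandas as pd

# Toy rows; values are invented for illustration.
df = pd.DataFrame({
    "YearBuilt": [2003, 1976, 1915],
    "BsmtFullBath": [1, 0, 1],
    "BsmtHalfBath": [0, 1, 0],
    "FullBath": [2, 2, 1],
    "HalfBath": [1, 0, 0],
})

REFERENCE_YEAR = 2018  # the reference year used in the HouseAge formula

df["HouseAge"] = REFERENCE_YEAR - df["YearBuilt"]

# Half baths count as 0.5 of a bathroom, above and below grade.
df["Bathrooms"] = (
    df["BsmtFullBath"]
    + df["BsmtHalfBath"] * 0.5
    + df["FullBath"]
    + df["HalfBath"] * 0.5
)

print(df[["HouseAge", "Bathrooms"]])
```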

Treatment of categorical and ordinal variables

After all the features were created, our team used label encoding for the ordinal features and one-hot encoding for the categorical features.
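As an illustration of the two encodings, the sketch below maps an ordinal quality column to integers and one-hot encodes a nominal column. The explicit quality_order mapping is an assumption chosen so that the integers respect the quality ranking; it is not necessarily the encoding our team used, and the rows are invented.

```python
import pandas as pd

# Toy rows; values are invented for illustration.
df = pd.DataFrame({
    "ExterQual": ["Gd", "TA", "Ex", "Fa"],
    "Neighborhood": ["CollgCr", "Veenker", "CollgCr", "NoRidge"],
})

# Ordinal feature: map quality codes to integers that preserve their order.
quality_order = {"Po": 1, "Fa": 2, "TA": 3, "Gd": 4, "Ex": 5}
df["ExterQual"] = df["ExterQual"].map(quality_order)

# Nominal feature: one 0/1 indicator column per category.
df = pd.get_dummies(df, columns=["Neighborhood"], dtype=int)

print(df)
```

One-hot encoding is the main reason the feature count grows so much after preprocessing, since every category level becomes its own column.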

Correlations between Features and Target

Our team removed highly collinear variables, since collinear features increase model complexity and hurt generalization. To quantify the relationships between variables, our team plotted the correlation matrix of the cleaned dataset (Figure 2). Some of the basic takeaways from the correlation matrix are as follows:

'GarageCars' and 'GarageArea' are highly correlated with each other and with 'SalePrice'. Since 'GarageCars' has a higher correlation with 'SalePrice' than 'GarageArea' does (0.640409 vs 0.623431), our team dropped 'GarageArea'.
A strong correlation is also seen between 'TotRmsAbvGrd' and 'GrLivArea'. 'GrLivArea' has a higher correlation with 'SalePrice' than 'TotRmsAbvGrd' does (0.708624 vs 0.613581), hence our team dropped 'TotRmsAbvGrd'.
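The rule applied in both cases (for a collinear pair, keep the feature that correlates more strongly with the target) can be sketched as below. The helper name weaker_of_pair, the 0.8 threshold, and the toy values are all assumptions made for illustration.

```python
import pandas as pd

# Toy data; values are invented (SalePrice in thousands).
df = pd.DataFrame({
    "GarageCars": [1, 2, 2, 3, 1, 2, 3, 0],
    "GarageArea": [280, 480, 520, 640, 300, 440, 840, 0],
    "SalePrice": [130, 180, 200, 260, 120, 190, 270, 90],
})

def weaker_of_pair(data, a, b, target, threshold=0.8):
    """Return the feature to drop from a collinear pair, or None.

    The pair counts as collinear when |corr(a, b)| >= threshold; the
    feature with the weaker correlation to the target is returned.
    """
    corr = data[[a, b, target]].corr()
    if abs(corr.loc[a, b]) < threshold:
        return None
    if abs(corr.loc[a, target]) < abs(corr.loc[b, target]):
        return a
    return b

to_drop = weaker_of_pair(df, "GarageCars", "GarageArea", "SalePrice")
print(to_drop)
```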

Figure 2: Correlation Matrix showing relationship between features

Outlier Treatment

After narrowing down the variables most strongly related to the target variable ‘SalePrice’, our team removed extreme outliers from GrLivArea and TotalBsmtSF, since both are highly correlated with the target. In total, our team removed 7 observations (extreme outliers) whose values were

less than the first quartile - 3 * interquartile range, or greater than the third quartile + 3 * interquartile range

Figure 3: Scatter plots before and after the removal of extreme outliers in GrLivArea and TotalBsmtSF
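The 3 * IQR rule described above can be written as a small helper. This is a sketch on invented values; the function name extreme_outlier_mask is hypothetical.

```python
import pandas as pd

def extreme_outlier_mask(values: pd.Series, k: float = 3.0) -> pd.Series:
    """Flag values below Q1 - k*IQR or above Q3 + k*IQR."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)

# Toy column with one extreme value; values are invented.
df = pd.DataFrame(
    {"GrLivArea": [1000, 1200, 1400, 1500, 1600, 1700, 1800, 2000, 5600]}
)

mask = extreme_outlier_mask(df["GrLivArea"])
cleaned = df[~mask]

print(f"removed {int(mask.sum())} row(s), kept {len(cleaned)}")
```

Using k = 3 instead of the more common 1.5 keeps the filter conservative, so only the most extreme observations are dropped.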

Transformation of Target Variable

Our team plotted a histogram of the ‘SalePrice’ distribution and observed positive skewness, so we applied a log transformation to bring the values closer to a normal distribution.
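A small sketch of the transformation and its inverse, on invented prices. The natural log is assumed here; predictions made on the log scale need to be exponentiated to get back to dollar amounts.

```python
import numpy as np
import pandas as pd

# Toy right-skewed prices; values are invented.
prices = pd.Series(
    [90000, 120000, 130000, 150000, 180000, 200000, 260000, 755000],
    name="SalePrice",
)

log_prices = np.log(prices)

# Sample skewness before and after the transformation.
skew_before = prices.skew()
skew_after = log_prices.skew()

# Invert the transformation to recover prices in the original units.
recovered = np.exp(log_prices)

print(f"skew before: {skew_before:.2f}, after: {skew_after:.2f}")
```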

Figure 4: Density and probability plots of target variable before and after transformation

Machine Learning Data Set

Once all the features were created, our dataset had 356 features. Our team built the following models to predict the sale prices of the houses in Iowa:

Regularized Multiple Linear Regression
Random Forest (GridSearch)
Stochastic Gradient Boosting (GridSearch)
XGBoost (Stepwise Tuning)
LightGBM (Grid Search, Random Search & Bayesian Hyperparameter Optimization)

Our dataset was split randomly into an 80% train dataset and a 20% test dataset. Our team fit the various models on the training dataset using 5-fold cross-validation to reduce selection bias and the variance of the performance estimates. We then used the fitted models to predict the outcomes on the held-out test dataset in order to assess the accuracy and variance of our different models.
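The split-then-cross-validate workflow might be sketched as follows with scikit-learn. Synthetic data from make_regression stands in for the housing features, and Ridge with alpha=1.0 is only a placeholder model; both are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-in for the prepared feature matrix and target.
X, y = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=0)

# 80% train / 20% held-out test.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

model = Ridge(alpha=1.0)

# 5-fold cross-validation on the training portion only.
cv_scores = cross_val_score(
    model, X_train, y_train, cv=5, scoring="neg_root_mean_squared_error"
)
cv_rmse = -cv_scores  # scikit-learn reports negated RMSE

# Final check on the held-out test set.
model.fit(X_train, y_train)
test_rmse = float(np.sqrt(np.mean((model.predict(X_test) - y_test) ** 2)))

print(f"CV RMSE: {cv_rmse.mean():.2f} +/- {cv_rmse.std():.2f}, test RMSE: {test_rmse:.2f}")
```

Keeping the test set out of cross-validation means the final RMSE is measured on rows the model never saw during tuning.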

Multiple Linear Regression

We built a multiple linear regression model including all the features in the dataset and used it as our baseline model. We then built three regularized linear regression models (LASSO, Ridge and Elastic Net) with alpha chosen by cross-validation. The LASSO and Elastic Net models performed best on the test data, as shown in the results table at the end of this post. The figure below shows the top 15 features by importance in our LASSO model.

Figure 5: Feature importance from LASSO
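A sketch of fitting the three regularized models with alpha chosen by cross-validation, using scikit-learn's LassoCV, RidgeCV and ElasticNetCV. The synthetic data, the alpha and l1_ratio grids, and the standardization step are assumptions for illustration rather than our exact setup.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNetCV, LassoCV, RidgeCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: 20 features, only 5 of which carry signal.
X, y = make_regression(
    n_samples=200, n_features=20, n_informative=5, noise=10.0, random_state=0
)

models = {
    "LASSO": LassoCV(cv=5, random_state=0),
    "Ridge": RidgeCV(alphas=[0.1, 1.0, 10.0, 100.0], cv=5),
    "ElasticNet": ElasticNetCV(l1_ratio=[0.1, 0.5, 0.9], cv=5, random_state=0),
}

fitted = {}
chosen_alpha = {}
for name, estimator in models.items():
    # Standardize first so the penalty treats all features equally.
    pipeline = make_pipeline(StandardScaler(), estimator)
    pipeline.fit(X, y)
    fitted[name] = pipeline
    chosen_alpha[name] = float(pipeline.steps[-1][1].alpha_)

# LASSO sets some coefficients exactly to zero, which gives a simple
# notion of which features were selected.
lasso_coefs = fitted["LASSO"].steps[-1][1].coef_
n_selected = int((lasso_coefs != 0).sum())

print(chosen_alpha, n_selected)
```

Ranking features by the absolute value of the standardized LASSO coefficients is one common way to produce an importance plot like Figure 5.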

Tree-based Models

Our team chose a Decision Tree as the base model and then employed some of the more popular tree-based machine learning algorithms: Random Forest, Gradient Boosting Machines, XGBoost and LightGBM. These ensemble models were chosen to compensate for the overfitting seen with single Decision Trees.

Our team optimized the hyperparameters using either grid search or Bayesian optimization. Random Forest is an ensemble of Decision Trees, often trained with the “bagging” method. The Random Forest algorithm builds multiple decision trees, each considering a random subset of features, and merges them together to get a more accurate and stable prediction. In Gradient Boosting Machines, new models are added sequentially to correct the errors made by the existing models until no further improvements can be made. They use the gradient descent algorithm to minimize the loss when adding new models. Both XGBoost and LightGBM are known for their execution speed and model performance.
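As a rough illustration of grid search over tree ensembles, the sketch below tunes scikit-learn's RandomForestRegressor and GradientBoostingRegressor on synthetic data. The parameter grids are deliberately tiny and are assumptions, not the grids our team searched. XGBoost and LightGBM are left out so the example depends only on scikit-learn.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the prepared training data.
X, y = make_regression(n_samples=150, n_features=8, noise=10.0, random_state=0)

searches = {
    # Bagging: many trees on bootstrap samples, random feature subsets per split.
    "Random Forest": GridSearchCV(
        RandomForestRegressor(n_estimators=50, random_state=0),
        param_grid={"max_depth": [3, None], "max_features": [0.5, 1.0]},
        cv=5,
        scoring="neg_root_mean_squared_error",
    ),
    # Boosting: shallow trees added sequentially; subsample < 1 makes it stochastic.
    "Gradient Boosting": GridSearchCV(
        GradientBoostingRegressor(n_estimators=100, subsample=0.8, random_state=0),
        param_grid={"learning_rate": [0.05, 0.1], "max_depth": [2, 3]},
        cv=5,
        scoring="neg_root_mean_squared_error",
    ),
}

best = {}
for name, search in searches.items():
    search.fit(X, y)
    # best_score_ is the negated RMSE, so flip the sign for reporting.
    best[name] = (search.best_params_, -search.best_score_)

for name, (params, rmse) in best.items():
    print(f"{name}: {params}, CV RMSE = {rmse:.2f}")
```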

The figure below shows the feature importances from the Random Forest and LightGBM models. Overall, the feature importances are fairly similar between Random Forest and Stochastic Gradient Boosting trees. LightGBM is a somewhat better predictor than Random Forest; it gives more importance to the number of bedrooms than to OverallQual, which Random Forest ranks as the most important feature.

Figure 6: Feature Importances from Tree-based models

Finally, the table below summarizes the results from all of our models:

Model	RMSE
Multiple Linear Regression	0.156
LASSO	0.125
Ridge	0.138
ElasticNet	0.127
Random Forest	0.149
Gradient Boosting Machine	0.128
LightGBM	0.129
XGBoost	0.140
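For context on the scale of these numbers: the post does not state it explicitly, but since the target was log-transformed and the Kaggle competition scores submissions on the RMSE between the logarithms of predicted and actual prices, the values above are assumed to be on the log scale. A minimal sketch of that metric, with invented prices and a hypothetical helper name rmse_log:

```python
import numpy as np

def rmse_log(actual_prices, predicted_prices):
    """RMSE between the natural logs of actual and predicted prices."""
    log_actual = np.log(np.asarray(actual_prices, dtype=float))
    log_pred = np.log(np.asarray(predicted_prices, dtype=float))
    return float(np.sqrt(np.mean((log_pred - log_actual) ** 2)))

# Invented example: one prediction 10% high, one 10% low, one exact.
actual = [100000, 200000, 300000]
predicted = [110000, 180000, 300000]

score = rmse_log(actual, predicted)
print(round(score, 3))
```

On this scale an error is relative rather than absolute, so missing a cheap house by 10% costs about as much as missing an expensive house by 10%.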

Please feel free to reach out to us if you have any questions or concerns. Thank you!

About Authors

Rajesh Arasada

Data scientist and cell biologist with >10 years of bio-medical research experience. Implemented Machine learning (ML) algorithms in R and Python to solve real-world problems.
