# Predicting Iowa Housing Prices

Team Members: Rajesh Arasada, Yung Cho, Nilesh Patel, Pankaj Sharma, Tim Waterman

# Problem Definition

As part of our curriculum at the NYC Data Science Academy 12-week bootcamp, our team entered the House Prices: Advanced Regression Techniques challenge on Kaggle. The dataset contains information on 1460 houses sold in Ames, Iowa between 2006 and 2010. The challenge is a supervised machine learning regression problem: our team was asked to learn the mapping from the 80 input features to the output ("SalePrice"), which is a real value. Our goal was to develop a model that is both accurate, in that it predicts a sale price close to the true value, and interpretable, in that it helps buyers and sellers make informed decisions.

In this blog post, our team shares the steps we followed to build a predictive model in Python and R. Described below are some illustrative examples of the transformations we applied.

# Understanding the data

Since it is cumbersome to explore all the features at once, our team divided the data into smaller sub-sections of features and examined each in turn. We paid particular attention to any feature that:

• carried redundant information because it was highly correlated with other features
• had a high percentage of missing values
• had only one value, or had values that occur with negligible frequency in the dataset

Our team combined the train and test datasets for exploratory data analysis, data cleaning and feature engineering. Some of our basic takeaways are as follows:

### Redundant Features

Our team found that two sets of features, one describing the basement and the other the living area, carry redundant information:

'TotalBsmtSF' = 'BsmtFinSF1' + 'BsmtFinSF2' + 'BsmtUnfSF'

'GrLivArea' = '1stFlrSF' + '2ndFlrSF' + 'LowQualFinSF'

Hence, our team only retained 'TotalBsmtSF' and 'GrLivArea'.
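A minimal sketch of how this redundancy could be checked before dropping the component columns. The rows below are made-up toy values, and the pandas-based check is an assumption about tooling, not a record of the code our team ran:

```python
import pandas as pd

# Toy rows with made-up values; column names follow the Ames dataset.
df = pd.DataFrame({
    "BsmtFinSF1": [706, 978, 0],
    "BsmtFinSF2": [0, 0, 32],
    "BsmtUnfSF": [150, 284, 540],
    "TotalBsmtSF": [856, 1262, 572],
    "1stFlrSF": [856, 1262, 920],
    "2ndFlrSF": [854, 0, 866],
    "LowQualFinSF": [0, 0, 0],
    "GrLivArea": [1710, 1262, 1786],
})

basement_parts = ["BsmtFinSF1", "BsmtFinSF2", "BsmtUnfSF"]
living_parts = ["1stFlrSF", "2ndFlrSF", "LowQualFinSF"]

# Confirm that each total equals the sum of its components on every row.
bsmt_ok = (df[basement_parts].sum(axis=1) == df["TotalBsmtSF"]).all()
living_ok = (df[living_parts].sum(axis=1) == df["GrLivArea"]).all()

# Only drop the components once the identity has been verified.
if bsmt_ok and living_ok:
    df = df.drop(columns=basement_parts + living_parts)
```

Verifying the identity first guards against silently discarding information if a few rows do not add up.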

### Missing Data and Imputation

One of the challenges of this dataset is the missing data (Figure 1). Over 80% of the values are missing for variables like "Alley", "Fence", and "PoolQC". For these features, a missing value indicates that the feature is not present in the house, so our team imputed the missing values in these columns with "None".

All other features except "LotFrontage" were imputed with the feature's most frequently occurring value, or with a hardcoded value chosen according to what made the most sense in each situation.

To impute the 485 missing values in the "LotFrontage" column, our team relied on the fancyimpute library and built a KNN model using all other predictors.
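The three imputation strategies above can be sketched as follows. Note the substitution: this example uses scikit-learn's `KNNImputer` as a stand-in for the fancyimpute KNN model, and it uses only one numeric predictor rather than all of them. The toy values and the choice of `n_neighbors=2` are assumptions for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Toy rows with made-up values; column names follow the Ames dataset.
df = pd.DataFrame({
    "Alley": [np.nan, "Pave", np.nan, np.nan],
    "Electrical": ["SBrkr", np.nan, "SBrkr", "FuseA"],
    "LotArea": [8450.0, 9600.0, 11250.0, 9550.0],
    "LotFrontage": [65.0, 80.0, np.nan, 60.0],
})

# 1) Missing means "feature not present": fill with the string "None".
df["Alley"] = df["Alley"].fillna("None")

# 2) Fill with the most frequently occurring value (the mode).
df["Electrical"] = df["Electrical"].fillna(df["Electrical"].mode()[0])

# 3) KNN imputation on numeric columns (stand-in for fancyimpute's KNN).
numeric_cols = ["LotArea", "LotFrontage"]
imputer = KNNImputer(n_neighbors=2)
df[numeric_cols] = imputer.fit_transform(df[numeric_cols])
```

In practice the numeric predictors would usually be scaled before a distance-based imputer is applied, so that large-valued columns such as lot area do not dominate the neighbor search.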

Figure 1: Heatmap of the Ames housing features showing the missingness in the data (left) before and (right) after data cleaning. Yellow shading represents missing data points. The yellow shade remaining in the heatmap on the right corresponds to the missing SalePrice values in the test dataset.

### Feature Transformations

Our team engineered new features that can help improve the model's performance. We created the following two features:

• "HouseAge" = 2018 - YearBuilt
• "Bathrooms" = BsmtFullBath + BsmtHalfBath * 0.5 + FullBath + HalfBath * 0.5
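As a rough illustration, the two engineered features could be computed with pandas as shown here. The rows are made-up toy values and `REFERENCE_YEAR` is a name introduced only for this sketch:

```python
import pandas as pd

# Toy rows with made-up values; column names follow the Ames dataset.
df = pd.DataFrame({
    "YearBuilt": [2003, 1976],
    "BsmtFullBath": [1, 0],
    "BsmtHalfBath": [0, 1],
    "FullBath": [2, 2],
    "HalfBath": [1, 0],
})

REFERENCE_YEAR = 2018  # year used in the HouseAge formula above

# Age of the house relative to the reference year.
df["HouseAge"] = REFERENCE_YEAR - df["YearBuilt"]

# Total bathrooms, counting each half bath as 0.5.
df["Bathrooms"] = (
    df["BsmtFullBath"]
    + 0.5 * df["BsmtHalfBath"]
    + df["FullBath"]
    + 0.5 * df["HalfBath"]
)
```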

### Treatment of categorical and ordinal variables

After all the features were created, our team used label encoding for the ordinal features and one-hot encoding for the categorical features.
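A small sketch of both encodings with pandas. The explicit `quality_order` mapping is an assumption made for this example: it keeps the ordinal levels in their natural order, whereas an automatic label encoder would typically order them alphabetically. The rows are made-up toy values:

```python
import pandas as pd

# Toy rows with made-up values; column names follow the Ames dataset.
df = pd.DataFrame({
    "ExterQual": ["Gd", "TA", "Ex", "Fa"],
    "Neighborhood": ["CollgCr", "Veenker", "CollgCr", "NoRidge"],
})

# Ordinal feature: map quality levels to integers that preserve their order
# (Poor < Fair < Typical/Average < Good < Excellent).
quality_order = {"Po": 0, "Fa": 1, "TA": 2, "Gd": 3, "Ex": 4}
df["ExterQual"] = df["ExterQual"].map(quality_order)

# Nominal feature: one-hot encode into one indicator column per category.
df = pd.get_dummies(df, columns=["Neighborhood"], dtype=int)
```

One-hot encoding is what drives the feature count up: every category of every nominal feature becomes its own column.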

### Correlations between Features and Target

Our team removed highly collinear variables, since collinear features increase model complexity and reduce generalization. To quantify the relationships between variables, we plotted the correlation matrix of the dataset as cleaned so far. Some of the basic takeaways from the correlation matrix are as follows:

• 'GarageCars' and 'GarageArea' are highly correlated with each other and with 'SalePrice'. Since 'GarageCars' has a higher correlation with 'SalePrice' than 'GarageArea' (0.640409 vs 0.623431), our team dropped 'GarageArea'.
• A strong correlation is also seen between 'TotRmsAbvGrd' and 'GrLivArea'. 'GrLivArea' has a higher correlation with 'SalePrice' than 'TotRmsAbvGrd' (0.708624 vs 0.613581), hence our team dropped 'TotRmsAbvGrd'.

Figure 2: Correlation matrix showing the relationships between features
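The rule applied above (for each highly correlated pair, keep the feature that correlates more strongly with the target) can be sketched as a small helper. The function name, the 0.8 threshold and the synthetic data are all assumptions made for this illustration:

```python
import numpy as np
import pandas as pd


def drop_weaker_of_correlated_pairs(df, target, threshold=0.8):
    """Drop, from each highly correlated feature pair, the feature that is
    less correlated with the target. Returns (reduced frame, dropped names)."""
    corr = df.corr().abs()
    features = [c for c in df.columns if c != target]
    to_drop = set()
    for i, a in enumerate(features):
        for b in features[i + 1:]:
            if a in to_drop or b in to_drop:
                continue
            if corr.loc[a, b] > threshold:
                weaker = a if corr.loc[a, target] < corr.loc[b, target] else b
                to_drop.add(weaker)
    dropped = sorted(to_drop)
    return df.drop(columns=dropped), dropped


# Synthetic data: GarageArea is a noisy copy of GarageCars, and the
# target depends directly on GarageCars.
rng = np.random.default_rng(0)
cars = rng.integers(0, 4, size=200)
data = pd.DataFrame({
    "GarageCars": cars,
    "GarageArea": 250 * cars + rng.normal(0, 100, size=200),
    "LotArea": rng.normal(10000, 2000, size=200),
    "SalePrice": 20000 * cars + rng.normal(0, 5000, size=200),
})

reduced, dropped = drop_weaker_of_correlated_pairs(data, target="SalePrice")
```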

### Outlier Treatment

After narrowing down the variables most strongly correlated with the target variable "SalePrice", our team removed extreme outliers from GrLivArea and TotalBsmtSF. In total, we removed 7 observations that lay below Q1 - 3 * IQR or above Q3 + 3 * IQR, where Q1 and Q3 are the first and third quartiles and IQR is the interquartile range.

Figure 3: Scatter plots before and after the removal of extreme outliers in GrLivArea and TotalBsmtSF
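A minimal sketch of the 3 * IQR rule described above, assuming pandas. The helper name `remove_extreme_outliers` and the toy values are hypothetical:

```python
import pandas as pd


def remove_extreme_outliers(df, columns, k=3.0):
    """Keep only rows whose values lie within [Q1 - k*IQR, Q3 + k*IQR]
    for every listed column. Rows with missing values are also dropped."""
    keep = pd.Series(True, index=df.index)
    for col in columns:
        q1 = df[col].quantile(0.25)
        q3 = df[col].quantile(0.75)
        iqr = q3 - q1
        keep &= df[col].between(q1 - k * iqr, q3 + k * iqr)
    return df[keep]


# Toy rows with made-up values; the last GrLivArea value is an extreme outlier.
df = pd.DataFrame({
    "GrLivArea": [1200, 1400, 1500, 1600, 1700, 1800, 5600],
    "TotalBsmtSF": [800, 900, 950, 1000, 1100, 1200, 1150],
})

cleaned = remove_extreme_outliers(df, ["GrLivArea", "TotalBsmtSF"])
```

Using a multiplier of 3 rather than the more common 1.5 removes only the most extreme points, which keeps the loss of training data small.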

### Transformation of Target Variable

Our team plotted a histogram of the "SalePrice" distribution and observed positive skewness, so we applied a log transformation to bring the values closer to a normal distribution.

Figure 4: Density and probability plots of target variable before and after transformation
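The effect of the log transformation can be illustrated on a handful of made-up prices. This sketch assumes the natural log via `np.log`; a model trained on the transformed target predicts log prices, which `np.exp` converts back to dollars:

```python
import numpy as np
import pandas as pd

# Made-up prices with one expensive house producing a long right tail.
sale_price = pd.Series(
    [129500, 143000, 181500, 208500, 250000, 755000], name="SalePrice"
)

# Log-transform the target to reduce the positive skew.
log_price = np.log(sale_price)

skew_before = sale_price.skew()
skew_after = log_price.skew()

# Invert the transformation to get prices back on the original scale.
recovered = np.exp(log_price)
```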

# Machine Learning

Once all the features were created, our dataset had 356 features. Our team built multiple models to predict the sale prices of the houses in Ames, Iowa:

1. Regularized Multiple Linear Regression
2. Random Forest (Grid Search)
3. Gradient Boosting Machine
4. XGBoost (Stepwise Tuning)
5. LightGBM (Grid Search, Random Search & Bayesian Hyperparameter Optimization)

Our dataset was split randomly into an 80% train dataset and a 20% test dataset. Our team fit the various models on the training dataset using 5-fold cross-validation to reduce selection bias and the variance in predictive power. We then used the fitted models to predict the outcomes of the held-out test dataset in order to assess the accuracy and variance of our different models.
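A rough sketch of this evaluation setup with scikit-learn. The synthetic data, the Ridge model and its `alpha` are placeholders chosen for the example, not the models or settings our team used:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic regression data standing in for the prepared feature matrix.
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 5))
y = X @ np.array([0.5, -0.2, 0.1, 0.0, 0.3]) + rng.normal(scale=0.1, size=200)

# Random 80% / 20% split into train and held-out test data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# 5-fold cross-validation on the training data only.
model = Ridge(alpha=1.0)
cv_scores = cross_val_score(
    model, X_train, y_train, cv=5, scoring="neg_root_mean_squared_error"
)
cv_rmse = -cv_scores  # scikit-learn reports negated RMSE

# Refit on the full training set and score once on the held-out test set.
model.fit(X_train, y_train)
test_rmse = float(np.sqrt(np.mean((model.predict(X_test) - y_test) ** 2)))
```

Keeping the test set out of cross-validation means the final RMSE is measured on data that played no part in model selection.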

### Multiple Linear Regression

We built a multiple linear regression model including all the features in the dataset and used it as our baseline model. We then built three regularized linear regression models (LASSO, Ridge and Elastic Net) with alpha chosen by cross-validation. Of these, LASSO and Elastic Net performed best on the test data, with RMSEs of 0.125 and 0.127 respectively (see the summary table at the end of this post). The figure below shows the top 15 features by importance in our LASSO model.

Figure 5: Feature importance from LASSO
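One way to choose alpha by cross-validation and rank features by the size of their LASSO coefficients is sketched here with scikit-learn's `LassoCV`, `RidgeCV` and `ElasticNetCV`. The synthetic data, the alpha grid and `l1_ratio=0.5` are assumptions made for illustration:

```python
import numpy as np
from sklearn.linear_model import ElasticNetCV, LassoCV, RidgeCV
from sklearn.preprocessing import StandardScaler

# Synthetic data: only features 0, 2 and 5 truly influence the target.
rng = np.random.default_rng(0)
feature_names = [f"feature_{i}" for i in range(8)]
X = rng.normal(size=(300, 8))
true_coef = np.array([0.4, 0.0, -0.3, 0.0, 0.0, 0.2, 0.0, 0.0])
y = X @ true_coef + rng.normal(scale=0.05, size=300)

# Standardize so that coefficient magnitudes are comparable across features.
X_scaled = StandardScaler().fit_transform(X)

models = {
    "LASSO": LassoCV(cv=5, random_state=0),
    "Ridge": RidgeCV(alphas=np.logspace(-3, 3, 13), cv=5),
    "ElasticNet": ElasticNetCV(l1_ratio=0.5, cv=5, random_state=0),
}

chosen_alpha = {}
for name, model in models.items():
    model.fit(X_scaled, y)
    chosen_alpha[name] = model.alpha_  # alpha selected by cross-validation

# Rank features by the absolute value of their LASSO coefficients.
lasso_coef = models["LASSO"].coef_
ranking = sorted(
    zip(feature_names, lasso_coef), key=lambda pair: abs(pair[1]), reverse=True
)
top_features = [name for name, _ in ranking[:3]]
```

Because LASSO shrinks uninformative coefficients toward zero, the surviving non-zero coefficients double as a simple importance ranking.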

### Tree-based Models

Our team chose Decision Trees as our base model and then employed some of the more popular ensemble algorithms: Random Forest, Gradient Boosting Machines, XGBoost and LightGBM. These ensemble methods help compensate for the overfitting seen with individual Decision Trees.

Our team optimized the hyperparameters using either grid search or Bayesian optimization. Random Forest is an ensemble of Decision Trees, often trained with the "bagging" method: the algorithm builds multiple decision trees on bootstrap samples of the data, considering a random subset of features at each split, and averages their predictions to get a more accurate and stable result. In Gradient Boosting Machines, new models are added sequentially to correct the errors made by the existing models until no further improvement can be made, using gradient descent to minimize the loss as each model is added. Both XGBoost and LightGBM are gradient boosting implementations known for their execution speed (compared with conventional gradient boosting implementations) and model performance.
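A minimal grid search sketch for a Random Forest, assuming scikit-learn's `GridSearchCV`. The parameter grid and the synthetic data are illustrative only and are not the grid our team searched:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic non-linear regression data standing in for the housing features.
rng = np.random.default_rng(1)
X = rng.uniform(size=(150, 4))
y = 2.0 * X[:, 0] + np.sin(4 * X[:, 1]) + rng.normal(scale=0.1, size=150)

# Small illustrative grid; every combination is scored with 5-fold CV.
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [3, None],
    "max_features": [0.5, 1.0],
}

search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid,
    cv=5,
    scoring="neg_root_mean_squared_error",
)
search.fit(X, y)

best_params = search.best_params_
best_cv_rmse = -search.best_score_  # scikit-learn reports negated RMSE
```

Grid search evaluates every combination, so its cost grows multiplicatively with each added hyperparameter; that is one reason random search and Bayesian optimization are attractive for larger search spaces.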

The figure below shows the feature importances from the Random Forest and LightGBM models. Overall, the feature importances are fairly similar between Random Forest and the gradient boosted trees. LightGBM, which achieved a lower RMSE than Random Forest (0.129 vs 0.149), gives more importance to the number of bedrooms than to OverallQual, which Random Forest ranks as the most important feature.

Figure 6: Feature importances from tree-based models

Finally, the table below summarizes the results from our different models.

| Model | RMSE |
| --- | --- |
| Multiple Linear Regression | 0.156 |
| LASSO | 0.125 |
| Ridge | 0.138 |
| ElasticNet | 0.127 |
| Random Forest | 0.149 |
| Gradient Boosting Machine | 0.128 |
| LightGBM | 0.129 |
| XGBoost | 0.140 |

Please feel free to reach out to us if you have any questions or concerns. Thank you!
