Studying Data to Predict Iowa Housing Prices

The skills the author demoed here can be learned by taking the Data Science with Machine Learning bootcamp with NYC Data Science Academy.

Team Members:  Rajesh Arasada, Yung Cho, Nilesh Patel, Pankaj Sharma, Tim Waterman

Problem Definition

As part of our curriculum at the NYC Data Science Academy 12-week bootcamp, our team entered the House Prices: Advanced Regression Techniques challenge on Kaggle. The dataset contains information on 1,460 houses sold in Ames, Iowa between 2006 and 2010. The challenge is a supervised machine learning regression problem: our team was asked to learn the mapping from 80 input features to a real-valued output ('SalePrice').

Our goal was to develop a model that is both accurate, in that it predicts 'SalePrice' close to the true value, and interpretable, in that it helps buyers and sellers make informed decisions.

In this blog, our team will share the steps we followed to build a predictive model in Python and R. Some illustrative examples of the transformations we applied are described below.

Understanding the data

Since it is cumbersome to explore all the features at once, our team broke up the task by dividing the data into smaller sub-sections of features for examination. Our team paid particular attention to any feature that:

  • contained redundant information because it is highly correlated with another feature
  • had a high percentage of missing values
  • had only one value, or values whose frequency in the dataset is insignificant

Our team combined the train and test datasets for exploratory data analysis, data cleaning and feature engineering. Some of our basic takeaways are as follows:

Redundant Features

Our team found that two sets of features, one describing the basement and the other the living area, carry redundant information:

'TotalBsmtSF' = 'BsmtFinSF1' + 'BsmtFinSF2' + 'BsmtUnfSF'

'GrLivArea' = '1stFlrSF' + '2ndFlrSF' + 'LowQualFinSF'

Hence, our team only retained 'TotalBsmtSF' and 'GrLivArea'.
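This consistency check and drop can be sketched in pandas; the rows below are made-up values standing in for the combined Ames frame, but the column names are the real ones:

```python
import pandas as pd

# Toy rows standing in for the combined train + test Ames data
df = pd.DataFrame({
    "BsmtFinSF1": [400, 0], "BsmtFinSF2": [0, 100], "BsmtUnfSF": [200, 300],
    "TotalBsmtSF": [600, 400],
    "1stFlrSF": [800, 600], "2ndFlrSF": [400, 0], "LowQualFinSF": [0, 50],
    "GrLivArea": [1200, 650],
})

# Confirm the identities hold before dropping the component columns
assert (df["TotalBsmtSF"] ==
        df[["BsmtFinSF1", "BsmtFinSF2", "BsmtUnfSF"]].sum(axis=1)).all()
assert (df["GrLivArea"] ==
        df[["1stFlrSF", "2ndFlrSF", "LowQualFinSF"]].sum(axis=1)).all()

# Keep only the totals; the components carry no extra information
redundant = ["BsmtFinSF1", "BsmtFinSF2", "BsmtUnfSF",
             "1stFlrSF", "2ndFlrSF", "LowQualFinSF"]
df = df.drop(columns=redundant)
```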

Missing Data and Imputation

One of the challenges of this dataset is the missing data (Figure 1). Over 80% of the values are missing for variables like 'Alley', 'Fence', and 'PoolQC'. For these features, a missing value indicates that the feature is not present in the house, so our team imputed the missing values in these columns with 'None'.

All other features except 'LotFrontage' were imputed with the feature's most frequent value, or with a hardcoded value based on what made the most sense in each situation.
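A minimal sketch of these two imputation rules, using two real column names but hypothetical values:

```python
import pandas as pd

df = pd.DataFrame({
    "PoolQC": [None, "Ex", None],    # missing means the house has no pool
    "MSZoning": ["RL", None, "RM"],  # missing at random
})

# Absence-means-none features get the literal string 'None'
for col in ["PoolQC"]:
    df[col] = df[col].fillna("None")

# Remaining categoricals get their most frequent value (the mode)
for col in ["MSZoning"]:
    df[col] = df[col].fillna(df[col].mode()[0])
```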

To impute the 485 missing values in the 'LotFrontage' column, our team relied on the fancyimpute library and created a KNN model using all the other predictors.
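The post used fancyimpute; scikit-learn's `KNNImputer` implements the same KNN-based idea and can stand in for it in a sketch (synthetic data below, not the actual Ames columns):

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Synthetic stand-in: frontage roughly proportional to lot area, plus noise
rng = np.random.default_rng(0)
lot_area = rng.uniform(5000, 15000, 50)
lot_frontage = lot_area / 150 + rng.normal(0, 2, 50)
df = pd.DataFrame({"LotArea": lot_area, "LotFrontage": lot_frontage})
df.loc[::10, "LotFrontage"] = np.nan  # knock out every 10th value

# Each missing value is filled from its 5 nearest neighbors in feature space
imputer = KNNImputer(n_neighbors=5)
filled = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
```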


Figure 1: Heatmap of the Ames housing features showing the missingness in the data (left) before and (right) after data cleaning. Yellow shading represents the missing data points. The yellow shade in the heatmap on the right highlights the missing SalePrice information in the test dataset.

Feature Transformations

Our team engineered two new features that can help increase the model's performance:

  • 'HouseAge' = 2018 - 'YearBuilt'
  • 'Bathrooms' = 'BsmtFullBath' + 'BsmtHalfBath' * 0.5 + 'FullBath' + 'HalfBath' * 0.5
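The two engineered features can be computed directly from the definitions above (toy values):

```python
import pandas as pd

df = pd.DataFrame({
    "YearBuilt": [1995, 2004],
    "BsmtFullBath": [1, 0], "BsmtHalfBath": [0, 1],
    "FullBath": [2, 1], "HalfBath": [1, 0],
})

# Age relative to 2018, the year of the analysis
df["HouseAge"] = 2018 - df["YearBuilt"]

# Half baths count for half a bathroom
df["Bathrooms"] = (df["BsmtFullBath"] + 0.5 * df["BsmtHalfBath"]
                   + df["FullBath"] + 0.5 * df["HalfBath"])
```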

Treatment of categorical and ordinal variables

After all the features were created, our team used label encoding for the ordinal features and one-hot encoding for the nominal categorical features.
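A minimal sketch of the two encodings; the quality-to-integer mapping shown here is a typical choice for the Ames quality codes, since the post does not list the exact mapping used:

```python
import pandas as pd

df = pd.DataFrame({
    "ExterQual": ["TA", "Gd", "Ex"],                # ordinal: quality has a natural order
    "Neighborhood": ["NAmes", "CollgCr", "NAmes"],  # nominal: no order
})

# Ordinal features: map quality codes onto integers that preserve the order
quality_map = {"Po": 1, "Fa": 2, "TA": 3, "Gd": 4, "Ex": 5}
df["ExterQual"] = df["ExterQual"].map(quality_map)

# Nominal features: one-hot encode into indicator columns
df = pd.get_dummies(df, columns=["Neighborhood"])
```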

Correlations between Features and Target

Our team removed highly collinear variables, since collinear features increase model complexity and decrease model generalization. To quantify the relationships between variables, our team plotted the correlation matrix of the cleaned dataset. Some of the basic takeaways from our correlation matrix are as follows:

  • 'GarageCars' and 'GarageArea' are highly correlated with each other and with 'SalePrice'. Since 'GarageCars' has a higher correlation with 'SalePrice' than 'GarageArea' does (0.640409 vs. 0.623431), our team dropped 'GarageArea'.
  • A strong correlation is also seen between 'TotRmsAbvGrd' and 'GrLivArea'. 'GrLivArea' has a higher correlation with 'SalePrice' than 'TotRmsAbvGrd' does (0.708624 vs. 0.613581), so our team dropped 'TotRmsAbvGrd'.
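The keep-the-stronger-correlate rule can be sketched as follows (synthetic data; the correlations quoted above come from the real dataset):

```python
import numpy as np
import pandas as pd

# Synthetic stand-in: garage area scales with car capacity, both drive price
rng = np.random.default_rng(1)
n = 200
garage_cars = rng.integers(0, 4, n)
df = pd.DataFrame({
    "GarageCars": garage_cars,
    "GarageArea": garage_cars * 250 + rng.normal(0, 30, n),
    "SalePrice": garage_cars * 40000 + rng.normal(180000, 20000, n),
})

# Of a collinear pair, keep whichever correlates more strongly with the target
corr = df.corr()["SalePrice"]
drop = "GarageArea" if corr["GarageCars"] >= corr["GarageArea"] else "GarageCars"
df = df.drop(columns=[drop])
```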

Figure 2: Correlation matrix showing the relationships between features

Outlier Treatment

After narrowing down the variables that explain the most variance in the target variable 'SalePrice', our team removed extreme outliers from 'GrLivArea' and 'TotalBsmtSF', since they are highly correlated with the target. In total, our team removed 7 observations (extreme outliers) that were

below the first quartile - 3 * interquartile range, or above the third quartile + 3 * interquartile range
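The fence rule above can be sketched as (toy values, with one obvious extreme point):

```python
import pandas as pd

# Toy living-area values; 9000 is an extreme outlier
s = pd.Series([1200, 1350, 1500, 1480, 1600, 1700, 9000])

q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1

# Keep only points within [Q1 - 3*IQR, Q3 + 3*IQR]
mask = (s >= q1 - 3 * iqr) & (s <= q3 + 3 * iqr)
kept = s[mask]
```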


Figure 3: Scatter plots before and after the removal of extreme outliers in GrLivArea and TotalBsmtSF

Transformation of Target Variable

Our team plotted the histogram of the 'SalePrice' distribution, observed a positive skew, and applied a log transformation to bring the values closer to a normal distribution.
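A sketch of the transformation; `np.log1p` (log of 1 + x) is a common choice because it is safely invertible with `np.expm1`, though the post does not say which log variant was used:

```python
import numpy as np

# Toy right-skewed prices: a few expensive houses stretch the tail
prices = np.array([120000.0, 150000.0, 185000.0, 250000.0, 755000.0])

# Compress the tail; models are trained on the log scale
log_prices = np.log1p(prices)

# Predictions are mapped back to dollars with the inverse transform
recovered = np.expm1(log_prices)
assert np.allclose(recovered, prices)
```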



Figure 4: Density and probability plots of target variable before and after transformation

Machine Learning Data Set

Once all the features were created, our dataset had 356 features. Our team created multiple models to predict the sale prices of the houses in Iowa:

  1. Regularized Multiple Linear Regression
  2. Random Forest  (GridSearch)
  3. Stochastic Gradient Boosting (GridSearch)
  4. XGBoost (Stepwise Tuning)
  5. LightGBM (Grid Search, Random Search & Bayesian Hyperparameter Optimization)

Our dataset was split randomly into an 80% train set and a 20% test set. Our team fit the models on the training set using 5-fold cross-validation to reduce selection bias and the variance in predictive power. We then used the fitted models to predict the outcomes on the held-out test set in order to assess the accuracy and variance of our different models.
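The split and cross-validation scheme can be sketched as follows (synthetic data and a plain linear model standing in for the actual features and models):

```python
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LinearRegression

# Synthetic regression problem: 100 rows, 5 features
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ np.array([3.0, -2.0, 0.5, 0.0, 1.0]) + rng.normal(0, 0.1, 100)

# 80/20 random split
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

# 5-fold cross-validation on the training set, scored by RMSE
scores = cross_val_score(LinearRegression(), X_tr, y_tr, cv=5,
                         scoring="neg_root_mean_squared_error")
```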

Multiple Linear Regression

We built a multivariate linear regression including all the features in the dataset and used it as our baseline model. We then built three regularized linear regression models (LASSO, Ridge, and Elastic Net) with alpha chosen by cross-validation. The Elastic Net model performed best of all the models on the test data, as shown in the results table below. The figure below shows the top 15 features by importance in our LASSO model.

Figure 5: Feature importance from LASSO
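The three regularized models with cross-validated alpha can be sketched as follows (synthetic data; `LassoCV`, `RidgeCV`, and `ElasticNetCV` each pick the penalty strength by cross-validation, and the LASSO coefficients double as a feature-importance measure):

```python
import numpy as np
from sklearn.linear_model import LassoCV, RidgeCV, ElasticNetCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: only the first two features actually matter
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = 5 * X[:, 0] - 3 * X[:, 1] + rng.normal(0, 0.5, size=200)

# Scale first so the penalty treats all coefficients equally
models = {
    "lasso": make_pipeline(StandardScaler(), LassoCV(cv=5)),
    "ridge": make_pipeline(StandardScaler(), RidgeCV(cv=5)),
    "enet": make_pipeline(StandardScaler(), ElasticNetCV(cv=5)),
}
for model in models.values():
    model.fit(X, y)

# Rank features by the magnitude of their LASSO coefficients
lasso_coefs = models["lasso"][-1].coef_
top_feature = int(np.argmax(np.abs(lasso_coefs)))
```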

Tree-based Models

Our team chose Decision Trees as our base model and then employed some of the more popular ensemble algorithms: Random Forest, Gradient Boosting Machines, XGBoost, and LightGBM. These ensembles were chosen to compensate for the overfitting seen with single Decision Trees.

Our team optimized the hyperparameters using either GridSearch or Bayesian optimization. Random Forest is an ensemble of Decision Trees, often trained with the "bagging" method: the algorithm builds multiple decision trees on random subsets of the features and merges them to get a more accurate and stable prediction. In Gradient Boosting Machines, new models are added sequentially to correct the errors made by the existing models until no further improvement can be made; gradient descent is used to minimize the loss when adding each new model. Both XGBoost and LightGBM are known for their execution speed (compared to plain decision trees) and model performance.
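The GridSearch tuning step can be sketched for Random Forest (synthetic data and a hypothetical two-parameter grid; the actual grids used are not listed in the post):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Small synthetic regression problem with a nonlinear term
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 4))
y = X[:, 0] ** 2 + X[:, 1] + rng.normal(0, 0.1, 120)

# Exhaustively score every parameter combination with 5-fold CV
param_grid = {"n_estimators": [50, 100], "max_depth": [4, 8]}
search = GridSearchCV(RandomForestRegressor(random_state=0), param_grid,
                      cv=5, scoring="neg_root_mean_squared_error")
search.fit(X, y)
best = search.best_params_
```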

The figure below shows the feature importances from the Random Forest and LightGBM models. Overall, the feature importances are fairly similar between Random Forest and Stochastic Gradient Boosting trees. LightGBM appears to be a slightly better predictor than Random Forest, giving more importance to the number of bedrooms than to 'OverallQual', which Random Forest selects as the most important feature.

Figure 6: Feature importances from tree-based models

Finally, the table below summarizes the results from all of our models.

Model                       RMSE
Multiple Linear Regression  0.156
LASSO                       0.125
Ridge                       0.138
ElasticNet                  0.127
Random Forest               0.149
Gradient Boosting Machine   0.128
LightGBM                    0.129
XGBoost                     0.140

Please feel free to reach out to us if you have any questions or concerns. Thank you!

About Authors

Rajesh Arasada

Data scientist and cell biologist with more than 10 years of biomedical research experience. Has implemented machine learning (ML) algorithms in R and Python to solve real-world problems.
