Thinking outside the House: predicting Ames home prices at the brink of the Great Recession

 

Abstract:

We created a supervised machine learning model to predict housing prices in Ames, Iowa.

Sections:

- EDA
- Preprocessing
- Modeling

Goal: We used data from 2006-2010 to develop a supervised machine learning model that predicts housing prices in light of the 2008 recession. Our process consisted of exploratory data analysis, feature engineering, preprocessing, model testing, and validation.

EDA:

As part of the Kaggle competition, we were given 2 datasets (train and test). Our training dataset consisted of 1459 rows and 80 columns with “sale price” being the target variable.

First, we split the data into categorical and numerical variables. Then we further divided the categorical variables into ordinal (suggesting an ordered relationship, e.g. “Overall rating from 1-10”) and nominal (no clear ordering).
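A minimal sketch of this split, using only the standard library. The column names and values are illustrative stand-ins, and the set of ordinal columns is an assumption that has to be chosen by hand; with pandas, `DataFrame.select_dtypes` would do the numeric/non-numeric part.

```python
# Toy rows standing in for the training data (values are made up).
rows = [
    {"GrLivArea": 1710, "OverallQual": 7, "ExterQual": "Gd", "Neighborhood": "NoRidge"},
    {"GrLivArea": 1262, "OverallQual": 6, "ExterQual": "TA", "Neighborhood": "Gilbert"},
    {"GrLivArea": 900, "OverallQual": 4, "ExterQual": "Fa", "Neighborhood": "MeadowV"},
]

# Assumption: ordered categorical columns are listed by hand, because an
# ordering (such as a 1-10 rating) cannot be inferred from the type alone.
ORDINAL = {"OverallQual", "ExterQual"}


def split_columns(rows, ordinal):
    """Return (numerical, ordinal, nominal) column name lists."""
    numerical, ordinal_cols, nominal = [], [], []
    for col in rows[0]:
        values = [r[col] for r in rows if r[col] is not None]
        if col in ordinal:
            ordinal_cols.append(col)
        elif all(isinstance(v, (int, float)) for v in values):
            numerical.append(col)
        else:
            nominal.append(col)
    return numerical, ordinal_cols, nominal


numerical, ordinal_cols, nominal = split_columns(rows, ORDINAL)
print(numerical, ordinal_cols, nominal)
```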

Histograms showed us how each variable's values were distributed and flagged several considerations. For instance, knowing which variables were skewed helped us decide whether a column's missing values were better imputed with the mean or the mode. Variables with a high degree of missingness also told us which columns could be dropped from the dataset.
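The two checks described here, skew and missingness, can be sketched numerically as follows. The data, the column names, and the 0.8 drop threshold are all hypothetical; the skewness used is the simple moment-based (population) version.

```python
def missing_fraction(values):
    """Share of entries that are missing (None)."""
    return sum(v is None for v in values) / len(values)


def skewness(values):
    """Moment-based skewness of the observed values: m3 / m2**1.5."""
    xs = [v for v in values if v is not None]
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n
    m3 = sum((x - mean) ** 3 for x in xs) / n
    return m3 / m2 ** 1.5


# Made-up columns: one numeric with a long right tail, one mostly missing.
columns = {
    "LotFrontage": [65.0, None, 80.0, 68.0, 60.0, 300.0],
    "PoolQC": [None, None, None, None, None, "Gd"],
}

DROP_THRESHOLD = 0.8  # hypothetical cutoff for "too much missingness"
to_drop = [c for c, v in columns.items() if missing_fraction(v) > DROP_THRESHOLD]
lot_skew = skewness(columns["LotFrontage"])

print(to_drop, round(lot_skew, 2))
```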

We then plotted a correlation matrix to understand the relationships between variables, to see which ones could be the best indicators of ‘Sale Price’, and to spot which ones might be multicollinear in a linear regression (discussed later). With the initial exploration done, it was time to preprocess the data and see how we could build the best predictive model.
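As a rough illustration of what the correlation matrix provides, the sketch below computes Pearson correlations by hand, ranks features by their correlation with the target, and checks one feature pair for possible multicollinearity. The column names and numbers are made up for the example.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Made-up values; the real analysis used the full training set.
data = {
    "GrLivArea": [1710, 1262, 1786, 1717, 2198],
    "TotRmsAbvGrd": [8, 6, 6, 7, 9],
    "SalePrice": [208500, 181500, 223500, 140000, 250000],
}

target = "SalePrice"
features = [c for c in data if c != target]

# Correlation of each feature with the target, strongest first.
corr_with_target = {c: pearson(data[c], data[target]) for c in features}
ranked = sorted(features, key=lambda c: abs(corr_with_target[c]), reverse=True)

# A high correlation between two predictors hints at multicollinearity.
feature_corr = pearson(data["GrLivArea"], data["TotRmsAbvGrd"])
print(ranked, round(feature_corr, 3))
```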

Thinking outside the house

During the initial EDA phase, we observed that the features in our data set only described or measured a property’s characteristics. No outside factors, such as external market forces that could significantly influence sale prices, were included. The data set covers 2006 - 2010. In 2008, the average sale price of homes dropped by almost $10k, and it remained below $180k for the rest of the time period.

[Figure: average sale price by year, Ames, IA]

A major discussion within our group was how this time period coincides with two major events. The first was the housing crisis in late 2007, which ultimately led to the second, the Great Recession in late 2008. Our consensus was that we needed to account for these events somehow. We ultimately added two index measures that could capture potential market volatility and inflationary risk: the VIX and the CPI.

Why VIX? The VIX, also known as the ‘Fear Index’, is a measure of expected volatility in the general stock market. The figure represents the market's expectation of annualized volatility in the S&P 500 index over the next 30 days, derived from the prices of S&P 500 options. Analysts use the VIX as a gauge of trader sentiment - higher VIX figures, specifically those greater than 20, coincide with a higher level of fear in the market.


After adding VIX figures for the time period covered, we found that the VIX generally has an inverse relationship with the monthly average sale price of homes. The VIX ultimately contributed to our final prediction models.
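One way such an outside index could be attached to each sale is a lookup keyed by sale year and month. This is only a sketch: the `YrSold`/`MoSold` field names are assumed, and the index values below are placeholders, not real VIX data.

```python
# Hypothetical monthly index values; placeholders, not real VIX data.
vix_by_month = {(2008, 9): 30.0, (2008, 10): 60.0, (2009, 1): 45.0}

# Made-up sales rows with the year and month in which each home sold.
sales = [
    {"YrSold": 2008, "MoSold": 9, "SalePrice": 180000},
    {"YrSold": 2008, "MoSold": 10, "SalePrice": 165000},
    {"YrSold": 2009, "MoSold": 1, "SalePrice": 170000},
]


def add_vix(sales, vix_by_month):
    """Return new rows with a 'VIX' field looked up by (year, month)."""
    enriched = []
    for row in sales:
        key = (row["YrSold"], row["MoSold"])
        # Months without an index value get None and can be imputed later.
        enriched.append({**row, "VIX": vix_by_month.get(key)})
    return enriched


enriched = add_vix(sales, vix_by_month)
print(enriched[1])
```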

Why CPI? The CPI, or Consumer Price Index, is a popular economic indicator used to track inflation. Economists use the annual change in CPI as the inflation rate, which we wanted to account for in our model. Unfortunately, CPI didn’t show any significant relationship with housing prices and was ultimately dropped from our final model.

Module:

We built our own pre-processing module (see our GitHub: https://github.com/michaelywang/Ames_House_Prices ) to do all the data processing to simplify 

 

Numerical preprocessing

Missing Values

For numerical variables, we imputed missing values with the mean. For categorical variables, we usually imputed with the mode, though there were few enough missing values that we could inspect them and fill them in on a case-by-case basis.
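A minimal sketch of the two default strategies, mean for numerical columns and mode for categorical ones. The column names and values are made up, and the case-by-case fills mentioned above are not modeled here.

```python
from statistics import mean, mode


def impute(values, strategy):
    """Fill None entries with the mean or the mode of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed) if strategy == "mean" else mode(observed)
    return [fill if v is None else v for v in values]


lot_frontage = [60.0, None, 80.0, 70.0]             # numerical column
garage_type = ["Attchd", "Detchd", None, "Attchd"]  # categorical column

filled_numeric = impute(lot_frontage, "mean")
filled_categorical = impute(garage_type, "mode")
print(filled_numeric, filled_categorical)
```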

 Log Transformation

We log transformed Sale Price because its distribution is right-skewed, and a linear model works better when the target is closer to normally distributed.
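The transformation and its inverse can be sketched as below, with made-up prices. The point of the example is that the log compresses the long right tail, and that predictions made on the log scale need to be mapped back with `exp`.

```python
from math import exp, log

sale_prices = [120000.0, 180000.0, 250000.0, 755000.0]  # made-up values
log_prices = [log(p) for p in sale_prices]

# A model fit on log_prices predicts on the log scale, so its
# predictions are mapped back to dollars with exp before reporting.
recovered = [exp(v) for v in log_prices]

# The spread between the largest and smallest value shrinks sharply.
raw_ratio = max(sale_prices) / min(sale_prices)
log_ratio = max(log_prices) / min(log_prices)
print(round(raw_ratio, 2), round(log_ratio, 2))
```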

[Figure: scatter plot of living space vs. log Sale Price, Ames, IA]

Categorical preprocessing

Ordinal labeling

Code:     NaN   Po    Fa    TA       Gd    Ex
Meaning:  None  Poor  Fair  Average  Good  Excellent
Value:    0     1     2     3        4     5

When a categorical variable has a clear order, we manually encoded its values.

For features describing quality, we encoded "None" as 0, "Poor" as 1, and so on up to "Excellent" as 5. The drawback is that the true gaps between levels are unknown, while integer codes treat them as equally spaced.
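The quality scale from the table above can be written as a plain lookup. This is a sketch rather than the code in our module; missing values are represented here as `None`.

```python
# Quality codes mapped to integers, following the table above.
QUALITY_SCALE = {None: 0, "Po": 1, "Fa": 2, "TA": 3, "Gd": 4, "Ex": 5}


def encode_quality(values):
    """Map quality codes to their integer ranks; unknown codes raise KeyError."""
    return [QUALITY_SCALE[v] for v in values]


exter_qual = ["Gd", "TA", None, "Ex"]
encoded = encode_quality(exter_qual)
print(encoded)
```

Raising on an unknown code, rather than silently mapping it to 0, makes typos in the data visible early.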


One Hot Encoding

NoRidge MeadowV Gilbert
1 0 0
0 1 0
0 0 1

We dummified the nominal columns (converting each category into its own binary yes/no column) so that each category would have its own coefficient.
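A small sketch of dummification without any libraries, using the three neighborhoods from the table above. In pandas, `pandas.get_dummies` performs the same conversion; the helper below is only illustrative.

```python
def one_hot(values):
    """Return (categories, rows) with one 0/1 column per category."""
    # dict.fromkeys keeps first-seen order while removing duplicates.
    categories = list(dict.fromkeys(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows


neighborhoods = ["NoRidge", "MeadowV", "Gilbert"]
categories, rows = one_hot(neighborhoods)

print(categories)
for row in rows:
    print(row)
```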

Testing

How did we find our optimum model?

We started by fitting a multiple linear regression on our fully preprocessed data set to get an idea of performance and to set a baseline. Without any transformations to the data, we wanted to see which coefficients had a significant impact on the sale price. As mentioned in our exploratory data analysis section, sale price itself had a right skew. Anticipating that transforming our response variable would improve the model, we corrected for this skewness, which alone improved the test score by .03.

 

 

Linear Regression

One key problem in building an optimum predictive model is accounting for multicollinearity. This occurs when two predictors are highly correlated with one another, introducing redundancy. The model can still reach a high R^2 score, but the individual coefficients become unstable and the model risks fitting noise rather than signal.

Normalization

Additionally, a linear regression model needs normalization so that the scale of the units doesn’t distort the apparent effect of some features relative to others. For example, if one feature is measured in inches and another in yards, the inch coefficient will be much smaller to compensate, so the raw coefficients cannot be compared directly, and a penalty on coefficient size would treat the two features unequally. Instead, we standardize each feature by subtracting its mean and dividing by its standard deviation, so every coefficient is expressed per standard deviation of its feature. This lets us compare coefficients and apply penalized regression fairly.
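To make the inches-versus-yards example concrete, the sketch below standardizes the same made-up lengths expressed in both units and shows that the resulting z-scores are identical. The population standard deviation is used here as an assumption; scikit-learn's `StandardScaler` makes the same choice.

```python
from statistics import mean, pstdev


def standardize(values):
    """Convert values to z-scores: (value - mean) / standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


widths_in = [120.0, 240.0, 360.0]        # made-up lengths in inches
widths_yd = [w / 36 for w in widths_in]  # the same lengths in yards

z_in = standardize(widths_in)
z_yd = standardize(widths_yd)

# After standardization the choice of unit no longer matters.
print([round(z, 4) for z in z_in])
print([round(z, 4) for z in z_yd])
```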

Regularized Regression

To simplify our model and see which features had the best predictive capability for sale price, we used a series of penalized models. These accept a small increase in bias, by limiting the effect of each feature on the predicted sale price, in exchange for reduced variance (and thus a lower risk of overfitting). Our best model was Lasso, which dropped a number of features that did not correlate strongly with sale price.

Briefly, the difference between the Lasso and Ridge models we tested is that Lasso can shrink coefficients all the way to zero (eliminating a feature's impact on the predicted sale price), while Ridge only shrinks coefficients toward zero without eliminating any feature. Elastic Net combines the two penalties. We tried all three to see which would give us the best overall result on the Kaggle test data set.
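The contrast is easiest to see in a simplified setting. Assuming the features are orthonormal, and writing the objectives as (1/2)·RSS + λ·Σ|β| for Lasso and (1/2)·RSS + (λ/2)·Σβ² for Ridge, each penalized coefficient is a simple function of the ordinary least squares coefficient. The coefficients and λ below are made up, and real data will not satisfy the orthonormality assumption exactly.

```python
def ridge_shrink(beta_ols, lam):
    """Ridge under an orthonormal design: scale the coefficient down."""
    return beta_ols / (1.0 + lam)


def lasso_shrink(beta_ols, lam):
    """Lasso under an orthonormal design: soft-threshold the coefficient."""
    if abs(beta_ols) <= lam:
        return 0.0  # small coefficients are removed entirely
    return beta_ols - lam if beta_ols > 0 else beta_ols + lam


ols_coefs = [3.0, -0.4, 0.1]  # hypothetical unpenalized coefficients
lam = 0.5                     # hypothetical penalty strength

ridge = [ridge_shrink(b, lam) for b in ols_coefs]
lasso = [lasso_shrink(b, lam) for b in ols_coefs]

print([round(b, 4) for b in ridge])  # every coefficient shrinks, none is zero
print(lasso)                         # the two small coefficients become zero
```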

Cross-Validation

To properly tune our models, we searched for the optimal hyperparameters (lambda/rho) used in the penalized regression equation. We performed 5-fold cross-validation for Lasso, Ridge, and Elastic Net to find the best hyperparameters for each, which then informed model selection. To avoid exhausting our computational resources, we first iterated over a coarse range of values and then searched a finer range around the best candidate.
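The coarse-then-fine search can be sketched on a toy problem. To keep the example self-contained it uses a one-feature Ridge model with a closed-form fit and synthetic data, and it tunes only lambda; the grids are hypothetical, and the actual search also covered Lasso and Elastic Net (including the mixing parameter rho).

```python
import random


def ridge_fit(xs, ys, lam):
    """One-feature ridge without intercept: sum(xy) / (sum(x^2) + lam)."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)


def cv_mse(xs, ys, lam, k=5):
    """Mean squared error over k folds, each fold held out once."""
    n = len(xs)
    total = 0.0
    for fold in range(k):
        held_out = range(fold, n, k)  # every k-th point forms one fold
        train_x = [xs[i] for i in range(n) if i not in held_out]
        train_y = [ys[i] for i in range(n) if i not in held_out]
        beta = ridge_fit(train_x, train_y, lam)
        total += sum((ys[i] - beta * xs[i]) ** 2 for i in held_out)
    return total / n


def best_lambda(xs, ys, grid):
    """Pick the grid value with the lowest cross-validated error."""
    return min(grid, key=lambda lam: cv_mse(xs, ys, lam))


# Synthetic data: y = 2x + noise, with a fixed seed for repeatability.
rng = random.Random(0)
xs = [rng.gauss(0, 1) for _ in range(50)]
ys = [2.0 * x + rng.gauss(0, 1) for x in xs]

# Pass 1: coarse grid over several orders of magnitude.
coarse = [10.0 ** p for p in range(-3, 4)]
lam_coarse = best_lambda(xs, ys, coarse)

# Pass 2: finer grid centered on the coarse winner.
fine = [lam_coarse * f for f in (0.25, 0.5, 1.0, 2.0, 4.0)]
lam_fine = best_lambda(xs, ys, fine)

print(lam_coarse, lam_fine)
```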

Conclusion:

How did we do on the Kaggle Score?

1 - R^2 (of our predicted Y vs. the True Test Y) = 0.127

indicating that the model explains roughly 87% of the variance in sale price given these variables.
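For reference, R^2 can be computed from true and predicted values as sketched below. The numbers are made up and are not our test results. The second helper, root mean squared error on the log scale, is included as another common way to score price predictions.

```python
from math import log, sqrt


def r_squared(y_true, y_pred):
    """1 - (residual sum of squares / total sum of squares)."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def rmse_log(y_true, y_pred):
    """Root mean squared error between log(true) and log(predicted)."""
    errs = [(log(t) - log(p)) ** 2 for t, p in zip(y_true, y_pred)]
    return sqrt(sum(errs) / len(errs))


y_true = [200000.0, 150000.0, 300000.0]  # made-up sale prices
y_pred = [210000.0, 140000.0, 290000.0]  # made-up predictions

r2 = r_squared(y_true, y_pred)
print(round(r2, 3), round(1 - r2, 3), round(rmse_log(y_true, y_pred), 4))
```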

We found that our best model used Lasso, reducing our predictors to __ . We were quite impressed that a “simple” regularized regression model performed as well as or better than the more complex models most of our peers tried. This speaks to the fact that each data set should be examined on a case-by-case basis to find the best model for the situation. For future work, we could test more market measures to see their impact and build a more holistic understanding of the housing market.

About Authors

Precious Chima


Precious Chima is an NYC Data Science Fellow with a Bachelor's Degree in Applied Mathematics & Statistics from Stony Brook University. Prior to enrolling in the NYCDSA, he worked in the Oil & Gas industry, specializing in optimizing drilling...
Daniel Avila


Data science professional with extensive BI experience in the financial sector. Always build solutions with scale and practicality in mind. Constantly dig for the stories that lie within data, and love sharing these discoveries. Self-proclaimed NBA historian. Loyal...