The Care & Handling of PCA vs. Predicting Housing Prices

Posted on Aug 27, 2021

Kaggle Data | Github | LinkedIn

Introduction

Real estate, the market for land and housing, has long stood as one of the most influential and vast market forces in the world. It is fitting, then, that data scientists cut their teeth on its study, particularly on predicting housing prices and on ascertaining precisely what affects an assessor's valuation of a given property. However, where data scientists and entrepreneurs have typically studied the popular Boston Housing data set, Dr. Dean DeCock, a professor of Statistics at Truman State University, created the Ames, Iowa Housing data set for even more robust regression practice.

Prospective Approach

Dr. DeCock provides a great deal of analysis in his own paper, illuminating the pattern behind some outliers as well as the relative importance of given features in his results. He particularly emphasizes the significance of Ground Living Area, Neighborhood, and total Lot Area. Armed with this foreknowledge, I hypothesized that my study would reproduce precisely this pattern, and more besides. Drawing from my own experience exploring housing options, I further supposed that we might see relatively large importance attached to the quality of the kitchen as well as to the dates of a house's construction and remodeling.

While it is tempting to follow in the well-trodden footsteps of those who came before, I began this project with a directive to approach my data agnostically, in keeping with best practice: approach the data naively, graphing everything and comparing it carefully. Finally, having learned much about unsupervised learning, PCA in particular, I wished to put these methods to the test and answer one question: how would my results fare with PCA included in the processing pipeline?

Exploring the Data

The naive overview allows the differences between variables to stand out in stark contrast with one another.

Shortly after loading the data and performing the train-test split, I took the naive approach and graphed every variable available to me. This cursory overview surfaced many of the usual suspects: junk features (e.g. Id), continuous features (e.g. LotArea), ordinal features (e.g. LandContour), categorical features (e.g. MSZoning), and so on.
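As a sketch of that first pass (the function and column handling here are illustrative, not the notebook's actual code), splitting first and then triaging columns for plotting might look like:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

def split_and_triage(df: pd.DataFrame, target: str = "SalePrice"):
    """Split before any exploration, then sort the remaining columns
    into rough groups for graphing. This triage is deliberately naive;
    the ordinal vs. nominal distinction comes later by hand."""
    X = df.drop(columns=target)
    y = df[target]
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.2, random_state=42
    )
    numeric = X_tr.select_dtypes("number").columns.tolist()
    categorical = X_tr.select_dtypes(exclude="number").columns.tolist()
    return X_tr, X_te, y_tr, y_te, numeric, categorical
```

Splitting before graphing keeps every chart and statistic strictly on the training data, so nothing about the test set leaks into later decisions.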

Considering that this is a well-regarded Kaggle data set, and therefore already fairly clean, relatively little stood out as particularly noteworthy at this point beyond the obvious and the few outliers apparent from the graphs. DeCock discusses these fairly thoroughly in his paper; nevertheless, I will address two approaches to dealing with them in my own strategy.

Correlation Heatmap of Missingness from the Missingno Package. The bluer an intersection, the more strongly the two columns tend to be missing together; the redder, the more one is missing when the other is present. Pure grey suggests either too little data to correlate or no correlation at all, as with Electrical. There are a great many other methods built into the Missingno suite, and I explore their use further in my Jupyter Notebook.

With this correlation heatmap, three groupings of missing data become immediately apparent: Basement, Garage, and Masonry. In each of the three cases, the missingness is unsurprising: a house that has no garage will have no information in any of the garage categories, and likewise for the masonry and basement features. Similarly, we need not worry about Fireplace, Pool, Fence, or Alley.

There are, however, three special cases worth further explication:

  • MiscFeature: this column records house features that are rare but may affect the price, such as an adjoining tennis court. Few houses will have one, but where present it almost certainly matters. The majority of the column is therefore missing by design.
  • Electrical: there was only one missing value in the Electrical column. Where it would be tempting to dismiss this as simple error, close inspection of the house in question reveals it was a converted shed, giving us good reason for the missingness.
  • LotFrontage: in this column alone do we see data we can classify as MAR, or "Missing At Random." Its missingness showed no correlation with any other feature, yet lot frontage, the total length of contact between a lot and the street, measured in feet, should logically always have a value for any lot type.

Analysis & Strategy

Decisions about how to engineer and impute information in a dataset depend entirely on the outlook and analytical strategy one takes. Herein, to reiterate, my goal was to probe the usability of unsupervised learning, particularly PCA, and to compare the results across a number of classical models: Ridge Regression, Support Vector Regression, Random Forest Regression, and Gradient Boosting Regression. Furthermore, I wanted to contrast the two tree-based models with the performance of XGBoost's Random Forest and XGBoost itself. This spread would allow me to examine classic linear regression alongside tree-based performance, contrasting their methods of dealing with both collinearity and outliers.

I chose to take a general approach in order to test reliance on PCA for feature selection, eliminating ahead of the actual modeling process only a few select variables, particularly noisy secondary ones such as Secondary Basement Square Footage, which measures only the second type of finished flooring in the basement. To better tune the predictions, I would also employ scikit-learn's GridSearchCV to optimize the modeling hyperparameters.

With all of that said, my imputations and modifications to the data pre-modeling can be summarized as follows:

  • Removed the Id, BsmtFinSF1, BsmtFinSF2, 1stFlrSF, and 2ndFlrSF columns
  • Imputed all remaining missing data, save for LotFrontage, with 0 for numeric columns and "NA" for categorical columns, prior to encoding
  • LotFrontage imputed with the median of the column via SimpleImputer. I considered creating a bespoke function to impute distinct medians per lot type, since lot type should significantly affect frontage, but decided against it as only roughly 70 data points were missing here.
  • Altered the year columns to Years Since Built and Years Since Remodel, respectively
  • Encoded categorical information using OneHotEncoder and OrdinalEncoder for the respective data types
  • Scaled the data using scikit-learn's StandardScaler

Modeling Iteration, the 1st

Over the course of this first processing run, I chose to compare the results of each step side by side: 1) modeling the scaled data, 2) modeling the scaled and PCA-transformed data, and 3) modeling the GridSearch-optimized, scaled, and PCA-transformed data.
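Steps 2 and 3 chain together naturally in scikit-learn. The sketch below uses Ridge as a stand-in for each model in the spread, with an illustrative parameter grid rather than the one from the notebook:

```python
from sklearn.decomposition import PCA
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA(n_components=0.95)),  # keep 95% of the variance
    ("model", Ridge()),
])

# Step 3: let the grid search tune the PCA cutoff and the model together.
search = GridSearchCV(
    pipe,
    param_grid={"pca__n_components": [0.90, 0.95, 0.99],
                "model__alpha": [0.1, 1.0, 10.0]},
    cv=5,
    scoring="r2",
)
```

Because the scaler and PCA live inside the pipeline, they are re-fit on each cross-validation fold, so the search scores are not contaminated by the held-out fold.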

First Comparative Iteration

The standout result from these modeling runs was the score from the GradientBoost model without PCA: while a great many models hit train scores close to or even at a perfect 100%, the vast majority of those proved completely overfit to the training data.

The remaining models performed more or less as expected: without PCA, Ridge and SVR performed significantly worse, while the tree-based models remained strong. Pipelining PCA into the processing significantly reduced the overfitting of the non-tree models, but failed to bring their scores into a decent range. Thus, of all the scores, the GradientBoost model maintained its lead.

Iteration, the 2nd

Before ditching PCA entirely, I wanted to investigate the outliers in the data mentioned by the author. Rather than identify outliers arbitrarily, I employed Cook's Distance to flag the data points to remove. Not only did this identify a few dozen outliers, it also let me avoid eliminating influential yet useful data.
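Cook's Distance is available from statsmodels' influence diagnostics; as a self-contained sketch, it can also be computed directly from the OLS hat matrix, with points above the common 4/n rule of thumb flagged as candidates:

```python
import numpy as np

def cooks_distance(X: np.ndarray, y: np.ndarray) -> np.ndarray:
    """Cook's distance for an OLS fit with intercept, computed from
    the hat matrix: D_i = (e_i^2 / (k * MSE)) * h_ii / (1 - h_ii)^2."""
    n = len(y)
    X1 = np.column_stack([np.ones(n), X])      # add intercept column
    H = X1 @ np.linalg.pinv(X1.T @ X1) @ X1.T  # hat matrix
    h = np.diag(H)                             # leverages
    resid = y - H @ y                          # OLS residuals
    k = X1.shape[1]                            # parameter count
    mse = resid @ resid / (n - k)
    return (resid ** 2 / (k * mse)) * h / (1.0 - h) ** 2

# Candidate outliers: cooks_distance(X, y) > 4 / len(y)
```

The appeal over a raw residual cutoff is the leverage term: a point is flagged only when removing it would meaningfully move the fitted coefficients, which is exactly the "influential yet useful" distinction above.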

Cook's Distance of Training Data

The results, however, failed to yield much improvement. The Ridge and SVR models saw some improvement, while the tree-based models all saw poorer results. This pattern makes intuitive sense: where the former model types are vulnerable to outliers, tree-based models, depending on their complexity, are far less at risk in such an environment.

Second Comparative Iteration, a.k.a. the "Fully Cooked Iteration"

The Moral of the Story: the Dangers of Unsupervised Learning

The problem with relying on PCA is the assumption that it provides automatic, hands-free feature engineering. PCA discards components according to the amount of variance they explain, but we want to retain good predictors of price, not of variance. So while PCA may prove useful as a tool to speed processing and explain elements of a data set, it clearly failed as an automated form of feature selection.
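A toy example makes the point concrete (synthetic data, not the Ames set): give PCA one high-variance noise feature and one low-variance feature that actually drives the target, and the single retained component all but ignores the predictor.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
noise = rng.normal(scale=10.0, size=500)   # high variance, no signal
signal = rng.normal(scale=0.1, size=500)   # low variance, all signal
y = 5.0 * signal                           # target depends only on `signal`
X = np.column_stack([noise, signal])

pca = PCA(n_components=1).fit(X)
# The retained component aligns with the noisy axis, so nearly all
# of the information about y is thrown away.
loading_on_signal = abs(pca.components_[0, 1])
```

PCA never sees y, so it has no way to know which direction of variance matters for prediction; that is the structural reason it cannot stand in for supervised feature selection.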

To confirm my final hypothesis, I re-ran my optimizations on the scaled data without PCA, with great success.

Third Comparative Iteration using "UnCooked" data. To save time, I will mention here that the tree-based models performed worse without the outliers, so I stuck with the "raw" results. If you wish to review those results, please consult the Github link above.

Here we see a continuation of, and significant improvement upon, the initial default train scores: SVR nearly matched Ridge, as XGBoost did GradientBoost. GradientBoost, however, remained triumphant with our strongest score yet at 92.8%.

What About Housing?

With these new scores, I immediately set to ascertaining the relative importance of our housing features. Rather than reach for the classic feature importances, I employed scikit-learn's permutation importance, which provides similarly interpretable results while avoiding some of the bias inherent in impurity-based importance.
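In scikit-learn this is a single call against a fitted model and held-out data; the sketch below substitutes synthetic data for the housing pipeline:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=300, n_features=8, n_informative=3,
                       random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = GradientBoostingRegressor(random_state=0).fit(X_tr, y_tr)

# Scoring on held-out data is what lets this avoid the impurity-based
# importances' bias toward high-cardinality features.
result = permutation_importance(model, X_te, y_te, n_repeats=10,
                                random_state=0)
ranking = result.importances_mean.argsort()[::-1]  # most important first
```

Each feature's importance is the drop in test score when that column alone is shuffled, so the measure is tied to actual predictive value rather than to how often a feature was split on.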

As Dr. DeCock anticipated, we do indeed see outstanding importance placed upon Ground Living Area, Lot Area, and Neighborhood; of the last, Crawford and Stone Brook in particular have a significant impact. We also see Overall Quality and Total Basement Square Footage appearing with increased importance.

As for my own hypothesis, we also see Kitchen Quality, Years Since Remodel, and Years Since Built appearing with relatively high importance alongside other assorted elements, which may be of interest in further study.

Conclusions: Reflections and Looking Forward

From this deep dive into the housing sales of Ames, Iowa, we have learned a great deal about the cautious and proper usage of unsupervised learning while also confirming our initial hypotheses. Future iterations of this project would be best served by a more bespoke approach to feature engineering, carefully curating which elements remain so that a more refined comparison can be made between linear and tree-based models. From there, it might be of academic interest to pursue further refinement through model stacking.

At the end of the day, put into practice, methodologies like these will find their greatest use in supplementing the business plans of realtors at large, enabling them to realistically achieve a greater ROI by targeting particular aspects of their offerings for improvement. By the same token, such a tool could also serve the prospective home owner, offering insight into the effective price of a given listing versus its projected value.

Considering the above range of applications, I hope that you have found this exploration and analysis illuminating. Myself, I look forward to developing this further - after all, I have yet to make a down-payment!

About Author

Theodore Cheek

Data Science & Machine Learning Engineer | A Passionate Puzzle-Solver and Pattern-Finder who enjoys translating data into clear and beautiful visualizations. Fluent with R, SQL, and Python.
