Data Analysis on Housing Prices Through PCA
The skills I demonstrate here can be learned by taking the Data Science with Machine Learning bootcamp at NYC Data Science Academy.
Introduction
Real estate, that being the market for land and housing, has long stood as one of the most influential and vast market forces in the known world since the signing of Magna Carta. It is fitting, then, that data scientists cut their teeth on its study, particularly with regard to predicting housing prices and ascertaining precisely what will affect an assessor's valuation of a given property. However, where data scientists and entrepreneurs typically studied the popular Boston Housing data set, Dr. Dean DeCock, currently a professor of Statistics at Truman State University, created the Ames, Iowa Housing data set for even more robust regression practice.
Prospective Approach
Dr. DeCock provides a great deal of analysis in his own paper, illuminating the pattern behind some outliers as well as the relative importance of given features in his results. He particularly emphasizes the significance of Ground Living Area, Neighborhood, and total Lot Area. With this foreknowledge, I hypothesized that my study would show precisely this pattern, and more besides. Drawing from my own experience in exploring housing options at large, I further supposed that we might see relatively large importance in the quality of the Kitchen as well as in the dates of a house's construction and remodeling.
While it is tempting to follow in the well-trodden footsteps that came before, I began this project with the directive of approaching my data agnostically, as a matter of best practice: that is, to approach my data in a naive fashion, graphing everything and comparing it carefully. Finally, having learned much about unsupervised learning - PCA in particular - I wished to put these methods to the test and answer the question: how would my results fare if I included PCA in the Pipeline processing?
Exploring the Data

Shortly after loading the data and performing the Train-Test split at the jump, I took the naive approach and graphed every variable available to me. With this cursory overview, I came across many of the usual suspects, i.e. junk features (e.g. Id), continuous features (e.g. LotArea), ordinal features (e.g. LandContour), categorical features (e.g. MSZoning), etc.
Considering that this is a well-renowned and well-rated Kaggle data set, which suggests that it is already fairly clean, relatively little stood out to me as particularly noteworthy at this point beyond the obvious and the few outliers that were apparent from the graphs. DeCock discusses these somewhat thoroughly in his paper; I will address two approaches to dealing with them in my own strategy.
Missingness Data

Data Findings
In this correlation graph of missingness, three groupings of missing data become immediately apparent: Basement, Garage, & Masonry. In each of the three cases, the missingness is unsurprising. After all, a house that has no garage will have no information in any of the garage categories. Likewise with Masonry and Basement. By the same logic, we need not worry about Fireplace, Pool, Fence, or Alley.
There are, however, three special cases worth further explication:
- MiscFeature: this column consists of house features that are rare but might have an impact on the pricing, such as an adjoining tennis court. Few houses will have one, but where present it will almost certainly matter. Therefore the majority of the column will likely be missing with purpose.
- Electrical: there was only one case of missingness in the Electrical column. While it would be tempting to classify this as simple error, close inspection of the house in question reveals it was a converted shed, leaving us with good reason for its missingness.
- LotFrontage: in this column alone do we see data that we can classify as missing at random. There did not seem to be any correlation between its missingness and any other particular feature (strictly, that absence of any relationship is the definition of MCAR, "Missing Completely At Random," rather than MAR). However, the concept of lot frontage - the total length of contact between a lot and the street, measured in feet - logically always has an answer, one dependent on its lot type.
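The grouping of missing columns described above can be checked numerically as well as visually. The following is a minimal sketch, assuming a pandas DataFrame; the toy values are invented for illustration and are not rows from the Ames data, though the column names follow the Kaggle file.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the training frame; values are made up for illustration.
df = pd.DataFrame({
    "GarageType":   ["Attchd", None, "Detchd", None, "Attchd", "Attchd"],
    "GarageFinish": ["Fin", None, "Unf", None, "RFn", "Fin"],
    "BsmtQual":     ["Gd", "TA", None, "Gd", None, "TA"],
    "LotFrontage":  [65.0, np.nan, 80.0, 70.0, 60.0, np.nan],
    "LotArea":      [8450, 9600, 11250, 9550, 14260, 14115],
})

# Count missing values per column, keeping only columns that have gaps.
missing_counts = df.isna().sum()
missing_counts = missing_counts[missing_counts > 0].sort_values(ascending=False)

# Correlate the 0/1 missingness indicators: columns that go missing together
# (all garage fields for a house without a garage) correlate at 1.0.
indicators = df[missing_counts.index].isna().astype(int)
missing_corr = indicators.corr()

print(missing_counts)
print(missing_corr.round(2))
```

Columns whose indicators correlate perfectly are missing for a shared structural reason, while a column such as LotFrontage that correlates with nothing is the one that needs a genuine imputation decision.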
Data Analysis & Strategy
Decisions about how to engineer and impute information in a dataset depend entirely on the outlook and analytical strategy one takes. To reiterate, my goal was to probe the usability of unsupervised learning, particularly PCA, and to compare the results across a number of classical models. These included Ridge Regression, Support Vector Regression, Random Forest Regression, and Gradient Boosting Regression. Furthermore, I wanted to contrast these two tree-based models with the performance of XGBoost's Random Forest variant and XGBoost itself. This spread would allow me to examine classic linear regression as well as tree-based performance, offering contrasting methods of dealing with both collinearity and outliers.
I chose to take a general approach in order to test reliance on PCA for feature selection, eliminating only a select few variables ahead of the actual modeling process - particularly noisy, secondary variables such as Secondary Basement Square Footage, which measures only the second type of finished area in the basement. In order to improve our predictions, I would also employ scikit-learn's GridSearchCV to optimize the modeling parameters.
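A minimal sketch of how scaling, PCA, and a model might be chained and tuned together with GridSearchCV follows. The synthetic data, the step names, and the parameter grid are all assumptions made for illustration; they are not the settings used in this project.

```python
from sklearn.datasets import make_regression
from sklearn.decomposition import PCA
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data standing in for the encoded housing features.
X, y = make_regression(n_samples=300, n_features=20, n_informative=8,
                       noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Scaling and PCA live inside the pipeline, so each CV fold refits them
# on its own training portion and nothing leaks from the validation part.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA()),
    ("model", Ridge()),
])

# Hypothetical grid: number of retained components and the ridge penalty.
param_grid = {
    "pca__n_components": [5, 10, 20],
    "model__alpha": [0.1, 1.0, 10.0],
}

search = GridSearchCV(pipe, param_grid, cv=5, scoring="r2")
search.fit(X_train, y_train)

print(search.best_params_)
print(round(search.score(X_test, y_test), 3))
```

Swapping Ridge for another regressor only requires changing the final step and the `model__` entries of the grid.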
Imputations and Modifications
With all of that said, my imputations and modifications to the data pre-modeling can be summarized as follows:
- Removed the Id, BsmtFinSF1, BsmtFinSF2, 1stFlrSF, and 2ndFlrSF columns
- Imputed all missing data, save for LotFrontage, with 0 for numeric columns and "NA" for categorical columns, prior to encoding
- Imputed LotFrontage with the median of the column, using the SimpleImputer. I considered writing a bespoke function to impute distinct medians by lot type, since the lot type should have a significant effect on the number, but I decided against it as there were only roughly 70 points of missing data here.
- Year columns altered to Years Since Built & Years Since Remodel respectively
- Encoded Categorical information using OneHotEncoder & OrdinalEncoder for the respective data types.
- Scaled the data using scikit-learn's StandardScaler.
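The steps above might be wired together roughly as follows. This is a sketch under stated assumptions rather than the project's actual code: the rows are invented, the age feature is measured against YrSold, and the choice to scale the ordinal column is illustrative. Column names follow the Kaggle file.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler

# Toy rows using Ames-style column names; the values are invented.
df = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, 80.0, 60.0],
    "LotArea":     [8450, 9600, 11250, 9550],
    "YearBuilt":   [2003, 1976, 2001, 1915],
    "YrSold":      [2008, 2007, 2008, 2006],
    "KitchenQual": ["Gd", "TA", "Gd", "Ex"],
    "MSZoning":    ["RL", "RL", "RM", "RL"],
    "GarageType":  ["Attchd", None, "Detchd", "Attchd"],
})

# Age feature; measuring against YrSold is an assumption of this sketch.
df["YearsSinceBuilt"] = df["YrSold"] - df["YearBuilt"]
df = df.drop(columns=["YearBuilt", "YrSold"])

# Structural missingness (no garage) becomes its own "NA" category.
df["GarageType"] = df["GarageType"].fillna("NA")

numeric = ["LotFrontage", "LotArea", "YearsSinceBuilt"]
ordinal = ["KitchenQual"]
nominal = ["MSZoning", "GarageType"]
quality_order = [["Po", "Fa", "TA", "Gd", "Ex"]]  # worst to best

preprocess = ColumnTransformer(
    [
        ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                          ("scale", StandardScaler())]), numeric),
        ("ord", Pipeline([("encode", OrdinalEncoder(categories=quality_order)),
                          ("scale", StandardScaler())]), ordinal),
        ("nom", OneHotEncoder(handle_unknown="ignore"), nominal),
    ],
    sparse_threshold=0,  # always return a dense array
)

X = preprocess.fit_transform(df)
print(X.shape)
```

In a real run the transformer would be fit on the training split only and then applied to the test split, so that the median and scaling statistics never see held-out data.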
Modeling Iteration, the 1st
Over the course of this processing run, I chose to compare the results of each step side by side. These were: 1) Modeling Scaled Data, 2) Modeling Scaled & PCA'd Data, and 3) GridSearch optimized, Scaled, and PCA'd Data.

The standout result from these modeling runs was the score from the GradientBoost model without PCA. Although a great many train scores came close to, or even hit, a perfect 100%, the vast majority of those models turned out to be completely overfit to the train data.
The remaining models performed more or less as expected: without PCA, Ridge and SVR performed significantly worse while the tree-based models remained strong. Pipelining PCA into the processing significantly reduced the overfitting for the non-tree models, but failed to bring their scores into a decent range. Thus, of all of the scores, the GradientBoost model maintained its lead.
Iteration, the 2nd
Before ditching PCA entirely, I wanted to investigate the issue of outliers in the data, as mentioned by Dr. DeCock. Rather than identify outliers arbitrarily, I employed Cook's Distance to flag the data to be removed. Not only was I able to identify a few dozen outliers, but I was also able to avoid eliminating data that was influential yet still useful.
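For reference, Cook's Distance for observation i in an ordinary least squares fit with p parameters is D_i = (e_i^2 / (p * s^2)) * h_ii / (1 - h_ii)^2, where e_i is the residual, s^2 the residual variance estimate, and h_ii the leverage. A minimal NumPy sketch on synthetic data follows; the planted outlier and the 4/n cutoff (a common rule of thumb) are assumptions for illustration, not the data or threshold used in this project.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
x = rng.normal(size=n)
y = 3.0 * x + rng.normal(scale=0.5, size=n)

# Plant one influential point: extreme x with a response far off the trend.
x[0], y[0] = 6.0, -10.0

X = np.column_stack([np.ones(n), x])          # design matrix with intercept
beta, *_ = np.linalg.lstsq(X, y, rcond=None)  # ordinary least squares fit
residuals = y - X @ beta

p = X.shape[1]
s2 = residuals @ residuals / (n - p)          # residual variance estimate

# Leverages: the diagonal of the hat matrix H = X (X'X)^-1 X'.
leverage = np.einsum("ij,jk,ik->i", X, np.linalg.inv(X.T @ X), X)

cooks_d = residuals**2 / (p * s2) * leverage / (1.0 - leverage) ** 2

threshold = 4.0 / n                           # rule of thumb, not a hard law
flagged = np.flatnonzero(cooks_d > threshold)
print(flagged)
```

Because the measure combines residual size with leverage, a point with a large residual but little pull on the fit is not automatically flagged, which is what allows useful extreme observations to be kept.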

The results, however, failed to yield much improvement. Both the Ridge & SVR models saw some improvement, whereas the tree-based models all saw poorer results. This pattern does make some intuitive sense: where the former model types are vulnerable to outliers, tree-based models, depending on their complexity, are less at risk in such an environment.

The Moral of the Story: the Dangers of Unsupervised Learning
The problem with our reliance on PCA is the assumption that it allows for automatic, hands-free feature engineering. PCA does not weigh the original features against the target at all: it builds new components and keeps or discards them according to the amount of variance they explain in the features, whereas we are looking to retain good predictors of Price, not variance. So while PCA may prove useful as a tool to speed up processing and to explain elements of a data set, it clearly failed to perform as an automation of feature selection.
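The gap between variance explained and predictive value can be shown with a small synthetic example. This is an illustrative sketch, not a rerun of the housing models: two nearly duplicate features dominate the variance, while a third, independent feature is the only one that drives the target.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 500

# Two nearly duplicate features share one factor, so together they
# dominate the variance even after standardization.
factor = rng.normal(size=n)
a = factor + rng.normal(scale=0.1, size=n)
b = factor + rng.normal(scale=0.1, size=n)

# A third, independent feature is the only one that drives the target.
c = rng.normal(size=n)
y = 5.0 * c + rng.normal(scale=0.1, size=n)

X = np.column_stack([a, b, c])

full = make_pipeline(StandardScaler(), LinearRegression()).fit(X, y)
reduced = make_pipeline(StandardScaler(), PCA(n_components=1),
                        LinearRegression()).fit(X, y)

ratio = reduced.named_steps["pca"].explained_variance_ratio_[0]
r2_full = full.score(X, y)
r2_pca = reduced.score(X, y)

print(f"variance kept by PC1: {ratio:.2f}")
print(f"R^2 with all features: {r2_full:.3f}")
print(f"R^2 with PC1 only:     {r2_pca:.3f}")
```

The first component retains most of the variance yet carries almost none of the signal, because the predictive feature sits in a direction PCA ranks lower.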
In order to confirm my final hypothesis, I ran my optimizations on the scaled data without PCA with great success.

Here we can see a continuation and significant improvement on the initial, default train scores: SVR nearly matched Ridge in its score, as did XGBoost with GradientBoost. GradientBoost, however, remained triumphant with our strongest score yet at 92.8%.
What About Housing Data?
With these new scores, I immediately set about ascertaining the relative importance of our housing features. Rather than reach for the classically used impurity-based Feature Importance, I employed scikit-learn's permutation importance. While the latter provides similarly explainable results, it avoids some of the bias involved in the former's calculation.
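A minimal sketch of the two measures side by side, using scikit-learn's `permutation_importance` on synthetic data in which only the first two columns carry signal. The data, model settings, and number of repeats are assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Six features, of which only columns 0 and 1 influence the target.
X = rng.normal(size=(400, 6))
y = 3.0 * X[:, 0] + 2.0 * X[:, 1] + rng.normal(scale=0.5, size=400)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = GradientBoostingRegressor(random_state=0).fit(X_train, y_train)

# Impurity-based importances are computed from the training fit alone.
impurity = model.feature_importances_

# Permutation importance: the drop in held-out R^2 when one column is shuffled.
result = permutation_importance(model, X_test, y_test,
                                n_repeats=10, random_state=0)
ranking = result.importances_mean.argsort()[::-1]

print("impurity-based:", impurity.round(3))
print("permutation:   ", result.importances_mean.round(3))
print("ranking:       ", ranking)
```

Because the permutation scores are measured on held-out data, a feature the model merely memorized during training earns little credit, which is one source of the bias the impurity-based measure can carry.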

As Dr. DeCock mentioned, we do indeed see outstanding importance placed upon the Ground Living Area, Lot Area, and Neighborhood - of the last point, we particularly see Crawford & Stonebrook having a significant impact. We also see Overall Quality and Total Basement Square Footage appearing with increased importance.
Towards my own, personal thesis, we also see Kitchen Quality, Years Since Remodeling, and Years Since Built appearing at relatively high importance alongside other assorted elements, which may be of interest in further study.
Conclusions: Reflections and Looking Forward
From this deep dive into the housing sales of Ames, Iowa, we have learned a great deal about the caution that unsupervised learning demands and about its proper usage, while also confirming our initial hypotheses. Future iterations of this project would be best served by a more bespoke approach to feature engineering, carefully curating which elements remain so that a more refined comparison can be made between the results of linear and tree-based models. From there it might be of academic interest to refine the models further through stacking.
At the end of the day, put in practice, methodologies like these will find the greatest use in supplementing the business plans of housing realtors at large, enabling them to realistically achieve a greater ROI by targeting particular aspects of their offerings for improvement. However, by the same token, such a tool could also be employed in the service of the prospective house owner, offering insights into the comparative, effective price of a given listing versus its projected value.
Considering the above range of applications, I hope that you have found this exploration and analysis illuminating. Myself, I look forward to developing this further - after all, I have yet to make a down-payment!