No Two Datasets Or Houses Are Alike: Piloting A New Method for Machine Learning Real Estate Analysis
What are the Most Important Insights from this Project?
Machine Learning is a powerful predictive tool, but one that requires a significant amount of pre-processing work, mindful model selection, and resource-intensive hyperparameter tuning. As such, it is important to have significant and targeted questions in mind prior to starting this endeavor. Thankfully, with the help of machine learning models, this housing dataset from the town of Ames, Iowa lends itself to valuable and actionable insights that can be applied to other real estate markets. Housing affordability has become an even more important issue in the 11 years since this data was compiled, and insights from this data can help point to what elements of a house are the most likely to influence its price. This type of information is critical for sellers and realtors looking for a marketing edge or deciding what to remodel before putting a house on the market, as well as for developers deciding which features to include in a new property. In addition, buyers can use machine learning-generated price prediction data to have a better sense of what a house should be priced at, helping them negotiate down overpriced houses and identify great deals on underpriced ones. Together, these insights help all players in the real estate market make well-informed decisions in order to maximize their return on investment and use their resources wisely.
How Reliable is the Data?
Anytime data is collected across a long period of time, one should consider whether changes in the environment over that period may affect its uniformity or create bias. This real estate dataset was collected during the most volatile period in the American housing market this century – the Great Recession of 2008. With sale price data from 2006-2010, it encompasses both the height of the housing market bubble (2006-2008) and the trough of the housing market crash (2008-2010). As such, I was immediately concerned that this data would show a wild price fluctuation between these market extremes on either side of 2008, as was seen in much of the country; however, prices in Ames seem to have been largely unaffected by the Great Recession. As such, no weighting by year or subdivision of this data appears necessary to build an accurate price prediction model.
The stability seen above suggests that the housing preferences and trends seen in Ames were largely immune to the state of the national housing market. This could be for a few reasons:
- It could be that subprime mortgages were not as common in a small city such as Ames as they were in larger metropolitan areas, and thus Ames was spared from the worst of the bubble bursting.
- The box plot shows that housing prices did not seem to significantly rise in the years leading to the crash, which would mean that no significant correction was required to bring housing prices down.
- The data was modified or edited to be more uniform, and may not be an accurate representation of the actual housing market.
Initial additional research showed that non-urban areas were largely unaffected by the 2008 real estate bubble, and therefore less susceptible to its collapse. While Ames is a city, it only has roughly 34,000 permanent residents, so it is on the smaller end of urban environments and its housing market may have more shared characteristics with a rural environment. As such, some combination of the first two arguments seems more likely than a flaw in the data.
While this uniformity allows us to consider all housing prices equally no matter which year, it also means that this dataset is not representative of housing prices for much of the country during the time in which it was collected. While many conclusions drawn from this project may hold true for housing sales anywhere, due to the disconnect between the Ames housing market and the majority of the US market at this time, any sale price conclusions drawn should not be assumed to be universally applicable. However, since the data was so steady across a tumultuous time, one can infer that insights into which features are most valuable for a house to have are consistent no matter the volatility of the market, and are worth considering beyond the scope of Ames, Iowa.
Why is there so much Pre-Processing?
As with so much of Data Science, in Machine Learning the majority of the work of the data scientist is spent cleaning and pre-processing the data. Data that is full of noise, redundant features, obvious outliers, and sloppy dummification is not salvageable by even the most complex machine learning model. Ultimately, the more representative the training dataset is, the more accurate a machine learning algorithm’s predictive model will be. As such, I tried to take extra care on this project to eliminate redundant features, create consolidated features that were potentially more relevant to a buyer/seller, eliminate obvious house outliers that would increase variance, and remove information that did not have sufficient representation in the data to be relevant beyond noise in the predictive model. Together, these pre-processing tasks helped improve R2 performance of an eventual model by as much as 0.15 – all before any additional transformations, machine learning model selection, or cross validation were applied to the dataset.
How to Wrangle Meaning from the Data
Data cleaning ultimately boils down to engineering the most meaningful features possible with the least possible noise. To this end, my data pre-processing focused on four components:
- Using Feature Consolidation to Eliminate Redundancy and Reduce Multicollinearity
- Building More Meaningful “Big-Picture” Features through Feature Creation & Compilation
- Filtering Noisy Classes by Frequency to Avoid Meaningless Dummified Features when preparing Categorical Data for Machine Learning Models
- Eliminating Outlier Observations using Cook’s Distance
Using Feature Consolidation to Eliminate Redundancy and Reduce Multicollinearity
One of the first areas of concern I noticed in the Ames data was the tendency to represent the same information in more than one location. Not only can this lead to unnecessary dimension expansion when dummifying identical classes among different features, it can also introduce multicollinearity by representing the same information multiple times in the dataset. This appeared in one of three ways. The first is when different features overlap in information, leading to similar or identical classes in each. ‘MSSubClass’, which identifies the type of dwelling involved in the sale, ‘HouseStyle’, and ‘BldgType’ all overlap significantly with one another. Both ‘MSSubClass’ and ‘HouseStyle’ have a class representing ‘Split Foyer’, and ‘MSSubClass’ and ‘BldgType’ both have a class representing ‘Duplex’. In order to deal with these overlaps (and many more), I created new machine learning-ready ‘ClassSFoyer’ and ‘ClassDuplex’ binary features that show the house’s information no matter where it was originally found. This allows for a clearer understanding of the data, as some homes were listed as duplexes in ‘BldgType’ but not in ‘MSSubClass’, and vice versa. With this consolidation, one knows which units are duplexes and can find them all in a single feature. This also lets one see multiple features of the house easily, as some duplexes do have a split foyer layout, but the original data is inconsistent about which feature contains each piece of information.
I also eliminated or broadened some overly detailed classes at this stage. 1-story houses got their own feature, which had to take information from several classes. ‘MSSubClass’ separated houses built before 1945 from those that came after, but I absorbed both groups into my 1-story feature, as the 1945 division felt arbitrary – there is no meaningful difference between a 1944 house and a 1946 house in construction materials or value. I additionally grabbed the ‘1Story’ class from ‘HouseStyle’ as well as the ‘1-Story with Finished Attic’ and ‘1-Story PUD (Planned Unit Development)’ classes from ‘MSSubClass’. If any of these 5 classes showed up, this new ‘Class1Str’ feature would show a 1, creating an important big-picture takeaway that any person can relate to and understand without a data dictionary.
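As a rough sketch of this kind of consolidation, the snippet below builds the three binary features from a plain Python dictionary. The class codes (90 for duplexes, 85 for split foyers, and 20, 30, 40 and 120 for the 1-story groups in ‘MSSubClass’, plus the string labels ‘Duplex’, ‘SFoyer’ and ‘1Story’) are assumptions based on the public Ames data dictionary and should be checked against your copy of the data. The function name is hypothetical and this is not the notebook's actual code.

```python
# Assumed class codes, read from the public Ames data dictionary.
DUPLEX_SUBCLASS = {90}
SFOYER_SUBCLASS = {85}
ONE_STORY_SUBCLASS = {20, 30, 40, 120}  # both 1945 groups, finished attic, PUD


def consolidate_dwelling(row):
    """Return binary flags that are 1 if ANY source feature reports the class."""
    sub, style, bldg = row["MSSubClass"], row["HouseStyle"], row["BldgType"]
    return {
        "ClassDuplex": int(sub in DUPLEX_SUBCLASS or bldg == "Duplex"),
        "ClassSFoyer": int(sub in SFOYER_SUBCLASS or style == "SFoyer"),
        "Class1Str": int(sub in ONE_STORY_SUBCLASS or style == "1Story"),
    }


# A made-up duplex with a split foyer layout.
house = {"MSSubClass": 90, "HouseStyle": "SFoyer", "BldgType": "Duplex"}
flags = consolidate_dwelling(house)
```

Because each flag is an "or" across the source features, a duplex recorded in only one of the two columns is still captured.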
The second type of data overlap I found was paired features with the same class choices, such as the material used on the exterior of the house (‘Exterior1st’ / ‘Exterior2nd’) or which conditions a house was in proximity to (‘Condition1’ / ‘Condition2’). The issue at hand was that the same class could be listed in Feature1 for one property and in Feature2 for a different property. If one simply dummified these paired features, multiple new binary features would be created conveying identical information, as each feature's identical class would be dummified separately. Furthermore, since this dataset repeats a value in the 2nd column if there is no second Condition/Exterior, one cannot assume these two totals are independent. As such, I needed to create my own dummified features by using NumPy ‘np.where’ conditional statements to show if a class common to both Feature1 and Feature2 appeared in either feature.
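The logic of that either-column test can be sketched in plain Python, used here as a dependency-free stand-in for ‘np.where’. The helper name, the ‘Ext’ prefix and the three sample rows are made up for illustration.

```python
def paired_dummies(rows, col_a, col_b, prefix):
    """Build one binary column per class, set to 1 if it is in either column."""
    classes = sorted({r[col_a] for r in rows} | {r[col_b] for r in rows})
    return [
        {f"{prefix}_{c}": int(r[col_a] == c or r[col_b] == c) for c in classes}
        for r in rows
    ]


rows = [
    # The repeated value means there is no second exterior material.
    {"Exterior1st": "VinylSd", "Exterior2nd": "VinylSd"},
    {"Exterior1st": "MetalSd", "Exterior2nd": "Wd Sdng"},
    {"Exterior1st": "Wd Sdng", "Exterior2nd": "Plywood"},
]
dummies = paired_dummies(rows, "Exterior1st", "Exterior2nd", "Ext")
```

With NumPy arrays `a` and `b` holding the two columns, the same test for a class `c` could be written as `np.where((a == c) | (b == c), 1, 0)`. Either way, ‘Wd Sdng’ ends up in a single column whether it was recorded first or second.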
Finally, some features overlapped by having both categorical and numerical representations. Masonry veneer is shown both as a type in a qualitative feature (‘MasVnrType’) and in a quantitative square foot feature (‘MasVnrArea’). Merely dummifying the masonry veneer type would lose the paired masonry veneer square footage information, so after dummification, I replaced any 1’s found in each dummified masonry veneer type feature with that observation’s masonry veneer area. This type of feature engineering helped preserve and match the masonry veneer area so that the newly created dummified features are richer, showing distinction between the different units’ masonry. Similarly, I was able to separate the information showing a shed was present on the property from ‘MiscFeature’ and assign it the corresponding amount recorded in the ‘MiscVal’ feature (which the Ames data dictionary defines as the dollar value of the miscellaneous feature), creating a usable and interpretable ‘Shed’ feature.
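A minimal sketch of this "dummy filled with area" idea follows. It assumes that a veneer type of ‘None’ (or a missing value) means no veneer; the helper name and sample rows are illustrative only.

```python
def area_weighted_dummies(rows, type_col, area_col, skip=("None", None)):
    """Dummify type_col, but store the paired area instead of a 1."""
    classes = sorted({r[type_col] for r in rows if r[type_col] not in skip})
    return [
        {f"{type_col}_{c}": (r[area_col] if r[type_col] == c else 0)
         for c in classes}
        for r in rows
    ]


rows = [
    {"MasVnrType": "BrkFace", "MasVnrArea": 196},
    {"MasVnrType": "None", "MasVnrArea": 0},
    {"MasVnrType": "Stone", "MasVnrArea": 240},
]
veneer = area_weighted_dummies(rows, "MasVnrType", "MasVnrArea")
```

Each row keeps a 0 in every veneer column except the one matching its type, which now carries the square footage rather than a bare indicator.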
Building More Meaningful “Big-Picture” Features through Feature Creation & Compilation
Knowing the number of half-baths in a house’s basement is fine, but it is not a standard ‘feature’ one finds on any Zillow listing. More important to a potential homeowner is the total number of bathrooms (half and full) in a house – a feature I was able to create by combining 4 smaller existing features. Similarly, the finished square footage of a basement is helpful, but without context, it does not have meaning. Does a finished square footage of ‘0’ mean the basement is entirely unfinished, or that there is no basement in this house at all? By calculating the percentage of a basement that is finished, buyers and sellers have a marketable number that instantly makes sense. Creating big-picture features such as the ‘Age Since Renovation’ or the ‘Total Finished Living Area’ gives buyers and sellers more compressed features from which to determine the sale value of a house – and thus, more useful features from which to build our model.
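The following sketch shows how such compiled features might be computed for a single house. The column names come from the public Ames dataset; counting a half-bath as 0.5, returning `None` when there is no basement, and measuring renovation age from the sale year are assumptions made for this example, not necessarily the choices made in the project.

```python
def big_picture_features(house):
    """Compile several raw columns into buyer-friendly summary features."""
    total_baths = (
        house["FullBath"] + house["BsmtFullBath"]
        + 0.5 * (house["HalfBath"] + house["BsmtHalfBath"])  # assumed weighting
    )
    total_bsmt = house["TotalBsmtSF"]
    if total_bsmt == 0:
        pct_finished = None  # no basement at all, not "0% finished"
    else:
        pct_finished = 100 * (total_bsmt - house["BsmtUnfSF"]) / total_bsmt
    return {
        "TotalBaths": total_baths,
        "BsmtPctFinished": pct_finished,
        "AgeSinceRemod": house["YrSold"] - house["YearRemodAdd"],
    }


# Made-up example house.
house = {
    "FullBath": 2, "HalfBath": 1, "BsmtFullBath": 1, "BsmtHalfBath": 0,
    "TotalBsmtSF": 1000, "BsmtUnfSF": 250, "YrSold": 2008, "YearRemodAdd": 2003,
}
features = big_picture_features(house)
```

Keeping "no basement" distinct from "unfinished basement" is what resolves the ambiguity of a raw finished square footage of 0.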
Filtering Noisy Classes by Frequency to Avoid Meaningless Dummified Features when preparing Categorical Data for Machine Learning Models
As most machine learning models require numerical data exclusively, dummification has become the standard go-to for transforming nominal categorical features. While dummification excels at preserving categorical data by assigning each class a new binary feature, it also can create a lot of noisy and/or unnecessary features that cloud a model’s efficacy. A quick glance at the ‘Heating’ feature from our data can show how unfiltered dummification can lead to unnecessary dimension expansion.
If each of these classes became a new feature, all 6 would likely contribute more noise than insights to a predictive machine learning model. In a training dataset with 1460 observations, one is unlikely to draw a reliable conclusion from only 18 or fewer observations – let alone 1 or 2. Creating a new dummified feature ‘Heating_Floor’ will at best create noise, and at worst bias the model for what might be a one-off occurrence. Uniformity can be just as noisy as scarcity. With 1428 out of 1460 observations using Gas from an air furnace (GasA), only 32 houses are using alternate heating sources – also a very low number from which to build model assumptions.
As it is rare to draw meaningful conclusions from a tiny number of observations (or a lack of alternative observations), I created a frequency filter that automatically replaces a class with too few or too many observations with null values. As the dummification step skips null values (as pandas’ ‘get_dummies’ does by default), this process ensures that these classes are not dummified. In order to avoid subjective bias in which frequency cutoff to use, I decided to filter classes using the square root of the total number of training set observations (√1460 ≈ 38), leaving only classes with 38 to 1422 observations.
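A minimal sketch of such a frequency filter, written in plain Python, is shown below. The function name is hypothetical, and the ‘A’/‘B’/‘C’ column is made up. The second example uses the 1428 GasA houses from the text and lumps the remaining 32 into a single ‘Other’ class purely for illustration.

```python
from collections import Counter
from math import isqrt


def frequency_filter(values):
    """Replace classes that are too rare or too dominant with None."""
    n = len(values)
    low = isqrt(n)      # floor of the square root: 38 when n == 1460
    high = n - low      # 1422 when n == 1460
    counts = Counter(v for v in values if v is not None)
    return [
        v if v is not None and low <= counts[v] <= high else None
        for v in values
    ]


# Made-up column: only the 30 'C' houses fall below the cutoff.
column = ["A"] * 1000 + ["B"] * 430 + ["C"] * 30
filtered = frequency_filter(column)

# A column dominated by one class is nulled out entirely.
heating = ["GasA"] * 1428 + ["Other"] * 32
heating_filtered = frequency_filter(heating)
```

In the second example both classes are removed: 1428 exceeds the upper bound of 1422 and 32 falls below the lower bound of 38, so the feature would contribute no dummy columns at all.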
Of course, there is room for nuance if one has the time. Six out of the seven houses with a pool are above the median sale price, but is that data worth adding to a model? If another house goes up for sale with a pool, there is no guarantee that it will also fetch a higher price – an older pool in disrepair can actually detract from a sale price. With such a small sample size, there is a greater chance of a model making faulty assumptions due to limited data, which is a risk I decided was not worth taking.
However, consolidation can save some classes that would have otherwise been nullified by the frequency filter. For GarageType, it would appear at first glance that basement garages (‘Basment’) would not be kept as a feature for dummification, since there are fewer than 38 observations. However, basement garages are just a below-grade version of built-in garages (‘BuiltIn’), and thus can be consolidated into the built-in class.
Similarly, consolidating the three Electrical fuse types into a new class ‘Fuse’ can create a more robust feature and avoid eliminating two classes. This is similar in philosophy to the ‘big-picture’ feature engineering I described earlier in this blog, and helps keep meaningful data in the machine learning model.
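Class consolidation of this kind can be expressed as a simple relabeling step that runs before any frequency filtering. In the sketch below, the fuse labels ‘FuseA’, ‘FuseF’ and ‘FuseP’ are assumed from the public Ames data dictionary, and the `MERGE` table and function name are hypothetical.

```python
# Hypothetical merge table: original class -> consolidated class.
MERGE = {
    "GarageType": {"Basment": "BuiltIn"},
    "Electrical": {"FuseA": "Fuse", "FuseF": "Fuse", "FuseP": "Fuse"},
}


def consolidate(column, values):
    """Relabel classes so that related rare classes share one label."""
    mapping = MERGE.get(column, {})
    return [mapping.get(v, v) for v in values]


electrical = consolidate("Electrical", ["SBrkr", "FuseA", "FuseP", "FuseF"])
garages = consolidate("GarageType", ["Attchd", "Basment", "BuiltIn"])
```

After relabeling, the merged class pools the observations of its members, which gives it a better chance of clearing the frequency cutoff.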
After using the frequency filter on the nominal categorical features and creating dummified features, I repeated the same process for ordinal categorical features. However, ordinal features often have intra-relational and contextual information among their classes that can make a numerical encoding approach worth considering. When using an ordinal encoder, the frequency filter is not applied, as classes are all contained within the original variable, just relabeled as numbers. In this case, I tried it both ways in order to test which would work the best for this dataset.
Eliminating Outlier Observations using Cook’s Distance
By this point, I had finished feature engineering and my data was machine-learning ready. However, while I had curated features and classes, I had not culled or modified observations beyond imputing median values for null values in my quantitative features. My next step was to trim outliers that could have an outsized impact on a machine learning model – especially given this dataset’s relatively small sample size. However, with over 80 features, simply looking for sale price outliers using a Gaussian distribution would not capture the larger picture – a house can be very expensive and justified in its expense by its features. The answer is to use a method that checks whether the house price is an outlier relative to the model. I decided to use Cook’s Distance on a multiple linear regression model in order to gauge which observations were outliers.
Cook’s Distance (formula shown to the right) compares an observation’s residual values with its leverage in changing the linear prediction of the model. When both residuals and leverage are high, then the observation will have a high Cook’s Distance, showing that observation has an exaggerated influence on the model.
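For reference, the standard textbook form of Cook’s Distance for observation i in a linear regression is given below. The notation is an assumption and is not taken from the original figure: e_i is the residual, h_ii is the leverage (the i-th diagonal entry of the hat matrix), p is the number of fitted coefficients, and s² is the mean squared error of the fit.

```latex
D_i = \frac{e_i^{2}}{p\, s^{2}} \cdot \frac{h_{ii}}{\left(1 - h_{ii}\right)^{2}},
\qquad
s^{2} = \frac{1}{n - p} \sum_{j=1}^{n} e_j^{2}
```

The first factor grows with the size of the residual and the second with the leverage, so D_i is large only when both are substantial.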
A typical threshold is to filter out any observation whose Cook’s Distance is above 4/n (with n being the number of observations). This can eliminate egregious outliers as seen in the Residual plots below, reduce the range of residuals, create more uniform distributions, and improve R2 significantly.
We can also see how much more uniform the outlier influence is after applying a Cook’s Distance filter in the Outlier Detection plots. This process ultimately eliminates 80 observations from our original 1460. While this seems like a lot of trimming on an already small dataset, the improvement of R2 seems to justify the sample size decrease.
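To make the 4/n rule concrete, here is a simplified sketch using a single predictor, where leverage has a closed form. This is an assumption made to keep the example dependency-free: the project itself fit a multiple linear regression, where leverage comes from the hat matrix and a library such as statsmodels (via the `cooks_distance` attribute of a fitted model's influence object) would normally do the work. The data are made up.

```python
def cooks_distance(x, y):
    """Cook's Distance for a one-predictor least-squares fit with intercept."""
    n, p = len(x), 2  # p counts the intercept and the slope
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    intercept = y_bar - slope * x_bar
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    mse = sum(e * e for e in residuals) / (n - p)
    leverage = [1 / n + (xi - x_bar) ** 2 / sxx for xi in x]
    return [
        (e * e / (p * mse)) * (h / (1 - h) ** 2)
        for e, h in zip(residuals, leverage)
    ]


def keep_indices(x, y):
    """Indices of observations whose Cook's Distance is at most 4/n."""
    cutoff = 4 / len(x)
    return [i for i, d in enumerate(cooks_distance(x, y)) if d <= cutoff]


# Made-up data: price follows 2 * area except for one wildly overpriced house.
area = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
price = [2, 4, 6, 8, 10, 12, 14, 16, 18, 60]
kept = keep_indices(area, price)
```

Only the last observation, which combines a large residual with high leverage, exceeds the cutoff of 0.4 and is dropped. One would then refit the model on the remaining rows.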
HyperTransformation Tuning: A New Approach for Machine Learning Model Optimization
Even with a smaller dataset, non-linear machine learning algorithms can take a long time to run on even a moderately equipped laptop. HyperParameter Tuning only exacerbates this issue, with GridSearchCV often taking over 10 hours to test several permutations of hyperparameters - and often, one needs to use GridSearchCV more than once. Furthermore, there is little intuition in HyperParameter Tuning that allows one to guess which hyperparameters will work for a given dataset, so every tuning is largely guesswork. In addition, if one is testing multiple models, hyperparameter insights gained in one machine learning model do not transfer to another model. As such, I wanted to find another approach that could get similar results without waiting hours for marginal improvements.
HyperTransformation Tuning is my answer to this issue. Instead of tuning a model using combinations of hyperparameters, one tunes a dataset using combinations of transformations. Most machine learning models are optimized to run most efficiently with their default settings, and testing variations of the dataset against a default model effectively tunes the data to the default model hyperparameters. Instead of trying 3 values for the max_depth hyperparameter of a random forest, one would try making datasets using 3 different scaling transformations. Unlike hyperparameters, dataset transformations can be used for multiple machine learning models, and can lead to universal insights - for instance, all datasets performed better across 10 machine learning models when ordinal features were dummified rather than ordinally encoded and when Cook's Distance was used to filter outliers. While the number of transformations to test can easily multiply just as quickly as in a GridSearchCV, one can prune along the way to keep the number of variations manageable, and even hundreds of transformations rarely take even 1/10th of the time that HyperParameter Tuning does. Even with 10 Regression machine learning models being tested - including Linear Models, Optimization Algorithms, Tree-Based Methods, Support Vector Machine, and Boosting Models - my entire Jupyter notebook takes about 3 hours to run on my laptop, and the results were impressive.
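The core loop of this idea can be sketched as a grid over dataset transformations scored against one untuned model. Everything named below (`SCALERS`, `TARGETS`, `hypertransformation_search`) is hypothetical, and the "default model" is a one-feature least-squares line chosen only so the example runs with the standard library; the project itself used 10 regressors on the full feature set. Scaling does not change a least-squares fit, so it appears here only to show how the grid is built.

```python
from itertools import product
from math import exp, log
from statistics import mean, pstdev


def fit_line(x, y):
    """Stand-in for a default-settings model: one-feature least squares."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((a - x_bar) ** 2 for a in x)
    slope = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y)) / sxx
    return slope, y_bar - slope * x_bar


def r2(y_true, y_pred):
    y_bar = mean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - y_bar) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


def no_scaling(x, reference):
    return list(x)


def standard_scaling(x, reference):
    # Centre and scale using statistics of the training data only.
    mu, sigma = mean(reference), pstdev(reference)
    return [(v - mu) / sigma for v in x]


# Each option maps an illustrative label to a transformation.
SCALERS = {"none": no_scaling, "standard": standard_scaling}
TARGETS = {  # (forward transform, inverse transform) for the target
    "raw": (lambda y: list(y), lambda y: list(y)),
    "log": (lambda y: [log(v) for v in y], lambda y: [exp(v) for v in y]),
}


def hypertransformation_search(x_train, y_train, x_test, y_test):
    """Score every transformation combination against the same untuned model."""
    results = []
    for scaler_name, target_name in product(SCALERS, TARGETS):
        scale = SCALERS[scaler_name]
        forward, inverse = TARGETS[target_name]
        slope, intercept = fit_line(scale(x_train, x_train), forward(y_train))
        preds = inverse([intercept + slope * v for v in scale(x_test, x_train)])
        config = {"scaler": scaler_name, "target": target_name}
        results.append((r2(y_test, preds), config))
    return sorted(results, key=lambda r: r[0], reverse=True)


# Toy data with an exponential price curve, so the log target should win.
x_train = [1, 2, 3, 4, 5, 6]
y_train = [exp(0.5 * v) for v in x_train]
x_test = [2.5, 4.5, 5.5]
y_test = [exp(0.5 * v) for v in x_test]
ranked = hypertransformation_search(x_train, y_train, x_test, y_test)
```

The same loop extends naturally to more transformation families (outlier filtering, encoding choice, feature-drop lists) and to several models, with pruning of poorly scoring branches to keep the grid small.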
As one can see, different machine learning models like different transformations - some had no scaling, some StandardScaler, some RobustScaler. Log transformations of the target data were found in 4 models, and dimension-reduction lists based on feature or permutation importance of different models were found in 9/10 best datasets. By testing these variations, I was able to get 10 models with Testing R2 scores above 0.923. With a 0.082 mean R2 test score improvement found from this method, one can optimize models to a significant degree using HyperTransformation Tuning alone.
Finally, by having multiple tuned machine learning models ready to go, one can compare models to find the best performer. CatBoost, to little surprise, comes out on top, but Ridge Multiple Linear Regression was an unexpected close second, and has the benefit of being a fully interpretable model (no black box). However, there is more one can do with these models, including stacking methods and/or ensembling. By taking the 7 best models and averaging their price predictions, I got a better RMSE (Root Mean Squared Error) than any one of the 7 models individually. While all 7 models performed well, using Ensembling to gain a consensus for price predictions yielded an outcome closest to the true prices. In this case, the whole can be greater than the sum of its parts. Without any HyperParameter Tuning, my HyperTransformation Tuning method combined with Ensembling was able to get a result in the top 15% of scores.
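A simple averaging ensemble of this kind can be sketched as follows. The two models and their predictions are made-up numbers, chosen so that their errors partly cancel; averaging is not guaranteed to beat every individual model in general.

```python
from math import sqrt


def rmse(y_true, y_pred):
    """Root Mean Squared Error between true and predicted prices."""
    return sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def average_ensemble(predictions):
    """Average the per-house predictions of several models."""
    return [sum(col) / len(col) for col in zip(*predictions.values())]


# Made-up prices (in thousands) and predictions from two hypothetical models.
y_true = [100.0, 200.0, 300.0]
predictions = {
    "model_a": [110.0, 190.0, 320.0],
    "model_b": [95.0, 215.0, 290.0],
}
ensemble = average_ensemble(predictions)
scores = {name: rmse(y_true, p) for name, p in predictions.items()}
scores["ensemble"] = rmse(y_true, ensemble)
```

Here the ensemble's error is lower than either model's because one model tends to overshoot where the other undershoots, which is the consensus effect described above.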
Which Features are Important?
Using Partial Dependence Plots, one can begin to see which features are standing out as important to a home's sale price. Living Area and Quality are the most important features overall to this model, but Kitchen and Garage quality make an interesting showing. These insights can help one decide where to invest in home renovation before selling, or in what type of housing to build in this market. In addition, as this project utilized several different machine learning models, I wanted to look at more than just one model’s idea of which features were important. By taking Lasso’s lambda influence, Random Forest’s Tree-based feature importance, XGBoost’s boosting-based feature importance, Ridge’s Permutation Importance, and an average importance score, I felt I could better observe if there were common features of note across each of these models. Beyond seeking consensus, it also felt important to use more than one model, as multicollinearity can lead to one model prioritizing ‘GarageCars’ and another prioritizing ‘GarageArea’. Ultimately, both show us that garage space is the common denominator that impacts the price of a house, which could be missed when only looking at a single model’s feature importance. Interestingly, in our Partial Dependence Plot, we can see almost no measurable distinction between houses with garages unless they can fit at least 2 cars. As seen earlier, I used these lists in my HyperTransformation Tuning to drop non-important features for dimension reduction as well, and some lists that led to the best scores were generated from a completely different machine learning model.
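One way to build such an average importance score is to rescale each model's importances to shares that sum to 1 and then average the shares per feature. The sketch below assumes that normalization, which may differ from the one used in the project; the importance values are made up, while the feature names come from the Ames dataset.

```python
def average_importance(importances):
    """Average each feature's share of total importance across models."""
    features = sorted({f for scores in importances.values() for f in scores})
    averaged = {}
    for feature in features:
        shares = []
        for scores in importances.values():
            total = sum(abs(v) for v in scores.values())
            shares.append(abs(scores.get(feature, 0.0)) / total)
        averaged[feature] = sum(shares) / len(shares)
    return sorted(averaged.items(), key=lambda kv: kv[1], reverse=True)


# Made-up importances: each model credits a different garage feature.
importances = {
    "lasso": {"GrLivArea": 0.6, "GarageCars": 0.4, "GarageArea": 0.0},
    "forest": {"GrLivArea": 0.5, "GarageCars": 0.0, "GarageArea": 0.5},
}
ranking = average_importance(importances)
```

In this toy case each model ignores one garage feature entirely, yet both garage features keep a non-zero average score, which is how the consensus view surfaces garage space as a shared signal.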
Here we can start to identify important features that one could tell a realtor or developer to focus on in a home. Total Finished Living Area seems near the top of importance, with all 4 models having it in their top 10. Having an Excellent Basement is also regarded as a strong indicator of a house's price, regardless of whether it is finished or unfinished, as those features were much lower on the list. Having multiple perspectives can help weed out features that one model might favor but others find less critical, such as total number of rooms above grade (not in a basement), which only Ridge's Permutation Importance found particularly important. As machine learning models can often be a black box, it is helpful to get consensus from several black boxes to gain confidence in which features a model deems important, before one chooses to remodel a basement or expand their finished living area.
Ultimately, this project aimed to develop comprehensive feature selection and observation filtering mechanisms, new machine learning approaches to optimize computational expense, and methods of consensus-building to help a buyer or seller of a home have confidence in our models' assigned feature importance. By doing so, I hope to have opened some doors to tools and ideas that will extend beyond this project as I continue in my machine learning career.