The Notebook and the Black Box: ML in Ames
In machine learning, there is always a tradeoff between a model's interpretability and its predictive power. The most accurate models tend to be hard to explain, and the most explainable models usually aren't as accurate. I want to push at both ends of this tradeoff: make my interpretable models more accurate, and make my powerful models more explainable.
While studying the Ames Housing dataset from Kaggle, I imagined myself as a realtor setting up a business in Ames, Iowa. I decided I would want two models for predicting prices. The first would be highly interpretable: a model that sacrificed some accuracy so that an intern could work it out on paper. The second would be a 'proprietary model' back at headquarters that was more accurate, if less explainable. You can explore my code for this project here.
Data Preparation
Missing Data
Of the 2,580 rows and 81 columns in the dataset, 11 numeric and 16 categorical columns had missing data. Most of it was missing at random, meaning the missingness could be explained by some other aspect of the dataset we know. For instance, the number of basement bathrooms was NaN in cases where the home didn't have a basement. The remaining missing values required a combination of mode and mean imputation.
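Here's a minimal sketch of what that cleaning looks like in pandas. The file path and the specific columns shown are illustrative, not my full cleaning script.

```python
import pandas as pd

ames = pd.read_csv("Ames_Housing_Price_Data.csv")  # path is an assumption

# "Missing" values that really mean "no basement / no garage"
ames["BsmtFullBath"] = ames["BsmtFullBath"].fillna(0)
ames["BsmtHalfBath"] = ames["BsmtHalfBath"].fillna(0)
ames["GarageType"] = ames["GarageType"].fillna("None")

# Genuinely missing values: mode for categoricals, mean for numerics
ames["Electrical"] = ames["Electrical"].fillna(ames["Electrical"].mode()[0])
ames["LotFrontage"] = ames["LotFrontage"].fillna(ames["LotFrontage"].mean())
```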
Feature Engineering
For feature engineering, I drew on some of De Cock's suggestions along with a few ideas of my own.
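A few of the engineered features referenced later (TotalBath, HQSF, FireplaceYN, RoomsxBath) could be built roughly like this. The exact formulas here, especially for HQSF and RoomsxBath, are best-guess reconstructions for illustration rather than the precise definitions in my code.

```python
# Best-guess reconstructions of the engineered features used below
ames["TotalBath"] = (ames["FullBath"] + 0.5 * ames["HalfBath"]
                     + ames["BsmtFullBath"] + 0.5 * ames["BsmtHalfBath"])
ames["HQSF"] = ames["GrLivArea"] + ames["BsmtFinSF1"]       # "high quality" square footage
ames["FireplaceYN"] = (ames["Fireplaces"] > 0).astype(int)  # fireplace as a 0/1 switch
ames["RoomsxBath"] = ames["TotRmsAbvGrd"] * ames["TotalBath"]  # rooms-bathrooms interaction
```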
Assess Multicollinearity
While assessing multicollinearity, we can see that some variables are highly correlated, like Garage Cars and Garage Square Footage. Others, thanks to feature engineering that combines multiple variables, will be collinear with the originals. We'll need to be sure to explore models, such as tree-based methods, that handle multicollinearity well.
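A quick way to surface the strongest pairwise correlations among the numeric columns, continuing from the DataFrame above:

```python
import numpy as np

# Absolute pairwise correlations, keeping only the upper triangle so each
# pair appears once, then sorting to surface the strongest relationships
corr = ames.select_dtypes("number").corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
print(upper.stack().sort_values(ascending=False).head(10))
```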
Approach to Outliers
For the most part, my approach to outliers was to leave them in. Several models I explored are robust to outliers, so I wanted to keep them and see how performance changed. One clear truth from studying this dataset is that as a house's size goes up, its price becomes harder to gauge. Nevertheless, the real estate market is the game we're playing, so I mostly opted to leave outliers in. The one exception is the multiple linear regression model, which I'll go into detail about later.
Preprocessing
For preprocessing, I used several functions from scikit-learn, including OneHotEncoder, OrdinalEncoder, and StandardScaler. I combined these preprocessing steps with ColumnTransformer and made that transformer the first step of a Pipeline to begin modeling.
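In outline, the pipeline looks like this; the column groupings shown are just a representative slice of the full lists.

```python
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler

# Illustrative column groupings (the real lists are much longer)
numeric_cols = ["HQSF", "TotalBath", "GarageCars"]
nominal_cols = ["Neighborhood", "MSZoning"]
ordinal_cols = ["ExterQual"]
quality_scale = [["Po", "Fa", "TA", "Gd", "Ex"]]  # worst to best

preprocessor = ColumnTransformer([
    ("num", StandardScaler(), numeric_cols),
    ("nom", OneHotEncoder(handle_unknown="ignore"), nominal_cols),
    ("ord", OrdinalEncoder(categories=quality_scale), ordinal_cols),
])

model = Pipeline([
    ("prep", preprocessor),
    ("reg", LinearRegression()),  # swapped out for other estimators below
])
```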
Modeling
Simple Linear Regression
Simple linear regression is a good starting point for my intern's notebook model because of its interpretability. Take, for instance, this model from a train-test split: pricing high quality square footage at a rate of $77.22/sq ft explains about 70% of the variation in house prices in Ames (an R2 of roughly .7).
Running 5-fold cross-validation on each variable individually (a sketch of that screening step is below) gives a good idea of the top performers. While some single variables score surprisingly well, no one variable is enough for even a basic model, so my next step was to move on to multiple linear regression.
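The screening step looks roughly like this, reusing the cleaned DataFrame from earlier:

```python
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

X, y = ames.drop(columns="SalePrice"), ames["SalePrice"]

# Score each numeric column on its own with 5-fold CV to find top performers
scores = {}
for col in X.select_dtypes("number").columns:
    scores[col] = cross_val_score(LinearRegression(), X[[col]], y,
                                  cv=5, scoring="r2").mean()

print(sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:5])
```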
Multiple Linear Regression: SequentialFeatureSelector and Overall Quality
Using SequentialFeatureSelector didn't go as smoothly as I had hoped. I struggled with the prevalence of the variable 'OverallQual', a 1-10 scale that De Cock describes as the "overall material and finish of the house". While it is a powerful descriptor for the dataset as it stands, it's difficult to replicate. I can't have my intern ask a client to rate their own home's quality on a scale of 1-10 and have that be a huge driver of the estimate, and I can't calculate it myself either. Creating an interpretable MLR, rather than a strictly accurate one, was going to take some more experimentation.
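For reference, the selection setup looks roughly like this; restricting it to the numeric columns and to six features is a simplification of the full run.

```python
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

# Forward selection of six predictors, scored by 5-fold CV R2.
# In runs like this, OverallQual kept surfacing near the top.
sfs = SequentialFeatureSelector(
    LinearRegression(),
    n_features_to_select=6,
    direction="forward",
    scoring="r2",
    cv=5,
)
X_num = ames.select_dtypes("number").drop(columns="SalePrice")
sfs.fit(X_num, ames["SalePrice"])
print(X_num.columns[sfs.get_support()])
```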
Multiple Linear Regression: More Experimentation
One successful experiment involved using SequentialFeatureSelector to predict OverallQual instead of SalePrice. After committing to high quality square footage and neighborhood, I fed a few of the variables that make up OverallQual into a function I wrote called my_mlr. I also limited the data here to sales where the sale condition was 'Normal'. I'm looking for simplicity and economy of variables, but also interpretability, and striking that balance is subjective. My proposed_notebook_list is a solution that biases toward straightforward answers and math my intern can do in a notebook with a customer.
Multiple Linear Regression- "The Notebook Model"
Here's my proposed model after performing 5-fold cross-validation and averaging the coefficients:
SalePrice = -2857 + Neighborhood + 51.4 × HQSF + 10984.98 × TotalBath - 7668.78 × BedroomAbvGr + 11576.04 × FireplaceYN(0/1) + 11480.6 × GarageCars
This formula estimates sale price with an R2 of about .83. With all other variables held equal, a one-unit change in a variable changes the estimate by that variable's coefficient. For instance, expanding a garage from one car to two raises the estimate by about 11,000 dollars. The negative coefficient for Bedrooms Above Grade may seem odd, but remember that it reflects the estimate with all other factors held equal: adding bedrooms to a house without a corresponding rise in, say, bathrooms leaves the house more cramped or out of proportion.
Instead of listing all the neighborhoods as Boolean switches here, the idea is that the intern would plug in a number from a separate neighborhood list, effectively putting a 1 in front of that coefficient and a 0 in front of the others. By dropping 'Meadow View', the neighborhood with the lowest average sale price, the other neighborhoods' coefficients give a general sense of how their average sale prices compare.
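Written out as code, the notebook formula is just a handful of additions. The neighborhood offset is passed in as a number the intern looks up in that separate table, since listing the fitted neighborhood coefficients here would be guesswork.

```python
def notebook_price(neighborhood_offset, hqsf, total_bath, bedrooms,
                   has_fireplace, garage_cars):
    """Estimate sale price from the notebook formula.

    neighborhood_offset is the value looked up in the separate neighborhood
    table (0 for the dropped baseline, Meadow View).
    """
    return (-2857
            + neighborhood_offset
            + 51.4 * hqsf
            + 10984.98 * total_bath
            - 7668.78 * bedrooms
            + 11576.04 * int(has_fireplace)
            + 11480.6 * garage_cars)

# e.g. a 1,500 HQSF baseline-neighborhood home with 2 baths, 3 bedrooms,
# a fireplace, and a 2-car garage:
# notebook_price(0, 1500, 2, 3, True, 2)  -> about 107,744
```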
Taking the log of sale prices would make them more normally distributed, and thus better suited to linear regression. But raising the R2 by a point, to .84, would come at the cost of the explainability of our coefficients, which would then describe changes in log price rather than dollars.
The RMSE for this formula is almost 30,000 dollars, which is not great, and the residuals show some departure from constant variance, particularly as prices and sizes go up (the sketch below shows how those out-of-fold numbers are computed). But the model is highly interpretable, decently accurate, and a good benchmark to compare against while exploring other models. From here, I moved on to models that are less directly interpretable but hopefully more accurate.
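The pipeline here is a simplified stand-in for the full one, using the engineered feature names sketched earlier:

```python
import numpy as np
from sklearn.compose import make_column_transformer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_predict
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Out-of-fold predictions for the notebook feature set, then RMSE and residuals
features = ["Neighborhood", "HQSF", "TotalBath", "BedroomAbvGr",
            "FireplaceYN", "GarageCars"]
notebook_pipe = make_pipeline(
    make_column_transformer(
        (OneHotEncoder(handle_unknown="ignore"), ["Neighborhood"]),
        remainder="passthrough",
    ),
    LinearRegression(),
)
preds = cross_val_predict(notebook_pipe, ames[features], ames["SalePrice"], cv=5)
residuals = ames["SalePrice"] - preds
print(f"RMSE: {np.sqrt(np.mean(residuals ** 2)):,.0f}")
# Plotting residuals against predictions is where the fanning-out at higher
# prices shows up.
```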
Dropping Quality
As an experiment, I dropped 'Overall Quality' from my assessments throughout. This decision cost me about 1 percentage point of R2 across the board, a tradeoff that may seem silly to some of you. But I feel a lot more confident that my models are replicable, and I hope the variables that otherwise make up this feature will surface in my evaluations, making my models more immediately interpretable.
Penalized Linear Regression
Penalized linear regression (PLR) requires scaling the numeric data and tuning a penalty term called alpha. The penalty gradually shrinks the coefficients that contribute least to the overall result, with Lasso pushing them all the way to zero. We lose some of the clean explainability of the coefficients, but get a more accurate model that is robust to outliers. Lasso edged out Ridge and ElasticNet in my tuning process. The l1_ratio of .99 that ElasticNet settled on, effectively saying "make a model that's 99% Lasso and 1% Ridge", confirms that Lasso is the top choice.
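The tuning loop looks roughly like this, reusing the preprocessor and X, y from the sketches above; the alpha grids are illustrative rather than the exact ranges I searched.

```python
from sklearn.linear_model import ElasticNet, Lasso
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Lasso: tune the strength of the penalty via cross-validated R2
lasso_pipe = Pipeline([("prep", preprocessor), ("reg", Lasso(max_iter=10_000))])
lasso_search = GridSearchCV(lasso_pipe, {"reg__alpha": [1, 10, 100, 500, 1000]},
                            cv=5, scoring="r2")
lasso_search.fit(X, y)

# ElasticNet: letting the model choose its own L1/L2 mix is a quick sanity
# check; an l1_ratio near 1 says the data prefers Lasso-style shrinkage
enet_pipe = Pipeline([("prep", preprocessor), ("reg", ElasticNet(max_iter=10_000))])
enet_search = GridSearchCV(enet_pipe,
                           {"reg__alpha": [1, 10, 100],
                            "reg__l1_ratio": [0.5, 0.9, 0.99]},
                           cv=5, scoring="r2")
enet_search.fit(X, y)
print(lasso_search.best_score_, enet_search.best_params_)
```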
Support Vector Regression
Support vector regression requires scaling, dummifying, and ordinal encoding. While it certainly fits the bill for being hard to explain, it didn't outperform PLR (which stood at an R2 of .9097) and required a lot of tuning, so it didn't seem worth the work. With PLR earmarked as the most accurate model so far, I quickly moved on to tree-based models.
Tree-Based Models: Random Forest
Tree-based models are built on the concept of a decision tree, where you filter the data through a series of splits. At each split, one column's value is checked against an inequality, with the goal of 'purity' at the endpoints: homes with similar prices should end up in the same leaves at the tree's end. Tuning hyperparameters involves experimenting with the limits, sizes, and number of trees in your model. While an individual tree doesn't generalize well to new data, averaging the predictions of many trees (particularly trees grown on random subsets of both the rows and the features you split on) often does. This is the premise of the first tree-based model I explored, Random Forest.
While tuning hyperparameters throughout this process, I used both GridSearchCV from sklearn and an automatic tuning framework called Optuna (a sketch of the Optuna setup is below). After tuning, I was surprised at how poorly the Random Forest performed. I think the explanation goes back to foundational EDA: from the beginning we could see that linear models performed well on this data. While Random Forest is powerful, it will never beat a linear model at capturing a relationship between variables that is, at its core, linear.
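An Optuna objective for the Random Forest might look like this, again reusing the preprocessor and X, y from earlier; the search ranges are illustrative rather than the exact ones from my study.

```python
import optuna
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

def objective(trial):
    # Each trial proposes a hyperparameter combination and is scored by CV R2
    rf = RandomForestRegressor(
        n_estimators=trial.suggest_int("n_estimators", 200, 800),
        max_depth=trial.suggest_int("max_depth", 4, 30),
        max_features=trial.suggest_float("max_features", 0.2, 1.0),
        random_state=0,
    )
    pipe = Pipeline([("prep", preprocessor), ("rf", rf)])
    return cross_val_score(pipe, X, y, cv=5, scoring="r2").mean()

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=50)
print(study.best_params, study.best_value)
```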
Tree-Based Models: Gradient Boosting
My Gradient Boosting model came out the most accurate, so I'll take a moment to explain how it works. We still train decision trees, but not the way Random Forest does. With Random Forest, we grew many trees and averaged their predictions. With gradient boosting, we instead train a tree and then train the next tree on the errors left over by the first. Repeating this process is called sequential learning.
To use gradient boosting to predict sale price, first take the average of all the sale prices as the initial prediction. Then, train a tree on the differences between each price and this average, values we call 'pseudo-residuals'. Multiply the tree's predictions by a learning rate, a coefficient that slows down how much the model learns from any single tree, and add them to the current predictions. Then the whole process starts again: compute new pseudo-residuals, train a tree on them, and fold it in at the same learning rate. Continuing this incrementally lets us build powerful models. Tuning involves the trees' dimensions and number, as before, but also this new learning rate.
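To make the sequential-learning idea concrete, here is a bare-bones version of that loop with shallow trees on the numeric columns; it's a teaching sketch of the procedure above, not the tuned model I actually used.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Bare-bones gradient boosting for squared error: each tree fits the current
# pseudo-residuals, and its learning-rate-weakened predictions are added to
# the running estimate.
X_num = ames.select_dtypes("number").drop(columns="SalePrice").fillna(0)
y = ames["SalePrice"].to_numpy()

learning_rate = 0.1
pred = np.full_like(y, y.mean(), dtype=float)    # start from the average price
trees = []

for _ in range(200):
    residuals = y - pred                         # pseudo-residuals
    tree = DecisionTreeRegressor(max_depth=3).fit(X_num, residuals)
    pred += learning_rate * tree.predict(X_num)  # nudge predictions toward the truth
    trees.append(tree)
```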
With an R2 of .9197 after tuning, Gradient Boosting takes the lead as the most accurate model I could make, and thus becomes my official suggestion for the realtor's 'black box' calculator. As we assess the performance of tree-based models, we should continue to look at residuals, but also take a close look at something called feature importances.
Tree-Based Models: Feature Importances
Feature importances don't have the 1:1 unit relationship that coefficients do in multiple linear regression. Instead, a feature's importance measures how much it contributes to the trees' splits, and therefore to the model's ability to make accurate predictions. The importances of all the variables add up to 1, so you can read each value as a share of the overall contribution to the trees. At the top of the importances for Gradient Boosting are variables related to square footage and two 'quality' variables. RoomsxBath is interesting to see at the top here too, suggesting that the proportion of bathrooms to total rooms is a key determinant in predicting sale price.
One way to highlight what makes Gradient Boosting so powerful in this instance is to compare its top 20 feature importances to the top 20 for Random Forest. Random Forest is a bit of a blunter instrument, leaning on HQSF for about 35% of its total importance. Gradient Boosting spreads its splits more evenly across more variables, helping to unearth the subtle differences between properties.
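Pulling the importances out of the fitted pipelines and lining them up side by side looks something like this; gb_pipe and rf_pipe are stand-in names for the tuned Gradient Boosting and Random Forest pipelines, not objects defined above.

```python
import pandas as pd

def top_importances(fitted_pipe, n=20):
    """Map a fitted pipeline's importances back to readable column names."""
    names = fitted_pipe.named_steps["prep"].get_feature_names_out()
    importances = fitted_pipe[-1].feature_importances_
    return pd.Series(importances, index=names).sort_values(ascending=False).head(n)

# With the two tuned pipelines in hand (hypothetical names):
# comparison = pd.concat({"GradientBoosting": top_importances(gb_pipe),
#                         "RandomForest": top_importances(rf_pipe)}, axis=1)
# print(comparison)
```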
Conclusions and Next Steps
Machine learning lets us create models that are both remarkably explainable and remarkably accurate, even though there is always a tradeoff between the two. With it, we can build a multiple linear regression model that an intern can work out in a notebook at a customer's kitchen table while still explaining about 83% of the variation in sale price, alongside far more powerful calculators that push that figure above 90%. Balancing explainability and accuracy is one of the most satisfying, creative aspects of machine learning.
My next steps for this project would include emailing Dean De Cock. If I knew more about how 'OverallQual' is calculated, I would feel comfortable including it in my work, which could improve my models' performance across the board by about 1 percentage point. I'd also like to explore binning the neighborhoods into a smaller number of categories.
Thanks for reading. Check out my project about today's stock market or my tips for solving the New York Times crossword puzzles!
Check out my video presentation: