Beware of Feature Importance for Business Decisions
Machine learning (ML) prediction models aren't just used to predict one particular outcome. Sometimes the goal is to understand how variables in a complex system interact to produce an outcome. In real estate, for example, investors might use a model that predicts home value to understand how a variety of home characteristics (aka features) like location, size and style influence home prices. They'd focus on feature importance (FI) metrics: how important each characteristic was to the final prediction.
Understanding FI is a common business case for machine learning. In fact, helping a house flipper improve their return on investment was one of the business cases our bootcamp suggested for ML analysis of a dataset of thousands of homes sold in Ames, Iowa in 2006-2010. Many past students came up with recommendations for house flippers by running ML models to predict sales prices, picking the "best" model, and then extracting a particular type of FI metric that they interpreted as an accurate representation of each feature's contribution to home value.
I also ran a series of predictive ML models, including a linear LASSO regression and two tree-based regressions - Random Forest and XGBoost. They all accurately predicted home prices, but each model reported significantly different FIs, no matter which FI metric I used. Rather than assuming any of my models were correct, I sought to understand how models using the same dataset could all make highly accurate predictions but in very different ways.
The answer was simple: there are many different ways to reach the same answer, so models can make accurate predictions in inaccurate ways. This is by no means a groundbreaking realization. But it's also not something I found to be stated plainly in ML literature.
Data scientists tend to focus on which FI metrics to use and how to use them to best interpret models rather than questioning if these metrics are accurate reflections of reality. Perhaps they assume their audience is already aware that FIs only describe how one particular model used features to predict a target variable, not necessarily how features actually contribute to the target variable in the real system being modeled. To be fair, model interpretability - understanding how a model made its predictions - is the goal in many ML use cases like medical diagnosis where it is vital to understand how a model might make mistakes.
Given that my models disagreed on FIs, the next logical question was which model, if any, was correct. I ran a series of FI validation exercises to answer this, but none of them gave me conclusive evidence that one model was better than another. Ultimately, I concluded that FIs are often unreliable representations of the true importance of a feature to the outcome variable, no matter how accurate the model's predictions.
Inconsistent Importances
I started my work on the Ames dataset with all the general data science best practices to check, clean and transform the data. Then I added a few geographic features and ran two tree-based models (Random Forest & XGBoost) that predicted house prices accurately (both had r² over 0.9).
(For details of this work, see appendix 1.)
Next, I extracted FI metrics including each model's built-in metric, known as "model-intrinsic" FI, and two "model-agnostic" metrics: permuted FI and mean absolute SHAP values (both explained in detail later). I should admit that I was surprised to find that each metric returned quite different FI values from one model to the next. Different FI metrics for the same model also disagreed significantly on the importance of each feature.
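To make the comparison concrete, here is a minimal sketch of how these three metrics can be pulled from a single fitted tree-based model. The names `model`, `X_test` and `y_test` are illustrative placeholders for a fitted regressor and hold-out data, not my exact code:

```python
import numpy as np
import pandas as pd
import shap
from sklearn.inspection import permutation_importance

# 1. Model-intrinsic FI: the importances built into tree-based models
intrinsic_fi = pd.Series(model.feature_importances_, index=X_test.columns)

# 2. Permuted FI: drop in r-squared when each feature's values are shuffled
perm = permutation_importance(model, X_test, y_test,
                              scoring="r2", n_repeats=10, random_state=42)
permuted_fi = pd.Series(perm.importances_mean, index=X_test.columns)

# 3. Mean absolute SHAP value per feature
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_test)
mean_abs_shap = pd.Series(np.abs(shap_values).mean(axis=0), index=X_test.columns)

# Put the three metrics side by side to see how much they disagree
comparison = pd.DataFrame({"intrinsic": intrinsic_fi,
                           "permuted": permuted_fi,
                           "mean_abs_shap": mean_abs_shap})
print(comparison.sort_values("mean_abs_shap", ascending=False).head(10))
```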
Was Dimensionality to Blame?
Too many features is a common problem in machine learning. I thought the dataset might have too many features, including some that were correlated enough to cause the models to rely on one feature in place of another.
A very rough data-science rule of thumb is to have at least 10x, and ideally 100x, rows to columns to give models enough examples to make accurate predictions. One-hot encoding, the process I used to change categorical (text) features into numeric features required by the models, caused a huge expansion in the dimensions of the dataset from roughly 100 to 400+ columns vs 2500 rows of homes.1
So I went back to the data and reduced the feature set by further removing correlated features and aggregating wherever possible. This time I added ANOVA and chi-squared tests to include categorical features in my correlation calculations, which allowed me to remove one feature from each of another handful of correlated pairs. I also logically aggregated similar features.
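A rough sketch of those categorical correlation checks using scipy, assuming a dataframe `df` with the Ames columns (the column names here are illustrative):

```python
import pandas as pd
from scipy.stats import chi2_contingency, f_oneway

# ANOVA: does the mean sale price differ across the categories of a feature?
groups = [g["SalePrice"].values for _, g in df.groupby("Neighborhood")]
f_stat, p_anova = f_oneway(*groups)

# Chi-squared: are two categorical features associated with each other?
contingency = pd.crosstab(df["House Style"], df["Bldg Type"])
chi2, p_chi2, dof, expected = chi2_contingency(contingency)
```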
Instead of one-hot encoding, this time I target-encoded categorical features so each feature remained as one column instead of many.2 The result was a much more reasonable 45 columns to roughly 2500 rows without significant loss of information.
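Here is a minimal sketch of target encoding done by hand; in practice the category means must be learned on the training split only and then mapped onto the test split to avoid target leakage. Column names are illustrative:

```python
import pandas as pd

def target_encode(train, test, cat_col, target_col):
    # Mean target value per category, learned from the training data only
    means = train.groupby(cat_col)[target_col].mean()
    global_mean = train[target_col].mean()
    # Replace each category with its mean price; unseen categories fall back to the global mean
    return train[cat_col].map(means), test[cat_col].map(means).fillna(global_mean)

# e.g. encode "Neighborhood" against "SalePrice"
# train["Neighborhood_te"], test["Neighborhood_te"] = target_encode(train, test, "Neighborhood", "SalePrice")
```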
Unfortunately, this did not have the desired effect on FI. Prediction accuracy improved slightly - with an r² of 0.94 for XGBoost and 0.91 for Random Forest. FI values changed because the features had changed after aggregation and different encoding. But the models continued to disagree significantly, as did different FI metrics per model.
Reality Thrice Removed
Dimensionality reduction clearly wasn't the answer, so I started looking into other reasons why models might disagree on FIs and why different FI metrics for the same model might show such different results. I quickly realized that this was not just a question about my study or the Ames dataset. These sorts of disagreements are quite common in ML.3 There was a much broader and more fundamental question to answer: why are FIs often unreliable measures of both a feature's true importance to the target variable and a feature's importance to the model's predictions?
The answer to both of these questions is that FI metrics are approximations of approximations of approximations. The dataset approximates the real system; the model approximates the dataset; and FI metrics approximate the model. There are many ways a dataset can fail to capture the system being modeled, a model can mistake a feature's true importance, and an FI metric can inadequately interpret the model. Any of these can lead to disagreements across models and/or FI metrics, as well as to FIs that are simply inaccurate.
Failings of Data
Since a dataset only captures certain measurable parts of a real system, it's always an approximation. But for that approximation to be good enough to derive accurate FIs, the dataset must be sufficiently comprehensive, representative and accurate.
A comprehensive dataset covers all relevant features and enough instances (houses) to mathematically distinguish how each feature relates to the target variable. For the Ames dataset, and for nearly all real systems, perfect comprehensiveness is impossible because some features are difficult to measure and/or subjective, like the attractiveness of the house to potential buyers, not to mention features the data-gatherer is completely unaware of, such as whether the home has popcorn ceilings.
To be representative, the dataset needs to cover all major types of instances in the real system. For the Ames dataset, this might mean including very expensive homes where things like square footage or lot area have a very different relationship to price than the average home. The data only covered a small subset of all homes in Ames and we have no way of knowing if that subset was representative of the homes we might want to predict sale prices for in the future.
Models that use datasets that are insufficiently comprehensive, representative and accurate will clearly yield inaccurate feature importances. But shouldn't each model get the same inaccurate feature importances if they all use the same insufficient dataset? The answer is not necessarily, and this brings us to how models can get feature importances wrong.
Failings of FI Metrics
Feature importance is actually a catch-all term for a variety of measures of how a change in a feature's value affects the model's decision-making process or predictions. In fact, different FI metrics calculate different things, so they are usually only proportionally comparable. And much like individual models, each FI metric calculates FIs in a unique way, which can lead to each metric having a different view of FIs for the same model.
The first FI metric I used, known as "model-intrinsic" FI, is built into each model. For tree-based models, model-intrinsic FI metrics measure the degree to which a particular feature was used to create splits across the model's trees. Specifically, it is a measure of the average decrease in prediction error caused by a particular feature.
Note that this metric is not reported in terms of the target variable - in our case US dollars - but rather as a coefficient of importance for each feature. Moreover, model-intrinsic FIs in tree-based models cannot measure feature interactions and are susceptible to whatever splitting bias the models may have. As you can see from the chart below, model-intrinsic FIs differ from one tree-based model to the next.
For linear models like LASSO, model-intrinsic FI measures something entirely different: how a unit change in a feature affects the target variable value (i.e. the coefficients the model assigns to each feature). As a result, model-intrinsic FI is not directly comparable between tree-based and linear models.
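A short illustration of why the two kinds of model-intrinsic FI live on different scales, assuming hypothetical fitted models `xgb_model` and `lasso_model` and training features `X_train`:

```python
import pandas as pd

# Tree importances are unitless relative shares that sum to roughly 1
tree_fi = pd.Series(xgb_model.feature_importances_, index=X_train.columns)

# LASSO coefficients are in units of the (possibly log-transformed) target per unit of each feature
lasso_fi = pd.Series(lasso_model.coef_, index=X_train.columns)

print(tree_fi.sum())                                         # ~1.0: a share of the model's splits/gain
print(lasso_fi.abs().sort_values(ascending=False).head())    # magnitudes depend on feature units and scaling
```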
The second FI metric I used is called permuted feature importance. Permuted FI is calculated by permuting, or randomly shuffling, a feature's values one feature at a time and re-scoring the model's predictions. The difference between the original prediction accuracy and the prediction accuracy with the permuted feature is that feature's importance.4
Like all FI metrics, permuted FI is subject to most of the biases and limitations of whatever model it describes. Moreover, like model-intrinsic FI, it cannot gauge how the model used feature interactions.
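Permutation importance is simple enough to compute by hand, which makes the mechanism clear. This sketch assumes an already-fitted `model` and hold-out data `X_test`, `y_test` (illustrative names); note that the model is only re-scored, never re-trained:

```python
import numpy as np
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
baseline = r2_score(y_test, model.predict(X_test))

permuted_fi = {}
for col in X_test.columns:
    X_perm = X_test.copy()
    X_perm[col] = rng.permutation(X_perm[col].values)                        # break this feature's link to the target
    permuted_fi[col] = baseline - r2_score(y_test, model.predict(X_perm))    # drop in accuracy = importance
```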
The last way I looked at FIs was through Shapley Additive Explanation, or SHAP, values. SHAP values use game theory to measure a feature's relationship to the target variable. They aim to accurately distribute the "payout" (the change in prediction value) among the "players" (the features).
SHAP is more granular than other FI metrics: for each instance (home) in the dataset, it evaluates the model over every possible combination of features. As a result, SHAP can accurately attribute each feature's impact on the prediction even in instances when multiple features interact to produce a single target variable value.
(Please see appendix 2 for an in-depth explanation of SHAP values and feature interactions.)
SHAP values are per instance, not per feature like the other FI metrics, so they must be aggregated by taking the mean of all absolute SHAP values of a particular feature in the dataset. While this aggregation does remove detail, SHAP values are still the best measure of FI available because they include feature interactions and are reported in terms of the outcome variable. Moreover, the SHAP library allows you to see feature importances at the most granular level of a single instance, so it is more versatile than other FI metrics.
That said, SHAP values are still based on the model so they will always reflect the biases and limitations of whatever model they describe. This is why mean absolute SHAP values did not line up across the three models I used.
Feature Importance Validation Techniques
Thus far I've described how and why FI metrics can be unreliable measures of the real systems being modeled. But they are also the best measures available, so it's our job as data scientists to use model (or FI) validation techniques to identify which models deliver the most accurate FIs.
I tried a number of validation techniques: checking for splitting bias by examining tree structures; an FI learning curve to see if the models had enough data to make conclusive calculations of FIs; a sensitivity analysis to see how each model reacted to small changes in the training data; and a deep dive into how each model saw particular features to see if I could find any obvious flaws using logic and domain knowledge.
Splitting Bias Detection
Splitting bias can occur when multiple features have a similar relationship to the target variable. This is the case with groups of highly correlated features, which is why data scientists tend to remove all but one feature in the group. But splitting bias can also occur as a result of substitutable features - features that have only moderate correlations but can act as substitutes for one another in specific tree nodes. Unlike highly correlated features, substitutable features only have a similar relationship to the target variable some of the time, so it does not make sense to remove one of them from the dataset as each feature has unique information for the model to use.
I uncovered a number of substitutable feature pairs in the Ames dataset. I wrote a function to exclude one feature at a time from the dataset, run the model with that feature excluded, and calculate FI metrics. When one feature of a pair of substitutable features was excluded, the other feature would show a much higher mean absolute SHAP value than normal.
Take, for example, the substitutable feature pair of "year built" and "overall quality," which have only a moderate correlation of -0.4. When year built was excluded, there was a significant uptick in the importance of overall quality. This demonstrated that splitting bias could have been the cause of the disagreement in FIs between my tree-based models.
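The exclusion check itself is straightforward. This is a sketch of the kind of function I used, with illustrative names and hyperparameters rather than my exact code:

```python
import numpy as np
import pandas as pd
import shap
from xgboost import XGBRegressor

def mean_abs_shap_without(feature, X_train, y_train, X_test):
    # Drop one feature, refit the model, and return every remaining feature's mean |SHAP|
    X_tr, X_te = X_train.drop(columns=[feature]), X_test.drop(columns=[feature])
    model = XGBRegressor(n_estimators=500, random_state=42).fit(X_tr, y_train)
    shap_vals = shap.TreeExplainer(model).shap_values(X_te)
    return pd.Series(np.abs(shap_vals).mean(axis=0), index=X_te.columns)

# Does overall quality absorb year built's importance when year built is removed?
# print(mean_abs_shap_without("Year Built", X_train, y_train, X_test)["Overall Qual"])
```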
Splitting bias tends to be more likely in Random Forest than XGBoost because of the different ways each model seeks to minimize prediction error. XGBoost uses a process called "boosting" that creates trees sequentially, with each new tree meant to correct the largest errors from the previous tree.
Random Forest doesn't work sequentially; it creates multiple trees all at once. Each tree is built through a process known as "bootstrapping" in which rows of data (homes) in the training set are randomly selected with replacement; this usually results in about two-thirds of the training instances appearing in each tree's sample, because some instances are drawn multiple times and others not at all.
XGBoost's approach usually results in a broader mix of features used in higher nodes, while Random Forest's approach is more likely to rely on fewer features in higher nodes. This is precisely what happened in my models.
Unfortunately, this is not proof of splitting bias in the Random Forest model. Proving splitting bias is extremely difficult if not impossible, so I consider this to be one piece of evidence against using the RF model's FIs. On the other hand, the three features the Random Forest model may have overused make a lot of intuitive sense: a home's size (total finished square footage), quality and neighborhood do tend to be the most important things to potential buyers.
Feature Importance Learning Curves
Learning curve charts usually show how an ML model's performance (y-axis) changes as the amount of data (x-axis) used to train the model increases. When models have enough (representative and accurate) data, their prediction accuracy tends to level off so that adding data doesn't have much of an effect. I adapted this format to create FI learning curve charts for each model, illustrating changes in mean absolute SHAP values as the training dataset increased from a small share of the original training data to all of it.
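A sketch of how such an FI learning curve can be built, refitting on growing slices of the training data and tracking mean absolute SHAP values (model settings and variable names are illustrative):

```python
import numpy as np
import pandas as pd
import shap
from xgboost import XGBRegressor

curves = {}
for frac in np.linspace(0.1, 1.0, 10):
    n = int(frac * len(X_train))
    model = XGBRegressor(n_estimators=500, random_state=42).fit(X_train.iloc[:n], y_train.iloc[:n])
    shap_vals = shap.TreeExplainer(model).shap_values(X_test)
    curves[frac] = pd.Series(np.abs(shap_vals).mean(axis=0), index=X_test.columns)

# Rows = training fraction, columns = features; stable FIs should flatten out down each column
fi_curve = pd.DataFrame(curves).T
```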
For both models, most features' mean absolute SHAP values stabilized to some degree with more data, but never flattened out. In other words, the models likely didn't have enough data to make firm FI conclusions, so the conclusions they came to aren't all that reliable.
Feature Importance Sensitivity Analysis
Next I checked how sensitive each model was to small changes in the training data. This time I fed the model the same number of instances as were in the full training set, but sampled with replacement (bootstrapping).
The chart shows the mean (point) and standard deviation (error bars) of each feature's mean absolute SHAP value over 500 runs of the model. The small standard deviations suggest a low sensitivity to small changes in training data, and the chart was similar for both models. In other words, both models' FI conclusions would have been roughly the same no matter how the data was split for training and testing, so neither was particularly sensitive.
You might be wondering the same thing I was when I first made this chart: how can the models form unstable, but also consistent, conclusions about FIs? The answer is that the models were consistently wrong about FI to some degree. If they had enough data to form more stable conclusions, the learning curves would show a flattening out and the mean values of 500 bootstrapped iterations would be different - more accurate - values, but equally - or likely even more - consistent (small standard deviations).
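For reference, here is a sketch of the bootstrap procedure behind that chart: resample the training set with replacement, refit, and record each feature's mean absolute SHAP value over many runs (the run count and model settings are illustrative):

```python
import numpy as np
import pandas as pd
import shap
from xgboost import XGBRegressor

rng = np.random.default_rng(42)
runs = []
for i in range(500):
    idx = rng.integers(0, len(X_train), size=len(X_train))   # bootstrap sample, same size as the original
    model = XGBRegressor(n_estimators=500, random_state=i).fit(X_train.iloc[idx], y_train.iloc[idx])
    shap_vals = shap.TreeExplainer(model).shap_values(X_test)
    runs.append(np.abs(shap_vals).mean(axis=0))

runs = pd.DataFrame(runs, columns=X_test.columns)
stability = runs.agg(["mean", "std"]).T   # the point and error bar for each feature
```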
Individual Feature Analysis
My final method of model validation was the old-fashioned smell test: did one of the models show FI values that just didn't make sense? I started the process with charts of the most important individual features vs their SHAP values for both tree-based models. In general, the models showed significant disagreement, especially at the higher (usually meaning better) ends of each feature's value range. This makes sense as large and/or top-quality homes tend to be rare, have lower value per unit size, and have buyers who want different things than the average home buyer.
For the most part, the models agreed on the general trend (positive or negative) of each feature's relationship to the target variable. I only found one feature, total basement square footage, where one model showed a positive relationship to home price and the other negative.
I took a deeper look into this apparent error by plotting SHAP dependence plots (as above), but this time just for the XGBoost model and with the color of each point determined by another variable that had high average interaction values with basement size. I created two charts, one with overall home quality as the interacting feature and another with total finished square footage, since these features had the largest interactions with basement size. The chart colored by overall quality shows that, at any given basement size below 1500 sq. ft., the basement adds more value in lower quality homes than in higher quality homes - again, a counterintuitive relationship. The chart colored by total finished sq. ft. showed no clear trend.
Next, I checked SHAP interaction plots, which represent the combined impact of two features on the prediction beyond what would be predicted by each feature on its own. This chart shows that, beyond the independent SHAP values for these two features, for any basement size below 1600 sq. ft. or so, having a low quality home adds an additional $1,000 to the home value, having a medium quality home subtracts roughly $1,000 and having a high quality home subtracts roughly $2,000.
Again, this is clearly counterintuitive. Interestingly, the Random Forest model showed a similar interaction relationship for these factors where having a high quality home and a basement size lower than 1600 sq. ft. subtracts around $5,000 from the home value beyond how these features would each affect home values alone.
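Both kinds of plot come straight from the shap plotting API. A sketch, assuming a fitted `xgb_model` and test features `X_test`; the column labels are how I refer to the features in the text and may not match the dataset's exact names:

```python
import shap

explainer = shap.TreeExplainer(xgb_model)
shap_values = explainer.shap_values(X_test)

# Dependence plot: basement size vs its SHAP value, colored by overall quality
shap.dependence_plot("Total Bsmt SF", shap_values, X_test, interaction_index="Overall Qual")

# Interaction values: the extra effect of basement size and quality acting together,
# beyond each feature's independent SHAP value
interaction_values = explainer.shap_interaction_values(X_test)
shap.dependence_plot(("Total Bsmt SF", "Overall Qual"), interaction_values, X_test)
```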
There are a number of potential explanations for these relationships. It's possible that buyers like a balance between quality and size and that high quality doesn't compensate for small basement sizes. A similar relationship is evident in the interaction values of total finished square feet and overall quality. It could also be due to the influence of other features like neighborhood, or of unknown factors such as whether the basement was finished or when the home was built.
Ultimately, it may not be possible to tease out exactly why the model found this specific relationship. But this single, apparently counterintuitive relationship certainly doesn't negate the validity of all XGBoost FIs.
What the example does provide, however, is a good final illustration of why FI metrics can be unreliable. The relationships of features to other features and the target variable are not causal and can have any number of explanations, especially when using more complex models like tree-based ensembles or neural networks. In other words, just because we know the relationship of a feature to the target variable doesn't mean we know why that relationship exists. Even with SHAP, which does a good job of accounting for interactions and properly distributing value to show the true relationship of each feature to the target variable, there can be counterintuitive relationships that we can't explain but may still be correct.
Feature Importances Are A Hypothesis, Not A Conclusion
For this particular dataset, the evidence suggests that FIs derived from ML models are not reliable representations of reality, no matter how accurate the model's predictions. This is not just a pedantic data scientist conclusion. No business should use one of these models to make investments based on expected ROI from particular home improvements because the models come to wildly different conclusions.
Here's an example of what the three different models found for the ROI on three types of home improvements. Just looking at homes valued between $100k and $200k (most homes in Ames at the time), a kitchen upgrade from good to great quality could return anywhere from $2,100 to $9,400; adding a third fireplace could return anywhere from $34 to $6,000; and improving the basement from good to great quality could return anywhere from nothing to $1,300! SHAP values, which take all potential interactions into account, show similarly huge differences in ROI per model.
In a broader sense, we can conclude that accurately predicting the target variable is just one prerequisite for a model to have accurate FIs. Getting FIs right is orders of magnitude more difficult and requires much more data than accurately predicting a single target variable. A full explanation would involve complex math, but it really comes down to three important points:
- FI and predictions are not synonymous: FIs are akin to how a system works, while target variable prediction is akin to what a system does.
- There are many features but only one target variable to predict. Especially with complex models that attempt to gauge feature interactions, each feature can have complicated, non-linear relationships with other features, which exponentially increases the amount of understanding required.
- The target variable of home prices has a ground truth for a model to test against and to guide adjustments. In contrast, datasets don't come with examples of correct FIs.
If data scientists aim to use FIs to explain the system being modeled, rather than just the model itself, then FIs should be treated as hypotheses in need of validation. Unfortunately there is nothing as precise as statistical methods like the t-test or p-value to understand if a model's FIs apply to the larger reality being modeled. But there are still some decent validation techniques available.
The first and best test is to create multiple models and compare them. If the dataset is comprehensive, representative and accurate, the models predict accurately and their FIs align, that's a decent sign that the models have generalizable FIs. If models differ significantly in FIs - and they will always differ to some degree because they work differently - I've outlined a number of other methods to try to validate FIs:
- Removing highly correlated or redundant features to ensure that the model assigns importance correctly to each feature.
- Model bias analysis like investigating tree structures for evidence of splitting bias.
- Analyzing the consistency of FIs using FI learning curves.
- Gauging the sensitivity of FIs to small changes in training data by running multiple model iterations with bootstrapping.
- In-depth analysis of the relationships between individual features, and between features and the target variable, to see if one model got things wrong.5
It's also important to note that this dataset was quite small and static. In real businesses, datasets tend to be much larger and continually expanding as new data is collected. That means models can be rerun regularly with additional data to shore up FI conclusions. If possible, it is also wise to validate FIs with isolated experiments (e.g. A/B tests) and causality analysis.
Appendix 1: Best Practices & Accurate Predictions
I started my work with an exploratory data analysis (EDA) including all the necessary checks, data cleaning and transformations. First, I explored the target variable's distribution (skewness and kurtosis, or shape) as well as various features' relationships to the target variable. Home price was not normally distributed and many features had a nonlinear relationship to home price, which suggested linear models were not a good choice. Most of this was solved, however, by a simple logarithmic transformation of the target variable. I explored other features to get a better understanding of the data, and to find and remove outliers. I looked into feature correlations in a variety of ways (correlation heatmaps, pairplots, etc.).
Most of the cleaning for this dataset involved dealing with missing data. I was able to impute (fill in with estimates) most null or missing values, though I did need to remove some rows (homes) entirely. I also removed highly correlated numeric features that might confuse models, along with features with very low or no variance that are essentially useless to a model. To gauge multicollinearity (when a feature is highly correlated with a combination of other features), I calculated the Variance Inflation Factor (VIF). Significant multicollinearity and the prevalence of nonlinear relationships made tree-based models the obvious choice.
Transformations (changing the form or format of data) primarily involved encoding categorical variables, meaning changing them into numeric variables so they can be used in models. Many categorical features were levels of quality like excellent, good, fair, poor. These could all be label-encoded by replacing each level with a number (1 is worst). For true/false, yes/no type features, I used binary encoding (a zero or a one).
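A minimal sketch of those two encodings, assuming a dataframe `df`; the quality scale and column names are illustrative rather than my exact mapping:

```python
import pandas as pd

# Ordinal (label) encoding: quality levels become integers, 1 = worst
quality_map = {"Po": 1, "Fa": 2, "TA": 3, "Gd": 4, "Ex": 5}
for col in ["Exter Qual", "Kitchen Qual", "Bsmt Qual"]:
    df[col] = df[col].map(quality_map)

# Binary encoding: yes/no features become 0/1
df["Central Air"] = (df["Central Air"] == "Y").astype(int)
```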
Finally, I added a few geographic features. I found an R package with longitude and latitude data for each home and added those as features, as well as a feature of each home's distance from the center of the city. I also did some geospatial analysis (including K-means clustering) to see if there was a better way to spatially group homes than the given neighborhood feature. However, I didn't find anything compelling.
Next up was modeling. I built pipelines to one-hot encode the remaining categorical features. Then I ran two models - Random Forest and XGBoost regressors - with the proper precautions and optimizations like cross-validation and grid search to tune hyperparameters. Both models returned very high r² prediction scores of 0.91 and 0.93 respectively. For those unfamiliar with r² values, that worked out to models with average predictions less than +/- 10% off true home values.
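For reference, here is a sketch of what such a pipeline can look like with scikit-learn and XGBoost; the parameter grid and variable names are illustrative rather than my exact settings:

```python
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from xgboost import XGBRegressor

categorical = X_train.select_dtypes(include="object").columns
preprocess = ColumnTransformer([("onehot", OneHotEncoder(handle_unknown="ignore"), categorical)],
                               remainder="passthrough")

pipe = Pipeline([("prep", preprocess), ("model", XGBRegressor(random_state=42))])

grid = GridSearchCV(pipe,
                    param_grid={"model__n_estimators": [300, 600],
                                "model__max_depth": [3, 5, 7],
                                "model__learning_rate": [0.03, 0.1]},
                    scoring="r2", cv=5)
grid.fit(X_train, y_train)
print(grid.best_params_, grid.score(X_test, y_test))   # cross-validated tuning, then hold-out r-squared
```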
Appendix 2: SHAP Values & Feature Interactions
A feature's SHAP value is the amount by which it raises or lowers the prediction value from the average target variable value, known as the base value. For each instance in the dataset, the SHAP value formula is: prediction = base value + the sum of every feature's SHAP value (each of which can be positive or negative).
The waterfall chart is a common way to view this formula in action. As you can see, the chart begins from the bottom at the average Ames home value of $180,000 and shows how various features of this particular home add to (red), or subtract from (blue), that base value to finally reach the model's predicted price for that home of $153,000 at the top. Each feature's value is also given, e.g. this home had an Overall Quality score of 6 (out of 10) that caused its predicted value to decrease by about $8,200 from the base value.
Note that the SHAP values include feature interactions. To explain feature interactions, let's take a simplified, hypothetical example of a model with only two features: kitchen quality and total square footage. At a kitchen quality score of 4 out of 5 and a total square footage of 2,500, the model predicts the home price would be $200,000. Now let's assume the kitchen quality score is updated to 5/5, and the model updates its home price prediction accordingly to $210,000. One might then conclude that the increase in kitchen quality raised the home value by $10,000.
Unfortunately, things are not often that simple. A 2,500 sq ft home is fairly large, so top quality kitchens could be more valuable to its potential buyers than they would be to frugal buyers interested in small homes with adequate kitchens. For this home, the model might attribute $7,000 of that $10,000 increase to the kitchen upgrade and the other $3,000 to the total square footage to account for the interaction value. The SHAP value calculations would come to the same conclusion.
It's important to point out that SHAP values aren't abstractions that remove information from the model's logic like other FI metrics. They just describe the model from the specific point of view of the impact of each feature on the target variable. In other words, the model's logic includes all interaction values; SHAP just restates the model's logic in terms of each single feature and instance.
SHAP values are per instance (home) while other FI metrics are per feature. Ultimately, all metrics valued on a per feature basis are aggregations of some sort, like an average of a feature's reduction in prediction error in permuted importance or just a "best fit" line in a linear model. SHAP values are most commonly viewed in a semi-aggregated format called a SHAP summary plot.
For a particular feature like total finished square footage, the summary plot shows the amount the feature added to, or subtracted from, the base value for every instance in the dataset. The SHAP summary plot from the XGBoost model on the left shows total finished square footage subtracted up to about $50,000 from the base value for some homes and added up to $180,000 for others.
To get the average impact of a feature on the target variable for all instances in the dataset, you simply average the absolute value of all SHAP values for that feature. This tells us how much a feature like total finished square footage influenced home prices on average, but not if the influence was positive or negative.
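In code, that aggregation is a one-liner on top of the per-instance SHAP matrix (the names `xgb_model` and `X_test` are illustrative):

```python
import numpy as np
import pandas as pd
import shap

shap_values = shap.TreeExplainer(xgb_model).shap_values(X_test)   # one row per home, one column per feature

# Per-feature importance: mean of the absolute per-instance SHAP values,
# in the units of the target (dollars, or log-dollars if the target was log-transformed)
mean_abs_shap = pd.Series(np.abs(shap_values).mean(axis=0), index=X_test.columns).sort_values(ascending=False)

shap.summary_plot(shap_values, X_test)   # the semi-aggregated, per-instance view discussed above
```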
Footnotes
1 One-hot encoding turns categorical features into binary numeric features. For example, instead of one feature (column) like house style with categories like ranch house, split level and bungalow, one-hot encoding makes 3 features (columns) - one for each house style - and marks homes (rows) of that style with a one and all other homes with a zero.
2 Target encoding replaces each category of each feature (e.g. ranch house in the home style feature) with the average target variable (home price) value for all homes in that category (all ranch houses).
3 See here or here, for example.
4 Unlike model-intrinsic FI metrics that are calculated differently for each model type, permuted FI is a "model-agnostic" metric because it is calculated in the same way regardless of model type. Model-agnostic FI metrics are often misunderstood to mean that FIs are somehow calculated separately from the model, so they will get the correct FIs regardless of the model used. This misconception is rampant across forums like stackexchange.com and betrays a widespread and dangerous misunderstanding of these metrics.
5 It may also be possible to use simpler, linear models with only a few uncorrelated features to get a better understanding of how these features impact the target variable. That is one method I would look into for future work.