Certainty in Healthcare Fraud Detection: Peering into the Black Box
The skills I demoed here can be learned by taking the Data Science with Machine Learning bootcamp at NYC Data Science Academy.
Healthcare in the United States suffers from an increasingly arcane ecosystem of insurance reimbursement requirements and nebulous medical fees, turning what should be a simple transaction into an angst-inducing mess for all parties involved: patients, clinicians, and insurance providers alike. One major source of this angst is the persistent threat of Healthcare Fraud.
In an ideal world, insurance companies would regularly discover and eliminate all actual sources of fraud, since fraudulent claims drive up expenses that are often passed on to patients in the form of higher deductibles and co-pays. In reality, however, fraud investigations and litigation are expensive pursuits in their own right, and their costs can inflate premiums as insurers protect their bottom line. Either way, healthcare fraud ends up harming the very patients the medical insurance industry should be protecting (see IJAM, Dippolito et al., 2016).
Our Goal & Solution
To mitigate this crisis, we developed a comprehensive model that detects healthcare fraud not only with a high degree of accuracy, but also with a clear degree of certainty. By minimizing False Positives through consensus and False Negatives through rebalancing, we can offer assurance that the costs of fraud investigations will bear fruit. This in turn maximizes the recovery of fraudulent payments and avoids the financial and professional strain of false accusations. Finally, we provide advice for our clients regarding which changes to their data collection practices would better serve future modeling efforts.
Observations & Critique of the Data
Our data come to us in several pieces, which we will have to collate and combine. These consist of:
- Beneficiary Information: 138,556 Beneficiaries
- Inpatient Claims: 40,474 Claims
- Outpatient Claims: 517,737 Claims
- Providers Flagged as ‘Potentially Fraudulent’: organized by Provider ID codes
NB: These data describe Medical Institutions as ‘Providers,’ which does not reflect modern terminology; a Provider, in modern medical usage, is a healthcare clinician, such as a Doctor, Nurse Practitioner, or Physician Assistant. For the duration of this post, we will use the data source’s definition of Provider as a medical institution.
As we familiarized ourselves with the data, we began to see areas of concern. In order to validate our thinking and learn more, we consulted with family members and friends in the medical field, who were able to shed light on these areas, as well as point out additional missing elements that they felt should be in the data.
In summary, the vast majority of our features relate to claim submissions and beneficiary information, with very little information about the medical institutions themselves - the very source of the fraud we are investigating. We would recommend that our client seek out more detailed features about these institutions, both to improve our model and to discover additional insights into which types of medical institutions tend toward fraudulent behavior.
Furthermore, identifying the scale of previous fraud can help give context to whether fraudulent practices are endemic within a medical institution or a one-off event. However, with some careful feature engineering, we can use what we have to meet our stated goal regardless.
Combine & Conquer: Preparations for EDA & Processing
The first steps of our processing will be to render our data into a coherent, manageable format. To accomplish this, we approached the process in a careful, stepwise fashion.
Step 1: Combined Claims
Generate Inpatient Information:
- Length of Stay
- Per Diem Cost
- Admission Code Match
Combine Inpatient & Outpatient Claims
Generate Features for All Claims
- Length of Claim
- Claim Starting Month
- Number of Codes in a Claim
- Number of Physicians in a Claim
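Step 1 can be sketched in pandas with toy claims; every column name below is an illustrative assumption, and the actual dataset's columns may differ:

```python
import pandas as pd

# Toy inpatient/outpatient frames; column names are illustrative assumptions.
inpatient = pd.DataFrame({
    "ClaimID": ["C1", "C2"],
    "AdmissionDt": pd.to_datetime(["2019-01-01", "2019-02-10"]),
    "DischargeDt": pd.to_datetime(["2019-01-05", "2019-02-12"]),
    "ClaimStartDt": pd.to_datetime(["2019-01-01", "2019-02-10"]),
    "ClaimEndDt": pd.to_datetime(["2019-01-05", "2019-02-12"]),
    "InscClaimAmtReimbursed": [4000, 1500],
})
outpatient = pd.DataFrame({
    "ClaimID": ["C3"],
    "ClaimStartDt": pd.to_datetime(["2019-03-01"]),
    "ClaimEndDt": pd.to_datetime(["2019-03-01"]),
    "InscClaimAmtReimbursed": [200],
})

# Inpatient-only features: length of stay (inclusive) and per-diem cost.
inpatient["LengthOfStay"] = (inpatient["DischargeDt"] - inpatient["AdmissionDt"]).dt.days + 1
inpatient["PerDiemCost"] = inpatient["InscClaimAmtReimbursed"] / inpatient["LengthOfStay"]

# Combine inpatient & outpatient claims, flagging the claim type.
claims = pd.concat([inpatient.assign(IsInpatient=1),
                    outpatient.assign(IsInpatient=0)], ignore_index=True)

# Features shared by all claims.
claims["LengthOfClaim"] = (claims["ClaimEndDt"] - claims["ClaimStartDt"]).dt.days + 1
claims["ClaimStartMonth"] = claims["ClaimStartDt"].dt.month
```

Outpatient rows simply carry NaN for the inpatient-only columns after the concatenation, which is handled in the imputation step later.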
Step 2: Beneficiary
Merge Beneficiary Information to Each Claim & Generate Beneficiary Features:
- Age at Visit
- Percentage of Annual Deductible used for Claim
- Percent of Annual Reimbursement for Claim
- Percentage of Claim Amount Paid by Insurance
Aggregate Beneficiary Information
Create Features About All Visits in a Year
- Number of Visits, Providers Seen, Physicians Seen
- Number of Inpatient vs. Outpatient Claims
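The beneficiary merge and per-year aggregation of Step 2 can be sketched in pandas; every column name here is an illustrative assumption:

```python
import pandas as pd

# Illustrative beneficiary and claim tables; names are assumptions, not the dataset's.
bene = pd.DataFrame({
    "BeneID": ["B1", "B2"],
    "DOB": pd.to_datetime(["1950-06-15", "1940-01-01"]),
    "AnnualDeductibleAmt": [1000, 1000],
})
claims = pd.DataFrame({
    "ClaimID": ["C1", "C2", "C3"],
    "BeneID": ["B1", "B1", "B2"],
    "Provider": ["P1", "P2", "P1"],
    "ClaimStartDt": pd.to_datetime(["2019-01-01", "2019-07-01", "2019-03-01"]),
    "DeductibleAmtPaid": [250, 250, 1000],
    "IsInpatient": [1, 0, 0],
})

# Merge beneficiary info onto each claim and derive per-claim features.
df = claims.merge(bene, on="BeneID", how="left")
df["AgeAtVisit"] = (df["ClaimStartDt"] - df["DOB"]).dt.days // 365  # approximate years
df["PctDeductibleUsed"] = df["DeductibleAmtPaid"] / df["AnnualDeductibleAmt"]

# Aggregate per beneficiary-year: visit counts, providers seen, claim mix.
df["Year"] = df["ClaimStartDt"].dt.year
per_year = df.groupby(["BeneID", "Year"]).agg(
    NumVisits=("ClaimID", "count"),
    ProvidersSeen=("Provider", "nunique"),
    InpatientClaims=("IsInpatient", "sum"),
)
```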
Step 3: Provider Information
Aggregate Provider Information
Create Features for Each Provider
- Number of Claims
- Number of Claims by Month
- Number of Unique Patients Seen
- Number of Physicians at Provider
- Number of Unique Diagnosis Claim Codes Used
- Number of Unique Procedure Claim Codes Used
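Step 3's provider-level aggregation reduces to a single `groupby`; the claim columns here are illustrative assumptions:

```python
import pandas as pd

# Claim-level frame with illustrative columns.
claims = pd.DataFrame({
    "ClaimID": ["C1", "C2", "C3", "C4"],
    "Provider": ["P1", "P1", "P1", "P2"],
    "BeneID": ["B1", "B2", "B1", "B3"],
    "AttendingPhysician": ["D1", "D2", "D1", "D3"],
    "ClmDiagnosisCode_1": ["401", "250", "401", "786"],
})

# One row of features per Provider: claim volume, patient traffic, staff, code variety.
provider_features = claims.groupby("Provider").agg(
    NumClaims=("ClaimID", "count"),
    UniquePatients=("BeneID", "nunique"),
    NumPhysicians=("AttendingPhysician", "nunique"),
    UniqueDiagnosisCodes=("ClmDiagnosisCode_1", "nunique"),
)
```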
Step 4: Final Aggregation & Imputation
- Aggregate All Binary & Numerical Data by Mean to Produce New Features for Xtrain DataFrame
- Locate Null (NaN) Values in Data
- For Inpatient-related Missing Data, Impute "0" (NaN is expected when there was no admission)
- For all other Missing Data, Impute Median Value
- Locate Infinite Values in Data; Impute Median Value
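Step 4's imputation rules can be sketched as follows, with a toy frame standing in for the merged data:

```python
import numpy as np
import pandas as pd

# Toy merged frame; NaN in inpatient columns means "no admission",
# and inf can arise from divide-by-zero in engineered ratio features.
df = pd.DataFrame({
    "LengthOfStay": [5.0, np.nan, 3.0],
    "PerDiemCost": [800.0, np.nan, np.inf],
    "ClaimAmt": [4000.0, np.nan, 1500.0],
})

inpatient_cols = ["LengthOfStay", "PerDiemCost"]

# Inpatient-related gaps mean "no admission": impute 0.
df[inpatient_cols] = df[inpatient_cols].fillna(0)

# Replace infinities with NaN, then impute all remaining gaps with column medians.
df = df.replace([np.inf, -np.inf], np.nan)
df = df.fillna(df.median())
```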
EDA: Charting a Course on a Sea of Data
With the majority of our pre-processing out of the way, we can now get a good sense of the data. As a matter of course, we inspected the correlations among our features; between the respective correlation and missingness-correlation heat maps, we found little unexpected correlation.
Beneficiary Information - the Problem of Biased Data
From there we turned our attention to examining the beneficiary data, considering that it comprises a large portion of the information we received. Much of what we found supported some of the weaknesses of the data that we had observed from the beginning.
Claims Demographics by Gender & Race
Claimant Ages at Start of Claim
In the generalized breakdown of our claims demographics by Race & Gender, we can already see significant demographic imbalance. Looking at the populace by age, we can also see that the vast majority of claimants are significantly older. These two indicators of demographic imbalance in the beneficiary information highlight some specific disparities within the data.
- Many Americans make notably more healthcare visits - and therefore generate more claims - only after retirement, suggesting a strong relationship with Medicare benefits.
- Employing demographic information to infer relationships based on race and gender can introduce bias and inaccuracy, as data drawn from a systemically biased system will invariably yield systemically biased results. Although keeping these features in a general application like this can constitute poor ethical practice, we left them in for the sake of this exercise and for an example discussed below.
Inpatient vs. Outpatient
Turning our attention towards the difference between Inpatient and Outpatient data, we first looked at one of the more prominent elements of the data: chronic conditions. While we expected a higher rate of chronic conditions among hospitalized patients, we were surprised at how much more fraud was found in an inpatient setting. It is hard to say whether this relationship is causal or merely correlational, but inpatient claims are clearly far more likely to be associated with fraudulent behavior.
Discussion & Predictions
Given the dearth of information on medical institutions, we realized that the best source of feature generation for our model would come from the beneficiary data. By merging in the data that represents beneficiaries’ relationships with their “Provider,” we can learn about the overall patient population seen in a medical institution and more details on the visits that create our insurance claims. To this end, we postulated that our predictive model would find significance in the following variable types:
- Financial Data
- Provider Population Data
- Beneficiary Demographic Data
HyperTransformation Tuning: Fitting the Data to the Model
HyperTransformation Tuning is a novel approach developed by David Gottlieb that looks for which transformations of the dataset lead to the best scores for a default Machine Learning model or models. In summary, instead of testing multiple variations of a single machine learning model, one tests multiple variations of dataset transformations.
We chose to use this approach because:
- Most machine learning models ship with sensible, well-tuned default settings
- Transformations are significantly faster operations than HyperParameter Tuning
- It does not lock us into a single machine learning model or method
- It enables the use of stacking or ensembling techniques to build consensus and increase certainty in predictions
In the end, we chose 4 different transformation approaches to test against 7 distinct machine learning classifiers. These 7 models span a wide range of classification algorithms, including boosting, logistic regression, k-nearest neighbors, linear discriminant analysis, Naive Bayes, and support vector machines.
0 - Unprocessed
1 - Scaled
2 - PCA (Scaled + PCA)
3 - PCA Boosting (Scaled, PCA-Transformed, re-merged to original dataset)
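The four transformation treatments can be sketched with scikit-learn; the feature matrix and component count below are placeholders:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 6))  # stand-in feature matrix

# 0 - Unprocessed
X0 = X

# 1 - Scaled (zero mean, unit variance per column)
X1 = StandardScaler().fit_transform(X)

# 2 - PCA (scale, then project onto principal components)
X2 = PCA(n_components=3).fit_transform(X1)

# 3 - "PCA Boosting": append the PCA components back onto the original features
X3 = np.hstack([X, X2])
```

Each of these four matrices is then fed to each default-settings classifier, and the best-scoring (transformation, model) pairs are kept.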
To test if the HyperTransformation Tuning’s ‘default’ model approach could replace HyperParameter Tuning, we also applied HyperParameter Tuning to each of the 4 transformations in order to create an ‘optimized’ version.
What these results showed was that HyperTransformation Tuning is a great tool for identifying which dataset transformation best suits a specific ML model. Only 2 of the 7 models shared a best dataset transformation, showing how critical it is to test different dataset treatments for different models.
Furthermore, HyperParameter Tuning was arguably unnecessary for improving these HyperTransformation-treated models, as the mean improvement in test accuracy was only 0.0009 for every model except Naive Bayes. Ultimately, all of this optimization helped generate models with excellent accuracy for each of our 7 classifiers.
Accuracy Versus Reality
When looking at the percentage of predicted fraud among our testing holdout set, it quickly became apparent that all 7 of our classifiers were likely under-predicting the true incidence of fraud. Our original dataset had 9.35% of ‘Providers’ flagged as fraudulent, but our 7 models predicted only 2% - 6.36% as fraudulent: a 32% - 79% reduction versus what we would expect to find. While these models’ fraud predictions have very high accuracy, many incidences of fraud are likely being left undetected (false negatives).
This kind of under-prediction is generally indicative of an imbalanced training dataset, which held in this case: only 1 in 11 samples in our original dataset is fraudulent. To correct this, we used SMOTE, which synthesizes new, non-duplicate fraudulent observations by interpolating between an actual fraud observation and a randomly chosen nearby fraudulent neighbor, bringing the ratio of fraudulent to non-fraudulent observations to a balanced 50/50 split. We then ran the entire HyperTransformation Tuning process a second time with the balanced dataset, which had mixed results (see fig. below).
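SMOTE's core interpolation step can be illustrated with a minimal numpy sketch. This is a simplification with made-up data and a hypothetical function name; a production pipeline would use a full implementation such as imbalanced-learn's `SMOTE` class:

```python
import numpy as np

def smote_like(X_minority, n_new, k=3, seed=0):
    """Sketch of SMOTE's core step: synthesize minority-class samples by
    interpolating between a real sample and one of its k nearest
    minority-class neighbors."""
    rng = np.random.default_rng(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_minority))
        x = X_minority[i]
        # Distances to all minority samples; skip the point itself at rank 0.
        d = np.linalg.norm(X_minority - x, axis=1)
        neighbors = np.argsort(d)[1:k + 1]
        nb = X_minority[rng.choice(neighbors)]
        lam = rng.random()  # interpolation weight in [0, 1)
        synthetic.append(x + lam * (nb - x))
    return np.array(synthetic)

# Four toy "fraud" observations at the corners of the unit square.
fraud = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
new_points = smote_like(fraud, n_new=4)
```

Because each synthetic point lies on a segment between two real fraud observations, the new samples stay inside the region the minority class already occupies rather than being pure noise.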
For our boosting algorithms and support vector machine, the rebalancing of the dataset worked flawlessly, leading to a predicted fraud percentage within roughly a percentage point of our original training data. However, our other 4 models now suffer from the opposite problem, severely over-predicting the likely incidence of fraud at approximately 2-4 times the amount originally found.
These models also have lower accuracy scores than their original counterparts, likely due to this severe over-prediction. Since false positives in fraud detection can mean costly investigations into non-existent fraud, we chose to leave these 4 models behind before heading into our final step.
Collecting & Refining Our Results
Now that we have significantly refined our model results, we must attend closely to our goals: to directly benefit the client’s bottom line by balancing the elimination of False Positives against the elimination of False Negatives. Eliminating the former ensures successful investigations, while reducing the latter ensures we leave as few pursuable cases of fraud on the table as possible, maximizing returns on damages.
Rather than narrow our perspective by using Precision or Recall alone, we can balance both priorities by optimizing for the F1 Score, the harmonic mean of the two. CatBoost, LightGBM, and the Support Vector Classifier all produced excellent results, but if we had to choose a single model, we would go forward with CatBoost: it is tied with LightGBM for the best F1 Score but has a slightly higher accuracy score.
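The F1-based selection step can be sketched as follows; the labels and predictions are hypothetical stand-ins, not our actual holdout results:

```python
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical holdout labels (1 = fraud) and two models' predictions.
y_true = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1]
preds = {
    "model_a": [0, 0, 0, 0, 0, 0, 1, 1, 1, 0],
    "model_b": [0, 0, 0, 0, 0, 0, 0, 1, 1, 0],
}

# Score each model as (F1, accuracy).
scores = {name: (f1_score(y_true, p), accuracy_score(y_true, p))
          for name, p in preds.items()}

# Pick the best F1; the tuple comparison breaks ties on accuracy,
# mirroring the CatBoost-over-LightGBM choice.
best = max(scores, key=lambda name: scores[name])
```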
The Devil in the Details
By investigating the Permutation Importance of our best model, we can begin to get a feeling for what played the most important roles in indicating a fraudulent provider. Here we see several significant trends:
- Payout-Related Information: Indicative of monetarily motivated fraud
- Inpatient Processing Records: Suggestive that fraud is linked with Inpatient claims
- Patient Traffic Information: Bigger institutions with vastly more traffic are likely to have more unintentional fraud, that is fraudulent behavior as a result of clerical error
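Permutation importance itself can be demonstrated on a stand-in model. In the synthetic data below, only the first feature carries signal, so shuffling that column should degrade the score far more than shuffling the noise columns:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
# Label depends only on the first feature; the other two are noise.
y = (X[:, 0] > 0).astype(int)

model = LogisticRegression().fit(X, y)

# Shuffle each column in turn and measure how much the score drops.
result = permutation_importance(model, X, y, n_repeats=10, random_state=0)

# Rank features by mean importance; the informative one should dominate.
ranking = np.argsort(result.importances_mean)[::-1]
```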
This trend suggests that either fraud is genuinely more common at larger medical institutions, or that larger institutions simply present more opportunities for unintentional fraud through clerical error. We would advise a client to dedicate resources to determining the nature of the fraud at these institutions so that malicious fraud can be prioritized.
Sifting through the results, however, we see - as predicted at the beginning of our analysis - that a patient’s demographic information played an outsized role; specifically, a patient’s race, or rather, the percentage of an institution’s patient population that is Hispanic.
While we managed to achieve solid results, a careless data scientist who built a model like this without accounting for beneficiary demographics and the confounds they introduce could create a dangerous tool that is ultimately prejudicial and systemically biased in its results. We left these features in our data precisely so that we could demonstrate and underline this problem.
To avoid such a problem in the future, removing general demographic data from a dataset may be the best course of action, and the data scientist should advise the client on better data collection practices.
More specifically, rather than collecting demographic data wholesale, it may be more useful to detect holes in best practices: for example, recording whether the institution accommodated the beneficiary’s preferred language, as a measure of whether it was able to appropriately obtain the patient’s consent.
Ensembling a Consensus: Using the Power of Multiple Models
While choosing a single well-performing model has value, we generated 10 models with over 90% accuracy and can involve all of them in producing even clearer predictions. Typical ensembling methods operate within a single model; by gathering the collected decisions and determinations of all 10 models, we can instead build an ensemble that produces a higher degree of certainty in our predictions by taking a consensus.
Since these models use a wide variety of machine learning approaches, their combined intuition can help reduce bias found in any one model or method. As such, we collected how many models identified a Provider as fraudulent as well as the mean of our models’ fraud probability predictions. This gave us two ways to group our Providers - by vote, and by probability.
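The two grouping metrics reduce to a few lines of pandas; the providers and probabilities below are made up for illustration (and only 3 of the 10 models are shown):

```python
import pandas as pd

# Hypothetical fraud probabilities from 3 models for 4 providers.
probs = pd.DataFrame({
    "model_1": [0.95, 0.40, 0.80, 0.10],
    "model_2": [0.90, 0.60, 0.85, 0.05],
    "model_3": [0.99, 0.45, 0.30, 0.20],
}, index=["P1", "P2", "P3", "P4"])

votes = (probs > 0.5).sum(axis=1)  # how many models flag each provider
mean_prob = probs.mean(axis=1)     # average predicted fraud probability

# Rank providers by vote count first, then by mean probability within each tier.
consensus = (pd.DataFrame({"Votes": votes, "MeanProb": mean_prob})
             .sort_values(["Votes", "MeanProb"], ascending=False))
```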
Grouping the results by weight of model votes offers another utility for a client: a tiered list of fraud likelihood. If 10/10 models with different methods and approaches all agree that a Provider is fraudulent, that instills far more certainty in the ML predictions than trusting a single black box.
Furthermore, it is notable that over 50 Providers were identified as fraudulent by only 1 of the 10 models; had that happened to be the sole model used, one would have no context for how unlikely the prediction was without the other 9 models for comparison. With this ensembling approach, we can help eliminate the costly investigation of false positives.
On the other hand, suppose a client only has the resources to investigate 15 Providers, yet 18 Providers are flagged as fraudulent by all 10 models. How would one choose? Our approach offers a second ensembling tool for distinguishing the likelihood of fraud: the mean probability of fraud across the models for each Provider. Within our 10/10 group, there is a wide range of mean probability scores, from as low as .869 to as high as .991.
Ordering the Providers within each grouping by fraud probability gives one a prioritized order in which to pursue them. Additionally, some 9/10 Providers (and even one 8/10 Provider) have higher mean fraud probability scores than some 10/10 Providers. Using these two ensembling metrics together can help a client feel confident in the predictive power of all the models we tuned and refined in this project.
In the future, we could peer even more deeply into our models by incorporating Shapley values to analyze which features have the greatest impact (both positive and negative) on fraud predictions. This information could inform our client’s policies pertaining to healthcare institutions by helping to prevent future fraud proactively.
It may also be worth rebalancing our original dataset in a proportion that would reduce the over-prediction found in the 4 models that did not respond well to our 50/50 SMOTE rebalancing. By selecting a less aggressive target between our original 11/89 split and the 50/50 split, such as a 30/70 split, we may be able to optimize those 4 models and achieve something closer to the expected 9.35% rate of Provider fraudulence. This, in turn, would give us more viable models to use in our ensembling methods.
Conclusions & Analysis
Through the course of our project, we found that HyperTransformation Tuning and HyperParameter Tuning, combined with consensus-backed rankings, provide both high-accuracy results and actionable, budget-conscious options for making prudent business decisions. Using this procedure, a client company would be able to predict and pursue cases of suspected fraud while yielding the highest net profit from damages and avoiding false-flag costs.
By keeping an eye on the larger picture and utilizing all of our machine learning tools in a collective harmony, we were able to create a comprehensive fraud detection system that can serve as a template for other fraud detection and machine learning endeavors in the future.