Capstone Project: An Investigation and Cost/Benefit Analysis of Insurance Fraud In Medicare

Posted on Oct 15, 2021

Background

According to Blue Cross Blue Shield, approximately 3-10% of US healthcare spending, or $68-$230 billion, is lost to fraudulent healthcare claims and their management. This is especially detrimental to those who rely on government assistance through programs like Medicare. Government resources are already limited, and alleviating the pressures caused by fraud would free up more of them to help those in need. Fraudulent claims come in a variety of forms, including but not limited to: billing for services that were not provided, submitting duplicate claims for the same service, misrepresenting the service provided, charging for a more complex or expensive service than was actually performed, and even billing for a covered service when the service actually provided was not covered.

The Problem at Hand

This project is a proof of concept for using data science to improve upon existing fraud detection models and subsequently save government programs like Medicare millions of dollars in insurance fraud management. You may find the code for this project on my Github. Utilizing historical data from Medicare itself, exploratory data analysis and modeling were performed, and areas for improvement upon the existing system were brought to light. The Medicare data used for this analysis comes from a problem hosted on Kaggle, whose purpose is to identify problematic providers who consistently submit fraudulent claims to Medicare and those who are more likely to commit healthcare fraud. Understanding their behavioral patterns, along with how they relate to inpatient/outpatient care and claims, can help the healthcare system save resources and devote them to the people who need them. The sample datasets contain over 1,300 unique providers and ~140,000 unique beneficiaries, who submitted over 500,000 claims between November 2008 and December 2009. Data categories cover areas like deductible paid, reimbursement amounts, provider identifiers, medical history, and other insurance-related descriptors.

Exploring the Data

The dataset was conveniently broken up into categories that included things like the amount reimbursed per claim and whether or not the claim was flagged for fraud. One of the first things that stood out once the data was cleaned was that, of the total funds destined for claims during the sample year, more than half would have gone to fraudulent claims. The difference isn't large enough to be statistically significant, but it is certainly large enough to warrant further investigation.

As we continue to look at the differences between fraudulent and non-fraudulent claims, it is visually clear that the average claim amount flagged as fraud is larger than the average non-fraud claim. This makes sense realistically: a provider would want to make as much money as possible on a fraudulent claim, so on average you'd expect those amounts to be higher. Once again, these differences did not turn out to be statistically significant, due to the very large variance in the means of the two classes, but the average claim amounts can be used later in the analysis to gauge the performance of this investigation. Similarly, the total number of non-fraud claims was also visually higher than the number of claims flagged as fraudulent.
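This kind of comparison is a one-line aggregation in pandas. A minimal sketch, using a hypothetical miniature of the claims table (the real column names in the Kaggle data may differ):

```python
import pandas as pd

# Hypothetical stand-in for the cleaned claims table.
claims = pd.DataFrame({
    "Provider": ["P1", "P1", "P2", "P3", "P3", "P3"],
    "InscClaimAmtReimbursed": [26000, 5000, 400, 700, 900, 300],
    "PotentialFraud": ["Yes", "Yes", "No", "No", "No", "No"],
})

# Compare mean and total reimbursement, and claim counts,
# for flagged vs. unflagged claims.
summary = claims.groupby("PotentialFraud")["InscClaimAmtReimbursed"].agg(
    ["mean", "sum", "count"]
)
print(summary)
```

On the full dataset, the `mean` column is the per-class average claim amount discussed above, and `sum` gives the total funds at stake per class.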

In the above figures, the proportions of fraud and non-fraud claims are organized by procedure code (left) and diagnosis code (right). The procedure code figure shows the top 10 codes by the money involved in the transaction. It is clear that for these codes, claims are flagged as fraudulent more often than not. Given that these transactions involve the most money, this makes sense: if a crime is being committed intentionally, one would want to reap the most reward from the transaction. In the diagnosis code figure, organized in a similar manner, non-fraud claims are more prevalent than fraudulent ones. This also makes sense realistically, as there is likely more money to be made treating a condition than diagnosing it, so fraudulent claims may be less common in that regard.

If we take a look at the top 20 physicians, by code, plotted against claim count (fraud vs. non-fraud), we can see that a number of physicians have high counts of claims flagged as fraudulent. It's possible this is simply due to the procedures they commonly perform, the particular field they are in, or even the practice they work for, but it does show that there are areas where fraud is clear, and these areas can theoretically be weeded out systematically.

Although there is a very large variance in the means between fraud and non-fraud claims, when claims are plotted against providers as a whole (above), we can clearly see that fraudulent claims are buried in the mountain of verifiable ones. There isn't necessarily a cut-and-dried way to distinguish the claim types from one another, and as investigators, we have to be clever in our tool use.

Feature Engineering

Utilizing NetworkX, a ratio of physician connections to patient connections was established to determine whether providers had a specific ratio that would flag them as committing fraud.

Unsupervised learning tools like NetworkX can be used to establish connections between data points that aren't readily available. In this case, it was used to form a ratio of the number of physicians associated with each provider to the number of patients associated with each provider. This ratio, visualized in the figures above, shows that for fraudulent claims, providers tend to be connected to more patients than physicians; this also holds true for non-fraud claims, just to a lesser extent. Although this difference was not statistically significant, all the ratios were plotted on a KDE (bottom left) and by claim type (bottom right) to see if some kind of bimodal shape would form, from which a ratio threshold for determining fraud could be established. Unfortunately, the figures only further established how hard the task at hand is: the fraudulent claims are so deeply ingrained and mixed in with the non-fraud claims that their density shapes almost completely overlap.
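The write-up built the connection ratio with NetworkX, but the same per-provider quantity can be sketched with a plain pandas aggregation. A minimal example, assuming hypothetical column names linking each claim to a provider, physician, and patient:

```python
import pandas as pd

# Hypothetical claim-level records; real data links each claim to a
# provider, an attending physician, and a beneficiary (patient).
claims = pd.DataFrame({
    "Provider":  ["P1", "P1", "P1", "P2", "P2"],
    "Physician": ["D1", "D1", "D2", "D3", "D4"],
    "Patient":   ["B1", "B2", "B3", "B4", "B4"],
})

# Count distinct physicians and patients connected to each provider.
per_provider = claims.groupby("Provider").agg(
    n_physicians=("Physician", "nunique"),
    n_patients=("Patient", "nunique"),
)
# Physician-to-patient connection ratio per provider.
per_provider["ratio"] = per_provider["n_physicians"] / per_provider["n_patients"]
print(per_provider)
```

These per-provider ratios are what would then be plotted as a KDE and split by claim type to look for a threshold.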

In addition to the NetworkX ratio, other feature engineering steps included stratifying the target variable for equal class distribution across a 70/30 train/test split. The training data was oversampled using SMOTE, and model accuracy improved once the classes were balanced. Unnecessary categories were removed, as they were only used for grouping and contributed no further information to the models. Fields like Beneficiary ID, Claim ID, operating and specific attending physician/diagnosis codes, and admission and discharge dates ranked low in feature importance during the investigation.
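The split-and-balance step can be sketched as follows. The project used SMOTE (from the imbalanced-learn package) to synthesize minority-class samples; plain random oversampling, shown here with scikit-learn only, has the same class-balancing effect on toy data:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Toy imbalanced data standing in for the claim features (X) and fraud flag (y).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = np.array([0] * 80 + [1] * 20)

# 70/30 split, stratified so both sets keep the 80/20 class mix.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

# Oversample the minority (fraud) class up to the majority count.
n_major = int((y_tr == 0).sum())
upsampled = resample(X_tr[y_tr == 1], n_samples=n_major, random_state=42)
X_bal = np.vstack([X_tr[y_tr == 0], upsampled])
y_bal = np.array([0] * n_major + [1] * n_major)
print(np.bincount(y_bal))  # balanced classes
```

With imbalanced-learn installed, the `resample` step would be replaced by `SMOTE().fit_resample(X_tr, y_tr)`, which interpolates new minority samples instead of duplicating existing ones.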

Data Analysis: Logistic Regression & Random Forest

Visualization of Logistic Regression Machine Learning Model Results

Since the main feature of the data we are analyzing is binary (Fraud/Not Fraud), Logistic Regression was a natural place to start, as it is well suited to binary classification. After tuning the model to its best parameters using GridSearch, it handled the training data fairly well. Its detection of true positives and negatives was fairly high in both the training and validation sets; when it came to the F-score, however, the model did not do as well on unseen data. This tends to happen when a model is good at finding patterns in the training data because it has a lot to work with, but struggles with a newer or smaller set of data, a problem known as overfitting. This type of issue can sometimes be mitigated with certain techniques, like increasing the sample size or randomly removing features from the data programmatically, which forces the model to find new patterns.
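A minimal sketch of the tune-and-fit step, on synthetic data standing in for the engineered claim features (the actual parameter grid used in the project is not specified, so the grid below is illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic, imbalanced stand-in for the engineered claim features.
X, y = make_classification(
    n_samples=500, n_features=8, weights=[0.8, 0.2], random_state=0
)

# Small illustrative grid over regularization strength.
grid = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1, 10]},
    scoring="f1",  # F-score: the metric where the model struggled on unseen data
    cv=5,
)
grid.fit(X, y)
print(grid.best_params_, round(grid.best_score_, 3))
```

Scoring the search on F1 rather than accuracy is the natural choice here, since accuracy alone can look good on imbalanced fraud data even when few fraudulent claims are caught.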

If we take a look at the measures of importance taken from the LR model, the top four features that stood out were:

  1. Per Provider Average Insurance Claim Amount Reimbursed
  2. Insurance Claim Amount Reimbursed
  3. Per Claim Diagnosis Code 2 Average Insurance Claim Amount Reimbursed
  4. Per Attending Physician Average Insurance Claim Amount Reimbursed

These all make sense for the most part: we would expect fraud to revolve around the claim amounts coming from physicians and individual providers. The model appears to be doing its job.
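For logistic regression, a common stand-in for feature importance is the absolute size of each coefficient (assuming standardized inputs). A sketch, using hypothetical feature names that mirror the ranking reported above:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Hypothetical feature names mirroring the reported top features.
names = np.array([
    "PerProviderAvgReimbursed",
    "ClaimAmtReimbursed",
    "PerDiagCode2AvgReimbursed",
    "PerAttendingPhysAvgReimbursed",
])
X, y = make_classification(n_samples=300, n_features=4, random_state=1)

model = LogisticRegression(max_iter=1000).fit(X, y)
# Rank features by absolute coefficient magnitude.
order = np.argsort(-np.abs(model.coef_[0]))
for name, coef in zip(names[order], model.coef_[0][order]):
    print(f"{name}: {coef:+.3f}")
```

The sign of each coefficient additionally indicates whether a larger value pushes a claim toward or away from the fraud class.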

Heat Map Showing The Accuracy Results of LR Model

Despite the model currently being overfit, it performed well in most respects. The rate of actual fraud flagged and the rate of valid claims flagged were both fair and acceptable. The main concern is the rate of false negatives, which may be a prominent source of error. Realistically, any false-positive claims flagged by future versions of this model can be verified with providers on a case-by-case basis, although that number should come down with further tuning as well.

Visualization of Random Forest Machine Learning Model Results

A Random Forest model was developed alongside the LR model for comparison, and it performed similarly. It did fairly well in all respects after tuning with GridSearch and yielded a slightly different yet broadly similar ranking of feature importance. It too, however, suffered from a lower F-score on validation, indicating it is also overfit.
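Unlike logistic regression, random forests expose impurity-based importances directly, which is how a comparable ranking falls out of the second model. A minimal sketch on the same kind of synthetic stand-in data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic, imbalanced stand-in for the engineered claim features.
X, y = make_classification(
    n_samples=500, n_features=8, weights=[0.8, 0.2], random_state=0
)

# Impurity-based feature importances come for free after fitting.
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
print(rf.feature_importances_.round(3))  # sums to 1.0
```

Because the two models measure importance differently (coefficient magnitude vs. impurity reduction), agreement between their rankings, as reported above, is a useful sanity check on which features actually drive the fraud signal.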

Model Evaluations/Takeaway

This investigation resulted in some interesting takeaways about the current Medicare insurance fraud issue. The physician-to-patient ratio shows fraud is thoroughly mixed in with non-fraud, which highlights the difficulty of the problem at hand. Claims are difficult to distinguish from one another, and any criminal activity seems well hidden among the masses of data submitted to Medicare. Some strengths and weaknesses of the models developed:

  • Accuracy, sensitivity, and specificity are good, but F1 is low in validation.
  • False positives and false negatives both represent monetary value, so this effectiveness score is important.
  • The false-positive flag rate is higher than the false-negative rate, but those can theoretically be resolved on a case-by-case basis.
  • The true-positive flag rate is relatively high, which means the model is catching fraud properly.
  • The false-negative flag rate is a concerning source of error; we don't want fraud to go unnoticed.
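All of the metrics above derive from the four confusion-matrix counts. A quick sketch with hypothetical counts (not the project's exact numbers) makes the relationships explicit:

```python
# Confusion-matrix counts from a hypothetical run (illustrative only).
tp, fp, tn, fn = 460, 86, 430, 46

accuracy    = (tp + tn) / (tp + fp + tn + fn)
sensitivity = tp / (tp + fn)   # true-positive rate: fraud caught
specificity = tn / (tn + fp)   # true-negative rate: valid claims passed
precision   = tp / (tp + fp)
f1          = 2 * precision * sensitivity / (precision + sensitivity)

print(f"acc={accuracy:.3f} sens={sensitivity:.3f} "
      f"spec={specificity:.3f} f1={f1:.3f}")
```

This is also why F1 can sag while accuracy stays high: F1 ignores true negatives entirely, so it is far more sensitive to the false positives and false negatives that carry the monetary cost here.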

Cost/Benefit Analysis

While the error rates are concerning, the bottom line for government programs that support people in need is money: how much do these errors cost, and how much can be saved? Breaking down the issue, and how this model handled it, this is what we can say.

  • Originally, Medicare found 38% of claims submitted for reimbursement to be fraudulent. Taking the average fraudulent claim reimbursement, that's ~$1,300 a claim if none of them were caught.
    • Note that this average is based on the fraud/not-fraud flag, which occurs at the provider level. If one claim by a provider is assumed fraudulent, all of that provider's claims fall under the flag. This is not realistic, but it is acceptable for a proof of concept.
  • The training data contains ~558,211 total claims.
  • 38% of total claims is ~212,120 claims.
    • At an average of $1,300 per fraudulent claim, that's $275,756,234 saved if they're all caught.
  • The other 62% accounts for ~346,090 claims, or ~$242,263,700 at $700 per claim.
  • At those rates, the amount saved with Medicare's current fraud detection method is:
    • $275,756,234 - $242,263,700 = ~$33,492,534
  • Under the developed Logistic Regression machine learning model:
    • True negatives amount to 43%, or 240,030 claims, or $168,021,000 at a $700 average.
    • False negatives add 4.6%, or ~25,678 claims, or $33,381,018 at a $1,300 average.
    • That totals $201,402,018 paid out.
    • True positives amount to 46%, or 256,777 claims, or $333,810,100 saved at $1,300 each.
    • False positives account for 8.6%, or 50,586 claims, or $35,410,302 at a $700 average.
    • Adding up the amounts paid out gives ~$236,812,320.
    • The amount saved by catching fraud is ~$333,810,100 - $236,812,320 = ~$96,997,780.
      • These figures are based on the training set, as the validation set did not perform as well; with further tuning, however, these results are achievable.

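The arithmetic in the breakdown above can be verified in a few lines, using the dollar figures exactly as reported (the intermediate values were already rounded in the write-up):

```python
# Reimbursement figures as reported in the cost/benefit breakdown.
tp_saved = 333_810_100  # fraud caught: 46% of claims at ~$1,300 each
tn_paid  = 168_021_000  # valid claims correctly paid: 43% at ~$700
fn_paid  = 33_381_018   # fraud that slipped through: 4.6% at ~$1,300
fp_paid  = 35_410_302   # valid claims wrongly flagged: paid at ~$700

paid_out    = tn_paid + fn_paid + fp_paid
model_saved = tp_saved - paid_out

# Medicare's current method, per the figures above.
baseline_saved = 275_756_234 - 242_263_700

print(f"model:       ${model_saved:,}")                   # $96,997,780
print(f"baseline:    ${baseline_saved:,}")                # $33,492,534
print(f"improvement: ${model_saved - baseline_saved:,}")
```

Under these assumptions, the model's net savings exceed the baseline by roughly $63.5 million per sample year, which is the headline result of the analysis.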
Visualization of Cost Benefit Analysis of Machine Learning Model Results
