Capstone Project: An Investigation and Cost/Benefit Analysis of Insurance Fraud In Medicare

Posted on Oct 15, 2021

Background

According to Blue Cross Blue Shield, approximately 3-10% of US healthcare spending, or $68-$230 billion, is lost to fraudulent healthcare claims and their management. This is especially detrimental to those who rely on government assistance through programs like Medicare. Government resources are already limited, and alleviating the pressures caused by fraud would free up more of them to help those in need. Fraudulent claims come in a variety of forms, including but not limited to: billing for services that were not provided, submitting duplicate claims for the same service, misrepresenting the service provided, charging for a more complex or expensive service than was actually performed, and even billing for a covered service when the service actually provided was not covered.

The Problem at Hand

This project is a proof of concept for using data science to improve upon existing fraud detection models and subsequently save government programs like Medicare millions of dollars in insurance fraud management. You may find the code for this project on my Github. Utilizing historical data from Medicare itself, exploratory data analysis and modeling were performed, and areas for improvement upon the existing system were brought to light. The Medicare data used for this analysis comes from a problem hosted on Kaggle, whose purpose is to identify problematic providers who consistently submit fraudulent claims to Medicare and those who are more likely to commit healthcare fraud. Understanding their behavioral patterns, along with how they relate to inpatient/outpatient care and claims, can help the healthcare system save resources and devote them to the people who need them. The sample datasets contain over 1,300 unique providers and ~140,000 unique beneficiaries, who submitted over 500,000 claims between November 2008 and December 2009. Data categories cover areas like deductible paid, reimbursement amounts, provider identifiers, medical history, and other insurance-related descriptors.

Exploring the Data

The dataset was conveniently broken up into categories that included things like the amount reimbursed per claim and whether or not the claim was flagged for fraud. One of the first things that stood out once the data was cleaned was that, of the total funds destined for claims during the sample year, more than half would have gone to fraudulent claims. The difference isn't large enough to be statistically significant, but it is certainly large enough to warrant further investigation.

As we continue to look at the differences between fraudulent and non-fraudulent claims, it is visually clear that the average claim amount flagged as fraud is larger than the average non-fraud claim. This makes sense realistically: a provider would want to make as much money as possible on a fraudulent claim, so on average you'd expect those amounts to be higher. Once again, these differences did not turn out to be statistically significant, due to the very large variance in the means of the two classes, but the average claim amounts can be used later in the analysis to gauge the performance of this investigation. Similarly, the total number of non-fraud claims was also visually higher than the number of claims flagged as fraudulent.
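This kind of comparison is a one-line aggregation in pandas. A minimal sketch, using a hypothetical miniature of the claims table (the real column names in the Kaggle data may differ):

```python
import pandas as pd

# Hypothetical stand-in for the cleaned claims table.
claims = pd.DataFrame({
    "Provider": ["P1", "P1", "P2", "P3", "P3", "P3"],
    "InscClaimAmtReimbursed": [26000, 5000, 400, 700, 900, 300],
    "PotentialFraud": ["Yes", "Yes", "No", "No", "No", "No"],
})

# Compare mean and total reimbursement, and claim counts,
# for flagged vs. unflagged claims.
summary = claims.groupby("PotentialFraud")["InscClaimAmtReimbursed"].agg(
    ["mean", "sum", "count"]
)
print(summary)
```

On the full dataset, the `mean` column is the per-class average claim amount discussed above, and `sum` gives the total funds at stake per class.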

In the above figures, the proportions of fraud and non-fraud claims are organized by procedure code (left) and diagnosis code (right). The procedure code figure shows the top 10 codes by the money involved in the transaction. It is clear that for these codes, claims are flagged as fraudulent more often than not. Given that these transactions involve the most money, this makes sense: if a crime is being committed intentionally, one would want to reap the most reward from the transaction. In the diagnosis code figure, organized in a similar manner, non-fraud claims are more prevalent than fraudulent ones. This also makes sense realistically, as there is likely more money to be made treating a condition than diagnosing it, so fraudulent claims may be less common in that regard.

If we take a look at the top 20 physicians, by code, plotted against claim count (fraud vs. non-fraud), we can see that a number of physicians have high counts of claims flagged as fraudulent. It's possible this is simply due to the procedures they commonly perform, the particular field they are in, or even the practice they work for, but it does show that there are areas where fraud is clear, and these areas can theoretically be weeded out systematically.

Although there is a very large variance in the means between fraud and non-fraud claims, when claims are plotted against providers as a whole (above), we can clearly see that fraudulent claims are buried in the mountain of verifiable ones. There isn't necessarily a cut-and-dried way to distinguish the claim types from one another, and as investigators, we have to be clever in our tool use.

Feature Engineering

Utilizing NetworkX, a ratio of physician connections to patient connections was established to determine whether providers had a specific ratio that would flag them as committing fraud.

Unsupervised learning tools like NetworkX can be used to establish connections between data points that aren't readily available. In this case, it was used to form a ratio of the number of physicians associated with each provider to the number of patients associated with each provider. This ratio, visualized in the figures above, shows that for fraudulent claims, providers tend to be connected to more patients than physicians; this also holds true for non-fraud claims, just to a lesser extent. Although this difference was not statistically significant, all the ratios were plotted on a KDE (bottom left) and by claim type (bottom right) to see if some kind of bimodal shape would form, from which a ratio threshold for determining fraud could be established. Unfortunately, the figures only further established how hard the task at hand is: the fraudulent claims are so deeply ingrained and mixed in with the non-fraud claims that their density shapes almost completely overlap.
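The write-up built the connection ratio with NetworkX, but the same per-provider quantity can be sketched with a plain pandas aggregation. A minimal example, assuming hypothetical column names linking each claim to a provider, physician, and patient:

```python
import pandas as pd

# Hypothetical claim-level records; real data links each claim to a
# provider, an attending physician, and a beneficiary (patient).
claims = pd.DataFrame({
    "Provider":  ["P1", "P1", "P1", "P2", "P2"],
    "Physician": ["D1", "D1", "D2", "D3", "D4"],
    "Patient":   ["B1", "B2", "B3", "B4", "B4"],
})

# Count distinct physicians and patients connected to each provider.
per_provider = claims.groupby("Provider").agg(
    n_physicians=("Physician", "nunique"),
    n_patients=("Patient", "nunique"),
)
# Physician-to-patient connection ratio per provider.
per_provider["ratio"] = per_provider["n_physicians"] / per_provider["n_patients"]
print(per_provider)
```

These per-provider ratios are what would then be plotted as a KDE and split by claim type to look for a threshold.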

In addition to the NetworkX ratio, other feature engineering steps included stratifying the target variable for equal class distribution across a 70/30 train/test split. The training data was oversampled using SMOTE, and model accuracy improved once the classes were balanced. Unnecessary categories were removed, as they were only used for grouping and contributed no further information to the models. Fields like Beneficiary ID, Claim ID, operating and specific attending physician/diagnosis codes, and admission and discharge dates ranked low in feature importance during the investigation.
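The split-and-balance step can be sketched as follows. The project used SMOTE (from the imbalanced-learn package) to synthesize minority-class samples; plain random oversampling, shown here with scikit-learn only, has the same class-balancing effect on toy data:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Toy imbalanced data standing in for the claim features (X) and fraud flag (y).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = np.array([0] * 80 + [1] * 20)

# 70/30 split, stratified so both sets keep the 80/20 class mix.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

# Oversample the minority (fraud) class up to the majority count.
n_major = int((y_tr == 0).sum())
upsampled = resample(X_tr[y_tr == 1], n_samples=n_major, random_state=42)
X_bal = np.vstack([X_tr[y_tr == 0], upsampled])
y_bal = np.array([0] * n_major + [1] * n_major)
print(np.bincount(y_bal))  # balanced classes
```

With imbalanced-learn installed, the `resample` step would be replaced by `SMOTE().fit_resample(X_tr, y_tr)`, which interpolates new minority samples instead of duplicating existing ones.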

Data Analysis: Logistic Regression & Random Forest

Visualization of Logistic Regression Machine Learning Model Results

Since the main feature of the data we are analyzing is binary (Fraud/Not Fraud), Logistic Regression was a natural place to start, as it is well suited to binary classification. After tuning the model to its best parameters using GridSearch, it handled the training data fairly well. Its detection of true positives and negatives was fairly high in both the training and validation sets; when it came to the F-score, however, the model did not do as well on unseen data. This tends to happen when a model is good at finding patterns in the training data because it has a lot to work with, but struggles with a newer or smaller set of data, a problem known as overfitting. This type of issue can sometimes be mitigated with certain techniques, like increasing the sample size or randomly removing features from the data programmatically, which forces the model to find new patterns.
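A minimal sketch of the tune-and-fit step, on synthetic data standing in for the engineered claim features (the actual parameter grid used in the project is not specified, so the grid below is illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic, imbalanced stand-in for the engineered claim features.
X, y = make_classification(
    n_samples=500, n_features=8, weights=[0.8, 0.2], random_state=0
)

# Small illustrative grid over regularization strength.
grid = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1, 10]},
    scoring="f1",  # F-score: the metric where the model struggled on unseen data
    cv=5,
)
grid.fit(X, y)
print(grid.best_params_, round(grid.best_score_, 3))
```

Scoring the search on F1 rather than accuracy is the natural choice here, since accuracy alone can look good on imbalanced fraud data even when few fraudulent claims are caught.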

If we take a look at the measures of importance taken from the LR model, the top four features that stood out were:

  1. Per Provider Average Insurance Claim Amount Reimbursed
  2. Insurance Claim Amount Reimbursed
  3. Per Claim Diagnosis Code 2 Average Insurance Claim Amount Reimbursed
  4. Per Attending Physician Average Insurance Claim Amount Reimbursed

These all make sense for the most part: we would expect fraud to revolve around the claim amounts coming from physicians and individual providers. The model appears to be doing its job.
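For logistic regression, a common stand-in for feature importance is the absolute size of each coefficient (assuming standardized inputs). A sketch, using hypothetical feature names that mirror the ranking reported above:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Hypothetical feature names mirroring the reported top features.
names = np.array([
    "PerProviderAvgReimbursed",
    "ClaimAmtReimbursed",
    "PerDiagCode2AvgReimbursed",
    "PerAttendingPhysAvgReimbursed",
])
X, y = make_classification(n_samples=300, n_features=4, random_state=1)

model = LogisticRegression(max_iter=1000).fit(X, y)
# Rank features by absolute coefficient magnitude.
order = np.argsort(-np.abs(model.coef_[0]))
for name, coef in zip(names[order], model.coef_[0][order]):
    print(f"{name}: {coef:+.3f}")
```

The sign of each coefficient additionally indicates whether a larger value pushes a claim toward or away from the fraud class.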

Heat Map Showing The Accuracy Results of LR Model

Despite the model currently being overfit, it performed well in most respects. The rate of actual fraud flagged and the rate of valid claims flagged were both fair and acceptable. The main concern is the rate of false negatives, which may be a prominent source of error. Realistically, any false-positive claims flagged by future versions of this model can be verified with providers on a case-by-case basis, although that number should come down with further tuning as well.

Visualization of Random Forest Machine Learning Model Results

A Random Forest model was developed alongside the LR model for comparison, and it performed similarly. It did fairly well in all respects after tuning with GridSearch and yielded a slightly different yet broadly similar ranking of feature importance. It too, however, suffered from a lower F-score on validation, indicating it is also overfit.
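Unlike logistic regression, random forests expose impurity-based importances directly, which is how a comparable ranking falls out of the second model. A minimal sketch on the same kind of synthetic stand-in data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic, imbalanced stand-in for the engineered claim features.
X, y = make_classification(
    n_samples=500, n_features=8, weights=[0.8, 0.2], random_state=0
)

# Impurity-based feature importances come for free after fitting.
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
print(rf.feature_importances_.round(3))  # sums to 1.0
```

Because the two models measure importance differently (coefficient magnitude vs. impurity reduction), agreement between their rankings, as reported above, is a useful sanity check on which features actually drive the fraud signal.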

Model Evaluations/Takeaway

This investigation resulted in some interesting takeaways about the current Medicare insurance fraud issue. The physician-to-patient ratio shows fraud is thoroughly mixed in with non-fraud, which highlights the difficulty of the problem at hand. Claims are difficult to distinguish from one another, and any criminal activity seems well hidden among the masses of data submitted to Medicare. Some strengths and weaknesses of the models developed:

  • Accuracy, sensitivity, and specificity are good, but F1 is low in validation.
  • False positives and false negatives both represent monetary value, so this effectiveness score is important.
  • The false-positive flag rate is higher than the false-negative rate, but those can theoretically be resolved on a case-by-case basis.
  • The true-positive flag rate is relatively high, which means the model is catching fraud properly.
  • The false-negative flag rate is a concerning source of error; we don't want fraud to go unnoticed.
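All of the metrics above derive from the four confusion-matrix counts. A quick sketch with hypothetical counts (not the project's exact numbers) makes the relationships explicit:

```python
# Confusion-matrix counts from a hypothetical run (illustrative only).
tp, fp, tn, fn = 460, 86, 430, 46

accuracy    = (tp + tn) / (tp + fp + tn + fn)
sensitivity = tp / (tp + fn)   # true-positive rate: fraud caught
specificity = tn / (tn + fp)   # true-negative rate: valid claims passed
precision   = tp / (tp + fp)
f1          = 2 * precision * sensitivity / (precision + sensitivity)

print(f"acc={accuracy:.3f} sens={sensitivity:.3f} "
      f"spec={specificity:.3f} f1={f1:.3f}")
```

This is also why F1 can sag while accuracy stays high: F1 ignores true negatives entirely, so it is far more sensitive to the false positives and false negatives that carry the monetary cost here.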

Cost/Benefit Analysis

While the error rates are concerning, the bottom line for government programs that support people in need is money: how much do these errors cost, and how much can be saved? Breaking down the issue, and how this model handled it, this is what we can say.

  • Originally, Medicare found 38% of claims submitted for reimbursement to be fraudulent. Taking the average fraudulent claim reimbursement, that's ~$1,300 a claim if none of them were caught.
    • Note that this average is based on the fraud/not-fraud flag, which occurs at the provider level. If one claim by a provider is assumed fraudulent, all of that provider's claims fall under the flag. This is not realistic, but it is acceptable for a proof of concept.
  • The training data contains ~558,211 total claims.
  • 38% of total claims is ~212,120 claims.
    • At an average of $1,300 per fraudulent claim, that's $275,756,234 saved if they're all caught.
  • The other 62% accounts for ~346,090 claims, or ~$242,263,700 at $700 per claim.
  • At those rates, the amount saved with Medicare's current fraud detection method is:
    • $275,756,234 - $242,263,700 = ~$33,492,534
  • Under the developed Logistic Regression machine learning model:
    • True negatives amount to 43%, or 240,030 claims, or $168,021,000 at a $700 average.
    • False negatives add 4.6%, or ~25,678 claims, or $33,381,018 at a $1,300 average.
    • That totals $201,402,018 paid out.
    • True positives amount to 46%, or 256,777 claims, or $333,810,100 saved at $1,300 each.
    • False positives account for 8.6%, or 50,586 claims, or $35,410,302 at a $700 average.
    • Adding up the amounts paid out gives ~$236,812,320.
    • The amount saved by catching fraud is ~$333,810,100 - $236,812,320 = ~$96,997,780.
      • These figures are based on the training set, as the validation set did not perform as well; with further tuning, however, these results are achievable.

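The arithmetic in the breakdown above can be verified in a few lines, using the dollar figures exactly as reported (the intermediate values were already rounded in the write-up):

```python
# Reimbursement figures as reported in the cost/benefit breakdown.
tp_saved = 333_810_100  # fraud caught: 46% of claims at ~$1,300 each
tn_paid  = 168_021_000  # valid claims correctly paid: 43% at ~$700
fn_paid  = 33_381_018   # fraud that slipped through: 4.6% at ~$1,300
fp_paid  = 35_410_302   # valid claims wrongly flagged: paid at ~$700

paid_out    = tn_paid + fn_paid + fp_paid
model_saved = tp_saved - paid_out

# Medicare's current method, per the figures above.
baseline_saved = 275_756_234 - 242_263_700

print(f"model:       ${model_saved:,}")                   # $96,997,780
print(f"baseline:    ${baseline_saved:,}")                # $33,492,534
print(f"improvement: ${model_saved - baseline_saved:,}")
```

Under these assumptions, the model's net savings exceed the baseline by roughly $63.5 million per sample year, which is the headline result of the analysis.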
Visualization of Cost Benefit Analysis of Machine Learning Model Results
