A Data Investigation of Healthcare Insurance Fraud

Posted on Oct 15, 2021
The skills I demoed here can be learned through taking Data Science with Machine Learning bootcamp with NYC Data Science Academy.


According to Blue Cross Blue Shield data, approximately 3-10% of US healthcare spending, or $68-$230 billion, goes to fraudulent healthcare claims and their management. This is especially detrimental for those who rely on government assistance through programs like Medicare. Government resources are already limited, and alleviating the pressure that fraud creates may free up more resources to help those in need.

These fraudulent claims come in a variety of forms, including but not limited to: billing for services that were not provided, duplicate submission of a claim for the same service, misrepresenting the service provided, charging for a more complex or expensive service than was actually provided, and billing for a covered service when the service actually provided was not covered.

The Problem at Hand

This project is a proof of concept for using data science to improve upon existing fraud detection models and, in turn, save government programs like Medicare millions of dollars in insurance fraud management. You may find the code for this project on my GitHub.

Using historical Medicare data, exploratory data analysis and modeling were performed, and areas for improvement upon the existing system were brought to light. The Medicare data used for this analysis comes from a problem hosted on Kaggle, whose purpose is to identify problematic providers who consistently submit fraudulent claims to Medicare, as well as those who are more likely to commit healthcare fraud. Understanding their behavioral patterns, along with how they relate to inpatient/outpatient care and claims, can help the healthcare system save resources and devote them to the people who need them.

Within the sample datasets, there are over 1300 unique providers and ~140,000 unique beneficiaries, who together account for over 500,000 claims made between November 2008 and December 2009. Data categories include deductible paid, reimbursement amounts, provider identifiers, medical history and other insurance-related descriptors.

Exploring the Data

Capstone Project: A Data Investigation of Insurance Fraud

The dataset was conveniently broken up into categories that included the amount reimbursed per claim and whether or not the claim was flagged for fraud. One of the first things that stood out once the data was cleaned up was that, of the total amount of money that was to go to claims during the sample year, more than half would have gone to fraudulent claims. The difference isn't large enough to be statistically significant, but it is definitely large enough to warrant further investigation.
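
The split of reimbursed dollars between flagged and unflagged claims is a simple grouped sum. Here is a minimal sketch using only the standard library; the `claims` records are made-up values for illustration, not rows from the Medicare data.

```python
from collections import defaultdict

# Hypothetical (amount_reimbursed, flagged_as_fraud) pairs; not real data.
claims = [(1200, True), (300, False), (2500, True), (700, False), (300, False)]

# Sum reimbursed dollars separately for fraud-flagged and unflagged claims.
totals = defaultdict(int)
for amount, is_fraud in claims:
    totals[is_fraud] += amount

grand_total = sum(totals.values())
fraud_share = totals[True] / grand_total
print(f"Share of reimbursed dollars on flagged claims: {fraud_share:.0%}")
```

On a real claims table the same aggregation would normally be a single group-by over the fraud flag.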

Differences Between Fraudulent and Non-Fraudulent


As we continue to look at the differences between fraudulent and non-fraudulent claims, it is visually clear that the average claim flagged as fraud is larger than the average non-fraud claim. This makes sense, as a provider would want to make as much money as possible on a fraudulent claim, so on average you would expect those amounts to be higher.

Once again, these differences did not turn out to be statistically significant, due to the very large variance in claim amounts within the two classes, but the average claim amounts can be used later in the analysis to gauge the performance of this investigation. Similarly, the total number of non-fraud claims was also visibly higher than the number of claims flagged as fraudulent.
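
To see how a large spread can hide a sizable gap in means, here is a minimal sketch of Welch's t statistic. The post does not say which test was actually used, so this is only one common choice. The amounts are hypothetical, picked so the group means match the ~$1300 and ~$700 averages quoted later in the post, and the p-value uses a normal approximation that is only reasonable for large samples.

```python
from math import sqrt
from statistics import NormalDist, mean, variance

# Hypothetical reimbursement amounts in dollars (not real data).
# Means are 1300 (fraud) and 700 (non-fraud); both groups are very spread out.
fraud = [100, 200, 5000, 300, 900]
non_fraud = [50, 100, 3000, 150, 200]

def welch_t(a, b):
    """Welch's t statistic for two samples with unequal variances."""
    standard_error = sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / standard_error

t_stat = welch_t(fraud, non_fraud)
# Two-sided p-value under a normal approximation (an assumption; a proper
# test would use the t distribution with Welch's degrees of freedom).
p_value = 2 * (1 - NormalDist().cdf(abs(t_stat)))
print(f"t = {t_stat:.2f}, p = {p_value:.2f}")
```

Even with a $600 gap between the means, the standard error in this toy sample is so large that the difference does not come close to significance.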

Data of Procedure and Diagnosis Code

In the above figures, the proportion of fraud/non-fraud claims is organized by procedure code (left) and diagnosis code (right). The procedure code figure shows the top 10 codes by the amount of money involved. For these codes, claims are flagged as fraudulent more often than not. Given that these transactions involve the most money, this makes sense: if a crime is being committed intentionally, one would want to reap the largest reward from it.

In the diagnosis code figure, organized in a similar manner, non-fraud claims are more prevalent than fraudulent ones. This also makes some sense, as there is likely more money to be made treating something than diagnosing it, so fraudulent claims may be less common in that regard.

If we take a look at the top 20 physicians, by code, plotted against claim count (fraud vs non-fraud), we can see that a number of physicians have high numbers of claims flagged as fraudulent. It's possible this is just due to the procedures they commonly perform, the particular field they are in, or even the practice they work for, but it does show that there are areas where fraud is clear, and these areas can theoretically be weeded out systematically.

Although there is a very large variance in claim amounts for both fraud and non-fraud claims, when claims are plotted against providers as a whole (above), we can clearly see that fraudulent claims are buried in the mountain of verifiable ones. There isn't necessarily a cut-and-dried way to distinguish the claim types from one another, and as investigators we have to be clever in our tool use.

Data on Feature Engineering

Utilizing NetworkX, a ratio of Physician Connections to Patient Connections was established to determine if providers had a specific ratio that would flag them as committing fraud.

Graph analysis tools like NetworkX can be used to establish connections between data points that aren't readily apparent. In this case, it was used to form a ratio of the number of physicians associated with each provider to the number of patients associated with each provider. This ratio, visualized in the figures above, shows that for fraudulent claims, providers tend to be connected with more patients than physicians. However, this also holds true for non-fraud claims, just to a lesser extent.
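
The ratio itself reduces to counting distinct neighbors per provider. Here is a minimal library-free sketch of that idea; the post built the connections with NetworkX, so this is a stand-in, and the record layout and identifiers below are hypothetical.

```python
from collections import defaultdict

# Hypothetical claim records: (provider, attending_physician, beneficiary).
claims = [
    ("PRV1", "PHY1", "BENE1"),
    ("PRV1", "PHY1", "BENE2"),
    ("PRV1", "PHY2", "BENE3"),
    ("PRV1", "PHY2", "BENE4"),
    ("PRV2", "PHY3", "BENE5"),
    ("PRV2", "PHY4", "BENE5"),
]

# Collect the distinct physicians and patients connected to each provider.
physicians = defaultdict(set)
patients = defaultdict(set)
for provider, physician, beneficiary in claims:
    physicians[provider].add(physician)
    patients[provider].add(beneficiary)

# Physician-connections to patient-connections ratio, one value per provider.
ratio = {p: len(physicians[p]) / len(patients[p]) for p in physicians}
print(ratio)
```

In a graph library, the same counts would come from the number of physician and patient nodes adjacent to each provider node.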

Although this difference was not statistically significant, all the ratios were plotted on a KDE (bottom left) and by claim type (bottom right) to see if some kind of bimodal shape would form, from which a ratio threshold for determining fraud could be established. Unfortunately, the figures only further established how hard the task at hand is. The fraudulent claims are so thoroughly mixed in with non-fraud claims that their density shapes almost completely overlap.

In addition to the NetworkX ratio, other preparation steps included a 70/30 train/test split, stratified on the target (Y) so that both sets had the same class distribution. The training data was then oversampled with SMOTE, and model accuracy improved once the classes were balanced. Columns that were only needed for grouping were removed, as the models needed no further information from them. Fields like Beneficiary Id, ClaimID, operating and specific attending physician/diagnosis codes, and admission and discharge dates ranked low in feature importance during the investigation.
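
The split and oversampling steps can be sketched without any libraries. In practice these are usually done with scikit-learn's `train_test_split(..., stratify=y)` and imbalanced-learn's `SMOTE`, and the post does not show which calls were used. The functions `stratified_split` and `smote_like` below are hypothetical helpers, the data is synthetic, and the oversampler is a simplified SMOTE that interpolates toward the single nearest minority neighbor (k=1).

```python
import random

def stratified_split(labels, train_frac=0.7, seed=0):
    """Return (train_idx, test_idx) with each class split in the same proportion."""
    rng = random.Random(seed)
    by_class = {}
    for i, y in enumerate(labels):
        by_class.setdefault(y, []).append(i)
    train_idx, test_idx = [], []
    for idxs in by_class.values():
        rng.shuffle(idxs)
        cut = round(train_frac * len(idxs))
        train_idx += idxs[:cut]
        test_idx += idxs[cut:]
    return train_idx, test_idx

def smote_like(minority, n_new, seed=0):
    """Create n_new synthetic points, each on the line segment between a
    minority point and its nearest minority neighbor (SMOTE with k=1)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        x = minority[i]
        others = [p for j, p in enumerate(minority) if j != i]
        neighbor = min(others, key=lambda p: sum((a - b) ** 2 for a, b in zip(x, p)))
        u = rng.random()  # position along the segment, in [0, 1)
        synthetic.append(tuple(a + u * (b - a) for a, b in zip(x, neighbor)))
    return synthetic

# Hypothetical toy data: 20 fraud (1) and 80 non-fraud (0) rows, two features each.
labels = [1] * 20 + [0] * 80
features = [(float(i), float(i % 7)) for i in range(100)]

train_idx, test_idx = stratified_split(labels)
train_fraud = [features[i] for i in train_idx if labels[i] == 1]
train_valid = [features[i] for i in train_idx if labels[i] == 0]

# Oversample only the training minority class; the test set stays untouched.
synthetic_fraud = smote_like(train_fraud, len(train_valid) - len(train_fraud))
balanced_fraud = train_fraud + synthetic_fraud
print(len(train_valid), len(balanced_fraud))
```

Oversampling after the split, and only on the training portion, keeps synthetic points out of the data used to judge the model.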

Data Analysis: Logistic Regression & Random Forest

Visualization of Logistic Regression Machine Learning Model Results

Since the target we are predicting is binary (i.e., Fraud/Not Fraud), Logistic Regression was the first place to start, as it is well suited to binary classification. After tuning the model's hyperparameters with GridSearch, the model handled the training data fairly well. Its detection of true positives and true negatives was fairly high in both the training and validation sets. When it came to the F-Score, however, the model did not do as well on unseen data.

This tends to happen when the model learns patterns that are specific to the training data, where it has a lot to work with, but that do not carry over to a newer or smaller set of data. This is known as overfitting. This type of issue can sometimes be reduced with certain techniques, like increasing the sample size or randomly removing features from the data programmatically, which forces the model to rely on more general patterns.

Model Data Analysis and Significance

If we take a look at the measures of importance taken from the LR model, the top four features that stood out were:

  1. Per Provider Average Insurance Claim Amount Reimbursed
  2. Insurance Claim Amount Reimbursed
  3. Per Claim Diagnosis Code 2 Average Insurance Claim Amount Reimbursed
  4. Per Attending Physician Average Insurance Claim Amount Reimbursed

These mostly make sense, as we would expect fraud to revolve around the claim amounts coming from the physicians and the individual providers. The model appears to be doing its job.

Despite currently being overfit, the model performed well in most aspects. The share of actual fraud that was flagged and the share of valid claims correctly identified as valid were both acceptable. The main concern is the rate of false negatives, which may be a prominent source of error. Realistically, any false positives flagged by future versions of this model can be verified by providers on a case-by-case basis, although this number should probably come down as well with further tuning.

Visualization of Random Forest Machine Learning Model Results


A Random Forest model was developed alongside the LR model for comparison, and it performed similarly. It did fairly well in all aspects after tuning with GridSearch, and yielded a slightly different yet broadly similar ranking of feature importance. It too, however, suffered from a lower F-Score on validation, indicating that it is also overfit.

Model Evaluations/Takeaway

This investigation produced some interesting takeaways about the current Medicare insurance fraud issue. The physician-to-patient ratio shows that fraud is thoroughly mixed in with non-fraud, which highlights the difficulty of the problem at hand. Claims are difficult to distinguish from one another, and any criminal activity seems to be well hidden among the masses of data submitted to Medicare. Some strengths and weaknesses of the model developed were that:

  • Accuracy, Sensitivity and Specificity are good, but F1 is low in validation.
  • False (+) and False (-) values matter because they represent monetary value, so this effectiveness score is important.
  • The False (+) flag rate is higher than the False (-) rate, but theoretically those can be resolved on a case-by-case basis.
  • The True (+) flag rate is relatively high, which means the model is catching fraud properly.
  • The False (-) flag rate is a concerning source of error; we don't want fraud to go unnoticed.
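
For reference, here is a minimal sketch of how these scores follow from the four confusion-matrix cells. The inputs are the LR model's shares quoted in the cost/benefit breakdown of this post (percent of all claims). The post does not say whether those shares come from training or validation data, so the resulting F1 need not match the low validation F1 noted above.

```python
# Confusion-matrix shares (percent of all claims) quoted for the LR model.
tp, fp, tn, fn = 44.93, 5.07, 41.90, 8.10

accuracy = (tp + tn) / (tp + fp + tn + fn)
sensitivity = tp / (tp + fn)   # recall: share of actual fraud that was flagged
specificity = tn / (tn + fp)   # share of valid claims that were left unflagged
precision = tp / (tp + fp)     # share of flagged claims that were actually fraud
f1 = 2 * precision * sensitivity / (precision + sensitivity)

print(f"accuracy={accuracy:.3f} sensitivity={sensitivity:.3f} "
      f"specificity={specificity:.3f} f1={f1:.3f}")
```

F1 ignores true negatives entirely, which is why it can fall on unseen data even while accuracy and specificity stay high.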

Cost/Benefit Data Analysis

While error rates are concerning, the big thing when it comes to government programs that support people in need is money. How much do these things cost us, and what can be saved? If we break down the issue and how this model handled it, this is what we can say.

Original Data

  • Originally, Medicare found that 38% of claims submitted for reimbursement were fraudulent. Taking the average reimbursement for a fraudulent claim, that's ~$1300 per claim paid out if none of them are caught.
  • In the training data we have ~558,211 total claims.
  • 38% of total claims is ~212,120 claims.
  • At an average of $1300 per fraudulent claim, that's $275,756,234 saved if they're caught.
  • The other 62% of claims accounts for ~346,090 claims, which at ~$700 per claim is ~$242,263,700.
  • At those rates, the amount saved with Medicare's current fraud detection method is $275,756,234 - $242,263,700 = ~$33,492,534.

Under My LR Machine Learning Model

  • True negatives amount to 41.90%, or 233,890 claims, or $163,723,286 at the $700 average.
  • Adding in the false negatives at 8.10%, or ~45,215 claims, or $58,779,618 at the $1300 average,
  • this totals $222,502,904 paid out.
  • True positives amount to 44.93%, or 250,804 claims, or $326,045,462 saved at the $1300 average.
  • False positives account for 5.07%, or 28,301 claims, or $19,810,908 at the $700 average.
  • If we add up the amount paid out, that's ~$242,313,812.
  • The amount we saved catching fraud is ~$326,045,462 - $242,313,812 = ~$83,731,650.
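
The model-side arithmetic above can be re-derived in a few lines. This is a minimal sketch that follows the post's own accounting, where paid out means true negatives plus false negatives plus false positives, and saved means true positives. Shares are written in hundredths of a percent so everything stays in integer arithmetic, and each dollar figure is rounded down. The helper `dollars` is a hypothetical name.

```python
TOTAL_CLAIMS = 558_211
AVG_FRAUD = 1300  # average reimbursement for a fraudulent claim, in dollars
AVG_VALID = 700   # average reimbursement for a non-fraud claim, in dollars

def dollars(share_bp, avg):
    """Value of share_bp / 10,000 of all claims at avg dollars each, rounded down."""
    return TOTAL_CLAIMS * share_bp * avg // 10_000

true_neg = dollars(4190, AVG_VALID)   # 41.90% valid claims, paid as normal
false_neg = dollars(810, AVG_FRAUD)   # 8.10% fraud that slipped through
true_pos = dollars(4493, AVG_FRAUD)   # 44.93% fraud caught
false_pos = dollars(507, AVG_VALID)   # 5.07% valid claims wrongly flagged

paid_out = true_neg + false_neg + false_pos
net_saved = true_pos - paid_out
print(f"paid out: ${paid_out:,}  net saved: ${net_saved:,}")
```

Changing the four shares shows how sensitive the net figure is to the false-negative rate, since each missed fraudulent claim is costed at the higher $1300 average.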

About Author

David Green

Certified in data science, confident working in R, Python, Git and SQL development. Skilled in applying machine learning techniques in data analysis of large datasets, alongside traditional statistical analysis triage