Utilizing Data to Detect the Red Flags of Medicare Fraud

Posted on Oct 28, 2021
The skills I demoed here can be learned through taking Data Science with Machine Learning bootcamp with NYC Data Science Academy.

LinkedIn | Other Work | Github Repository


Medicare is the federal health insurance program of the United State of America for people sixty-five and older or with special conditions. Medicare fraud is a multi-billion-dollar problem not only for the Medicare program and the United States Government but also for the recipients forced to pay higher premiums to offset the fraud. Data shows that improper payments total almost $50 billion each year to doctors, hospitals, and networks committing fraud.

Unlike centralized healthcare systems, Medicare acts as the healthcare insurer rather than the service provider. Subsequently, this allows most healthcare providers to accept Medicare patients but also decreases the ability for fraud oversight.

Medicare fraud is committed in numerous ways and fraudsters become more creative and subtle as overt fraud is caught and prosecuted. Several common methods for fraud are upcoding diagnosis, unbundling procedures, kickbacks to other providers, and overbilling. These improper payouts not only sap money from Medicare but also increase the premiums paid by all Medicare recipients to cover the fraudulent payments.

We use a Kaggle dataset encompassing a subset of Medicare claims. This dataset covers more than a year of claims between a group of patients and providers. There are inpatient and outpatient claims tables, as well as information regarding the beneficiaries (patients) and a key indicating if a provider is suspected of fraud.



The first step in understanding our problem is understanding what is missing in our dataset. Luckily, this is a high-quality dataset and almost all missingness is intentional.

This occurs in several ways. The largest sources of missingness are Diagnosis and Procedure codes, which can be seen on the right side of both graphs in Fig 1. These codes, along with the corresponding physicians, are intentionally missing because not all patients need all 10 diagnosis codes, even fewer need all 6 procedure codes, and many patients lack an Operating Physician because they did not undergo an operation. It seems that the only real source of missingness is the amount of deductible paid by the patient; several hundred values missing which seems to indicate payment was not collected.

Utilizing Data to Detect the Red Flags of Medicare Fraud
Figure 1: Missingness plots for inpatient and outpatient Claims. Grey bars signify filled data, white areas are missing values.

Data on Fraudulence

To predict fraudulent providers, we are given a key indicating if a particular provider (represented by a unique code) is suspected of fraud. In this key, there is a 9% prevalence of fraud. These fraudulent providers make up 36% of outpatient claims, 58% of inpatient claims, and 38% of overall claims (Fig. 2).

Utilizing Data to Detect the Red Flags of Medicare Fraud
Figure 2: Provider Fraudulence by Source Table

This is because fraudulent providers are submitting more claims than non-suspect providers. While most non-suspect providers are submitting 10-200 claims throughout our dataset, suspect providers are submitting 100 to 1000 claims (Fig. 3).

Utilizing Data to Detect the Red Flags of Medicare Fraud
Figure 3: A kernel density estimation plot for the proportion of suspect and non-suspect providers by the number of submitted claims. With Box-Cox normalization this difference is statistically significant with a Welch’s t-test p-value of 9.9e-109.

While most analyses would aggregate together information about all claims by the provider, this approach misses an important aspect of our data and the real-world problem. Most aggregation techniques (and models built from them) will change significantly when the timeframe of the aggregation is changed.

Aggregation also requires that sufficient time to pass to collect data, which in this case means allowing a provider to commit fraud for a year before rendering a decision on their status. Thus, a better way is to identify the indicators of fraud within each claim and render a decision on that level.

Diagnosis and Procedure Codes Data

One of the most common forms of fraud is upcoding and unbundling, two techniques in which diagnosis and procedure codes are either increased in severity or separated into multiple codes when they should be reported as a single code. These codes are derived from the IDC-9-CM system, which codifies the diagnoses and procedures in the United States.

As we might expect, when we investigate the diagnosis and procedure code distribution across inpatient and outpatient claims, we see that these claims have a unique code distribution (Fig. 4). Inpatients tend to have more diagnosis and procedure codes than outpatients; About 60% of inpatients have 1 or 2 codes and almost no outpatients have any procedure codes. Interestingly, over 60% of inpatients have 9 procedure codes while only 20% have 8 or 10 codes.

Figure 4: Relative Frequency for Inpatient & Outpatient, Diagnosis and Procedure code count.

One way to visualize this relationship in the context of fraud is using a heatmap (Fig. 5). While a high number of procedure codes per diagnosis code increases the likelihood of fraud, very few patients have more than 2 procedure codes, so this feature only targets a small fraction of fraudulent providers.

Figure 5: Heatmap of Diagnosis and Procedure Codes, colored by the likelihood of suspected fraud.

Removed Features and Protected Classes

Some features in our dataset held little or sensitive information. One example is the beneficiaries table, which provides patient information such as Medicare type (A or B). This was conveyed in two columns denoting the number of months the patient had participated in Medicare A and B. However, this information is not relevant to the fraud analysis, as evidenced by the fact that both Medicare A and B are heavily skewed to 12 months and there was no clear signal across the months of participation indicating an abundance of fraud (Fig. 6).

Figure 6: Number of Months of Medicare Part A (left) and Part B (right) vs Abundance of Fraud (points) and Proportion of Patients (bars). The line indicates the overall fraud rate of claims.

Furthermore, the beneficiary table also included race and gender, which are protected classes in the United States. Machine learning models tend to use race and gender information in biased ways, and within our context, there isn’t much information to be gained from incorporating these features (Fig. 7). Thus, for these reasons the beneficiary table was not considered in the fraud analysis.

One possible downside to this decision is that predatory providers might target specific groups as one aspect of their fraud, and thus removing these features prevents the model from identifying this trend. However, without information to support the predatory provider hypothesis, it is better to drop these sensitive features.

Figure 7: Fraud (point) and Population (bar) Abundance for Race (left) and Gender (right).

Data on Feature Cleaning and Engineering

The last step before modeling is feature cleaning and engineering. Several columns needed minor cleaning, such as characters to numerical values and filling the missing values of Deductible Amount Paid to 0.

Several types of features were engineered in our dataset, most notably the count of diagnosis, procedure, and chronic conditions. Multiple accumulating features were created, including Time Since Last Claim, Total Claim Days, Cumulative Visits, Cumulative Provider Visits (Fig. 8), and cumulative payments. Another set of features divides the Insurance Claim Payment amongst the physicians, the number of codes, and days of care.

A feature interaction of note is the Claim End Date and Discharge Date. For the few claims which have a different Claim End Date vs Discharge Date, all claims are from fraudulent providers. A pure node!

Figure 8: Proportion of Fraudulent Providers by the Cumulative Number of Patient Visits to a Provider

Since the models to be employed are all tree-based models, there is no pressing need to encode that data however for good practice the continuous features were encoded using a Power Transformer.



While using claims instead of providers offers a more balanced dataset, the 40-60 split still affects our model. To remedy this, I employed Synthetic Minority Oversampling Technique (SMOTE), which creates new data for the minority class (claims from suspect providers) by placing these new values ‘between’ two existing values in high-dimensional space. SMOTE-NC specializes in a mixture of nominal and continuous data; and SMOTE-Tomek removes points whose nearest neighbor is a point of the opposite class, sharpening the boundary between the data (Fig. 9).

Figure 9: Number of Suspect and non-suspect Claims before and after Upsampling via various methods.

Model Selection

The upsampled data was then fed into the out-of-box Random Forest, Gradient Boost, XGBoost, and CatBoost models. The accuracy of each model was assessed. Across all models, those trained with SMOTE-Tomek performed better than SMOTE alone or the original data (Fig. 10). Between models, XGBoost and CatBoost performed better than Gradient Boost or Random Forest. Consequently, XGBoost and CatBoost were selected for hyperparameter tuning.

Figure 10: Cross-Validation Accuracy Scores for out-of-box models trained on Original, SMOTE, SMOTE-NC, and SMOTE-Tomek modified data.

Hyperparameter Tuning and Analysis

XGBoost and CatBoost were tuned via Randomized Search with Cross-Validation using Recall as the scoring metric. This ‘pushes’ the model towards parameters that minimize false positives (wrongly accused providers) at the expense of accuracy. After tuning CatBoost outperformed XGBoost but both models show signs of overfitting (Fig. 11). On the Test split, the tuned CatBoost model achieved 87% accuracy and 61% recall (Table 1).

Figure 11: ROC and AUC for Train and Test split for XGBoost and CatBoost.
Metric Train Split Test Split
Accuracy 89% 78%
Recall 84% 61%
Precision 93% 77%
F1-score 88% 68%
Table 1: CatBoost Train and Test split Scores

We can see this visually with a confusion matrix (Fig. 12) where nearly a third of suspected providers are mislabeled as non-suspect (bottom left). However, due to scoring by the recall, very few innocent providers as improperly suspected (top right). This means that while our model can appropriately classify non-suspect claims, it is under-identifying fraudulent claims.

Figure 12: Confusion Matrix for CatBoost Predictions on the Test Split.

Feature Importance

Instead of the standard feature importance, which is biased towards features with high cardinality and is more sensitive to outliers, permutation feature importance was implemented. Permutation importance iteratively shuffles a feature and examines the drop in score when the relationship between feature and target is disrupted.

In the tuned CatBoost model, we see that State and County have strong feature importance (Fig. 13). This is likely due to a limited number of providers in each state/county, so narrowing down to a single provider provides information on all claims submitted by the said provider (otherwise known as data leakage). This will be remedied in subsequent models.

Several of the engineered features such as chronic condition counts, code counts, cost per procedure code, and the number of patient visits to a provider as in the top-ranked features. However, other engineered features such as the date discrepancy are rated low, likely due to the limited number of samples affected. Additionally, many Boolean features such as the chronic condition had low feature importance.

Figure 13: Permutation Importance of Features used in the best model.


At this point, we have built a model which performs admirably at predicting the likelihood of fraud for any claim. This is where many presentations traditionally end. However, we were asked to predict whether a provider is suspected of fraud, not a given claim. One way we can do this is by using the predicted probability of fraud for each claim and mapping that back to the provider who originated the claim (Fig. 14)

Figure 14: Probability of Fraud for each claim submitted by non-suspect (left) and suspect (right) providers.

Using an accumulating average, we can predict the fraudulence of each provider based on the fraud probability of all their submitted claims. While there is some gain in accuracy (Fig. 15), recall and precision suffer (Table 2) due to a high rate of false negatives and false positives.

Figure 15: (Left) ROC and AUC for Train and Test split for CatBoost Aggregation. (Right) Confusion Matrix for CatBoost Aggregation Predictions on the Test Split.
Metric Test Split Aggregated Prediction
Accuracy 87%
Recall 43%
Precision 38%
Table 2: CatBoost Aggregation Test Split Scores

Are All Claims Needed?

However, not all the claims submitted by providers are needed to make an accurate prediction regarding the provider. From 1 to 25 claims, we increase AUC from 74% with 1 claim to 84% using 25 claims (Fig. 16). The sweet spot appears to be up to 10 claims, which yields nearly identical scores compared to using all claims submitted across the year of our dataset.

While only 75% of providers submit more than 10 claims, 90% of suspect providers submit at least 10 claims. Therefore, we can make a fraud prediction after receiving 10 claims (which are on average submitted within 2 months), instead of waiting to accumulate a years’ worth of claims.

Figure 16: (Left) ROC and AUC for Train, Test, and Test-using the nth claim for CatBoost Aggregation. (Right) Confusion Matrix for CatBoost Aggregation Predictions comparing all claims vs the first 10 claims.

Finally, we want to examine what our model gets wrong. One way to do this is by examining what the distributions of claims look like for innocent and suspect providers. Correctly classified providers have inverse- ‘T’ and ‘T’ shaped distributions, respectively, while misclassified providers tend to have more ambiguous distributions (Fig. 17). This may be due to the non-suspect claims being mixed with suspect claims, or because our model lacks a feature able to confidently discriminate these claims.

Figure 17: A confusion matrix of providers' fraud prediction based on the distribution of claims for each provider. Each violin represents the fraud probability of claims submitted by a single provider.


Fighting Medicare fraud is a constant battle, and this model is only one tool in our arsenal. We can predict individual claims with a 78% accuracy and aggregate up predictions to make predictions about providers with an 89% accuracy. This prediction holds up even using a limited number of claims, supplying us with multiple opportunities to identify and prosecute fraudulent providers.

Some additional improvements to the model could be claim-specific labeling as opposed to the provider-level labels. Furthermore, the diagnosis and procedure codes contain more information than we currently use, and thus building an NLP on the definitions might prove fruitful. Finally, using the claim predictions to inform a secondary model trained on aggregated predictions shows promise given the results of other projects.

This concludes not only my capstone project but also my time with New York City Data Science Academy. My thanks go out to Gabi Morales, Sam Audino, and the whole NYCDSA team for their support and guidance.

About Author

James Welch

I was trained as a synthetic biologist and I am working to become a data scientists too. I have expertise in the genetic engineering of a variety of single-celled organisms, DNA and protein design, and industrial process scaling...
View all posts by James Welch >

Related Articles

Leave a Comment

No comments found.

View Posts by Categories

Our Recent Popular Posts

View Posts by Tags

#python #trainwithnycdsa 2019 2020 Revenue 3-points agriculture air quality airbnb airline alcohol Alex Baransky algorithm alumni Alumni Interview Alumni Reviews Alumni Spotlight alumni story Alumnus ames dataset ames housing dataset apartment rent API Application artist aws bank loans beautiful soup Best Bootcamp Best Data Science 2019 Best Data Science Bootcamp Best Data Science Bootcamp 2020 Best Ranked Big Data Book Launch Book-Signing bootcamp Bootcamp Alumni Bootcamp Prep boston safety Bundles cake recipe California Cancer Research capstone car price Career Career Day citibike classic cars classpass clustering Coding Course Demo Course Report covid 19 credit credit card crime frequency crops D3.js data data analysis Data Analyst data analytics data for tripadvisor reviews data science Data Science Academy Data Science Bootcamp Data science jobs Data Science Reviews Data Scientist Data Scientist Jobs data visualization database Deep Learning Demo Day Discount disney dplyr drug data e-commerce economy employee employee burnout employer networking environment feature engineering Finance Financial Data Science fitness studio Flask flight delay gbm Get Hired ggplot2 googleVis H20 Hadoop hallmark holiday movie happiness healthcare frauds higgs boson Hiring hiring partner events Hiring Partners hotels housing housing data housing predictions housing price hy-vee Income Industry Experts Injuries Instructor Blog Instructor Interview insurance italki Job Job Placement Jobs Jon Krohn JP Morgan Chase Kaggle Kickstarter las vegas airport lasso regression Lead Data Scienctist Lead Data Scientist leaflet league linear regression Logistic Regression machine learning Maps market matplotlib Medical Research Meet the team meetup methal health miami beach movie music Napoli NBA netflix Networking neural network Neural networks New Courses NHL nlp NYC NYC Data Science nyc data science academy NYC Open Data nyc property NYCDSA NYCDSA Alumni Online Online Bootcamp Online Training Open Data painter pandas Part-time performance phoenix pollutants Portfolio Development precision measurement prediction Prework Programming public safety PwC python Python Data Analysis python machine learning python scrapy python web scraping python webscraping Python Workshop R R Data Analysis R language R Programming R Shiny r studio R Visualization R Workshop R-bloggers random forest Ranking recommendation recommendation system regression Remote remote data science bootcamp Scrapy scrapy visualization seaborn seafood type Selenium sentiment analysis sentiment classification Shiny Shiny Dashboard Spark Special Special Summer Sports statistics streaming Student Interview Student Showcase SVM Switchup Tableau teachers team team performance TensorFlow Testimonial tf-idf Top Data Science Bootcamp Top manufacturing companies Transfers tweets twitter videos visualization wallstreet wallstreetbets web scraping Weekend Course What to expect whiskey whiskeyadvocate wildfire word cloud word2vec XGBoost yelp youtube trending ZORI