Healthcare Fraud Detection

Posted on Jul 5, 2022


In recent years, the rate at which doctors and hospitals have conducted fraudulent activities, scams, and schemes has troubled authorities. The Department of Justice (DOJ) recovered over $3 billion from False Claims Act cases in the 2019 fiscal year, with $2.6 billion coming from healthcare fraud schemes. The DOJ also reported that these healthcare fraud cases involved a wide range of stakeholders, including drug and medical device manufacturers, care providers, hospitals, pharmacies, hospice organizations, laboratories, and physicians.

Fraud Investigation

Healthcare/Medicare fraud is more prevalent among medical providers and usually results in higher healthcare costs, insurance premiums, and taxes for the general population. Through illegitimate activities such as submitting false claims, medical providers try to collect Medicare reimbursements they are not entitled to. This capstone project focuses on fraud committed by doctors and hospitals. Using real-life Medicare claims data, I attempted to identify key healthcare fraud indicators and fraudulent provider characteristics that could be used in Medicare fraud investigations, and applied supervised machine learning classification algorithms to label providers as fraud or non-fraud.

Healthcare Fraud Overview


As per the FBI, health care fraud can be committed by medical providers, patients, and others who intentionally deceive the health care system to receive unlawful benefits or payments. Some of the common ways medical providers deceive patients and insurance providers through the claims process are listed below:

  • Billing for care not rendered.
  • Submitting duplicate claims.
  • Falsifying claim/patient info.
  • Disguising non-covered services as covered services.
  • Using incorrect diagnosis/procedure codes.
  • Stealing a Medicare number or card and using it to submit fraudulent claims.

Let us also look at some common terminology associated with healthcare fraud, which covers some of the offenses described above. As per a Medicare Advantage article, some of the common ways in which illegitimate Medicare spending may be carried out are as follows:

  • Double Billing:
    • This type of Medicare fraud involves deliberately charging twice for a service or product that was only performed or supplied once.
  • Phantom Billing:
    • This involves billing for a test, procedure, or other medical service that was never actually performed. This is one of the most common forms of Medicare fraud.
  • Upcoding:
    • Upcoding is altering the codes assigned to specific billable services to reflect a higher-level service than what was actually performed. This scam is carried out to receive a fraudulently higher Medicare reimbursement than is actually owed.
  • Unbundling:
    • This involves taking a comprehensive service and separating it into several specific services in order to bill for each one independently. This leads to a higher reimbursement total.
  • Kickbacks:
    • Kickbacks occur when a provider accepts payment from a pharmaceutical company or medical device supplier in exchange for recommending or prescribing that company's product to patients.

Medicare Claims Dataset

The Medicare claims data used in this project comes from the Kaggle dataset Healthcare Provider Fraud Detection Analysis, uploaded by Rohit Anand Gupta. The data comprises three sub-datasets; their details are listed below.

The Beneficiary dataset gives us patient-level information such as age, race, gender, location, chronic conditions, deductible paid, and reimbursement received. The Inpatient and Outpatient datasets comprise claim-level information for those patients, including the associated hospital, associated physicians, claim start/end dates, admission/discharge dates, and the diagnosis/procedure codes associated with each claim.

Another key piece of information included in this dataset is the fraud labels, which are placed on the medical providers/hospitals and indicate whether each is possibly fraudulent or not. An initial review shows that the labels are highly imbalanced; almost all providers are labeled as non-fraud. Such imbalance is detrimental during modeling, especially for classification tasks, as we run the risk of labeling all our providers as non-fraud. The data was balanced using upsampling techniques, which are discussed further in this post.

Fraud Data Preprocessing

Before getting into the details of the extensive analysis done on the claims data, I would like to discuss how the data was preprocessed. First, missingness in the data was handled. The data included many missing values, such as a missing date of death when the patient is alive and a missing operating physician when no surgical operation was performed. Missing information was imputed accordingly. Also, for uniform and efficient preprocessing, all categorical data was label-encoded.
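As an illustration, the imputation and label-encoding steps might be sketched as follows with pandas (the column names mirror the Kaggle dataset, but the toy values and fill choices here are my own assumptions, not the project's exact code):

```python
import pandas as pd

# Toy stand-in for the beneficiary/claims data (illustrative values only)
df = pd.DataFrame({
    "DOD": [None, "2009-02-01", None],              # missing -> patient is alive
    "OperatingPhysician": ["PHY1234", None, None],  # missing -> no surgery performed
    "Gender": [1, 2, 1],
    "Race": ["0", "1", "0"],
})

# Flag-style imputation: the absence of a value is itself informative
df["IsDeceased"] = df["DOD"].notna().astype(int)
df["OperatingPhysician"] = df["OperatingPhysician"].fillna("None")

# Label-encode categorical columns for uniform numeric input
for col in ["OperatingPhysician", "Race"]:
    df[col] = df[col].astype("category").cat.codes

print(df[["IsDeceased", "OperatingPhysician", "Race"]])
```

The same pattern extends to the other categorical columns in the claims data.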

I also decided to keep outliers in the data, as they could provide key fraud-indicator information; these outliers could very well be transactions where actual fraud is being committed. This is also why the data was robust-scaled before modeling. Another important preprocessing step was upsampling the data to bring the fraud label ratio to 1:1. The data was processed via two upsampling techniques, SMOTE (which creates synthetic points between pairs of existing minority points) and BorderlineSMOTE (which creates points along the decision boundary between the two classes), and their performances were compared.
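The core idea behind SMOTE, synthesizing new minority samples by interpolating between existing minority points and their nearest minority neighbors, can be sketched in plain NumPy (a simplified illustration of the idea, not the imbalanced-learn implementation actually used):

```python
import numpy as np

rng = np.random.default_rng(0)

def smote_sketch(X_min, n_new, k=5):
    """Simplified SMOTE: synthesize points on line segments between
    each minority sample and one of its k nearest minority neighbors."""
    n = len(X_min)
    # Pairwise distances within the minority class
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)                 # never pick a point as its own neighbor
    nn = np.argsort(d, axis=1)[:, :k]           # k nearest neighbors per sample
    base = rng.integers(0, n, size=n_new)       # random minority sample to start from
    neigh = nn[base, rng.integers(0, min(k, n - 1), size=n_new)]
    gap = rng.random((n_new, 1))                # interpolation factor in [0, 1)
    return X_min[base] + gap * (X_min[neigh] - X_min[base])

# Tiny example: 3 minority points, ask for 4 synthetic ones
X_minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
X_new = smote_sketch(X_minority, n_new=4, k=2)
print(X_new.shape)  # (4, 2)
```

BorderlineSMOTE refines this by only interpolating from minority points that lie near the class boundary.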

Next, I created new features and dropped some redundant ones. New features were created that capture whether the patient was deceased, the duration of the hospital stay/claim, the number of associated doctors/claims, the number of chronic conditions the patient has, and so on. Features with many null values, or ones from which other features were derived, were dropped. After all of the preprocessing, the three datasets were combined into one training and one testing dataset.
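A sketch of this kind of feature engineering with pandas (column names follow the Kaggle dataset's conventions, but the toy data and the coding of chronic conditions as 1 = present are assumptions for illustration):

```python
import pandas as pd

claims = pd.DataFrame({
    "ClaimStartDt": ["2009-01-05", "2009-03-10"],
    "ClaimEndDt":   ["2009-01-09", "2009-03-10"],
    "AttendingPhysician": ["PHY1", "PHY2"],
    "OperatingPhysician": ["PHY3", None],
    "ChronicCond_Diabetes": [1, 2],        # assumed coding: 1 = yes, 2 = no
    "ChronicCond_Heartfailure": [1, 1],
})

# Duration of the claim in days
claims["ClaimDuration"] = (
    pd.to_datetime(claims["ClaimEndDt"]) - pd.to_datetime(claims["ClaimStartDt"])
).dt.days

# Number of physicians listed on the claim
phys_cols = ["AttendingPhysician", "OperatingPhysician"]
claims["NumPhysicians"] = claims[phys_cols].notna().sum(axis=1)

# Number of chronic conditions the patient has
chronic_cols = [c for c in claims.columns if c.startswith("ChronicCond_")]
claims["NumChronicConds"] = (claims[chronic_cols] == 1).sum(axis=1)

print(claims[["ClaimDuration", "NumPhysicians", "NumChronicConds"]])
```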

Beneficiary Information Analysis

Before we proceed, let us look at who our patients are. From the graphs above, we see that the majority of our patients belong to the race encoded as 0 and the gender encoded as 1. Most patients fall between the ages of 68 and 82; however, there are some outliers as well. Almost all our patients are alive. I also studied the top beneficiaries who paid the highest deductibles and for whom the highest total reimbursements were received. Several beneficiaries are common to both groups, as we can see from the two graphs below.

Fraud vs Non-Fraud Providers Study:

To understand the key characteristics of fraudulent providers, I extensively studied the inpatient/outpatient data based on the fraud labels provided and attempted to uncover what sets fraud providers apart from non-fraud providers. Following are some of the findings uncovered through this study (comparisons were made between the inpatient/outpatient datasets and between fraud/non-fraud providers):

Maximum Reimbursement Amounts

The graphs below detail the distribution of the maximum total reimbursement amount received by fraud and non-fraud providers in the inpatient and outpatient claims. There is a difference in the average maximum reimbursement amounts received by the two types of providers in the inpatient dataset. A similar difference is not seen between the outpatient fraud and non-fraud providers; however, we can see that fraud providers claimed some of the highest reimbursements.

The bar graphs below show the top providers with the highest maximum reimbursement amounts (in both the inpatient and outpatient datasets) and how many of those were fraud vs non-fraud. Among the top inpatient providers, all except one are labeled as fraud. Among the top outpatient providers, there is a 50:50 split; however, the highest reimbursements were claimed by fraud providers.

Number of Claims

The graphs below detail the distribution of the total number of claims submitted by fraud and non-fraud providers in the inpatient and outpatient claims. In both datasets, fraud providers submitted far more claims than non-fraud providers.

The bar graphs below show the top providers with the highest total number of claims submitted (in both the inpatient and outpatient datasets) and how many of those were fraud vs non-fraud. All the top providers in both datasets are labeled as fraud.

Diagnosis Code Counts

The graphs below detail the distribution of the total number of diagnosis codes listed on claims for fraud and non-fraud providers in the inpatient and outpatient claims. Among inpatient providers, the average code counts are higher for non-fraud providers than for fraud providers; however, exactly the opposite is true for outpatient providers.

The bar graphs below show the top providers with the highest total number of diagnosis codes listed on claims (in both the inpatient and outpatient datasets) and how many of those were fraud vs non-fraud. Among the top inpatient providers, all except one are labeled as fraud, whereas among the top outpatient providers only a few carry the fraud label.

Average Patient Age/Chronic Condition Counts

Next, I looked at average patient age and chronic condition counts for both types of providers in the inpatient and outpatient datasets. From the graphs below, it appears that for both inpatient and outpatient providers, the range of patient ages is narrower for fraud providers than for non-fraud providers. Likewise, the range of patient chronic condition counts is narrower for fraud providers than for non-fraud providers.

Patients per state - Fraud Providers

The last avenue I explored as part of this study was where the majority of the patients of fraud providers reside, in both the inpatient and outpatient datasets.

From these graphs, it looks like most of the patients of fraud providers in both the inpatient and outpatient datasets come from a few common states. The states encoded as 5, 30, and 33 have the highest numbers of patients associated with a fraud-labeled medical provider.

Classification Task - Data Modeling

After all the in-depth data analysis, I moved on to data modeling using Python and machine learning classification algorithms. For modeling, I used the training dataset, did a 70:30 train-test split, and evaluated the results for both the SMOTE and BorderlineSMOTE upsampled data. Model performance was evaluated using the F1 score, which is the harmonic mean of precision and recall.

Based on these graphs, we can see that there is not much difference in performance between the two upsampling techniques. The Linear SVC model performs the worst in this case and misclassifies the majority of the providers. However, the XGBoost and LightGBM models (with final features selected via recursive feature elimination) perform much better at distinguishing the two classes. In the next section, we will look at the performance evaluation of our better-performing LightGBM model.

LightGBM Performance Evaluation

One of the best-performing models of all the model types attempted was the LightGBM model. We can see why it performed well through some of the classification performance metrics. First, we will look at the confusion matrix, which shows correct classifications vs. misclassifications as predicted by the model. We can see that this model does a good job of classifying non-frauds as non-frauds and possible frauds as such. I was also able to detect more possible frauds by adjusting the model's classification threshold a bit.

Confusion matrix
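Adjusting the classification threshold trades precision for recall: lowering it flags more providers as possibly fraudulent. A sketch with a logistic regression on synthetic data (the 0.35 threshold is illustrative, not the value used in the project):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Imbalanced synthetic data standing in for the provider dataset
X, y = make_classification(n_samples=500, weights=[0.8, 0.2], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
proba = clf.predict_proba(X_te)[:, 1]   # P(fraud) for each provider

# Default 0.5 threshold vs. a lowered one to catch more possible frauds
pred_default = (proba >= 0.5).astype(int)
pred_lowered = (proba >= 0.35).astype(int)

print(recall_score(y_te, pred_default), recall_score(y_te, pred_lowered))
```

Lowering the threshold can only add positive predictions, so recall on the fraud class never decreases, at the cost of more false positives.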

AUC/ROC and Precision-Recall Curve:

Next, we will look at the ROC curve for the LightGBM classifier. ROC curves allow us to visualize the tradeoff between a model's sensitivity and specificity: ideally, the true-positive rate should be close to one while the false-positive rate stays close to zero. The higher the area under the curve (AUC), which summarizes the relationship between false positives and true positives, the better the model. Our LightGBM model does well in this regard; it has a high, steep ROC curve with an AUC score of 0.94.

Next, we will look at the precision-recall curve for this model. The precision-recall curve shows the tradeoff between result relevancy (precision) and completeness (recall). The goal is to maximize both precision and recall and achieve a high area under the curve; a high average precision is also desirable. The LightGBM model achieves both here.
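Both curves are computed from the model's predicted probabilities; a sketch with scikit-learn on synthetic data (a Random Forest stands in for LightGBM here):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (average_precision_score, precision_recall_curve,
                             roc_auc_score, roc_curve)
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=800, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=1)

# Predicted probabilities for the positive (fraud) class
proba = (RandomForestClassifier(random_state=1)
         .fit(X_tr, y_tr).predict_proba(X_te)[:, 1])

fpr, tpr, _ = roc_curve(y_te, proba)                 # points for the ROC curve
prec, rec, _ = precision_recall_curve(y_te, proba)   # points for the PR curve

auc = roc_auc_score(y_te, proba)            # area under the ROC curve
ap = average_precision_score(y_te, proba)   # summarizes the PR curve
print(round(auc, 3), round(ap, 3))
```

Plotting `fpr` vs `tpr` and `rec` vs `prec` (e.g. with matplotlib) produces the two curves discussed above.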

Class Prediction Error

Lastly, we will look at the class prediction error plot for the LightGBM model, which shows how effective our classifier is at predicting the correct classes. For our model, this plot shows that the model predicts the majority of the classes correctly and that the misclassification rate is relatively low.

Model Feature Importances

One way to understand which factors matter when distinguishing between fraud and non-fraud providers is to look at model feature importances, which tell us which attributes contribute most to the model's classification decisions. In this section, we will look at feature importances for three models: XGBoost, LightGBM, and Random Forest.

Looking at the results from the XGBoost and LightGBM models, we can see that the same top four features contribute the most to decision-making: the attending physician (the primary doctor), the county, the state the patient is from, and the other physicians listed on the claim. If we look at the top features by weight for the Random Forest model, we again see the attending physician, the county, and the state the patient is from.
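Extracting these rankings takes only a few lines; a sketch with a scikit-learn Random Forest on synthetic data (the feature names are illustrative placeholders):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=600, n_features=6, n_informative=3,
                           random_state=7)
names = ["AttendingPhysician", "County", "State", "OtherPhysician",
         "ClaimDuration", "DeductiblePaid"]   # illustrative feature names

rf = RandomForestClassifier(random_state=7).fit(X, y)

# Rank features by impurity-based importance (highest first)
order = np.argsort(rf.feature_importances_)[::-1]
for i in order:
    print(f"{names[i]:<20} {rf.feature_importances_[i]:.3f}")
```

XGBoost and LightGBM expose the analogous attribute on their sklearn-style estimators, so the same ranking loop applies.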

SHAP Value Analysis

The last thing I looked at in terms of feature evaluation was the SHAP values of the important features in the models. SHAP values are calculated using a game-theoretic approach in which each feature is assigned a weight based on how much it contributes to the model's predictions. The first graph below shows the top features by SHAP value and their overall effect on the model. We will look at the results for the XGBoost and LightGBM models.

Both models share the same top five features: the patient's birth year, age, state and county, and the primary physician listed on the submitted claim. On the graph below, the x-axis shows the SHAP values plotted for each feature, indicating whether they push the model's prediction positively or negatively, and the colors (from red to blue) indicate whether the feature value is high or low.

We will also look at the top features ordered by total mean SHAP value for the linear models, which tell a different story. For these models, the total claim amount, the insurance claim amount reimbursed, and the deductible paid by the patient were the top influences on the prediction.


Conclusion

When I first started looking at this data, I looked at the beneficiaries/patients first. Some key insights gathered through this exploration were:

  • Certain beneficiaries listed below could be actively experiencing fraud or could be more susceptible to it:
    • Patients for whom high reimbursements were received.
    • Patients who have paid high deductibles.
    • Some of the aforementioned patients that have high chronic condition counts.

Next, I studied the fraud and non-fraud providers in the inpatient and outpatient datasets and identified the distinguishing characteristics between the two discussed in the study above.

One other thing to note is that possibly fraudulent providers could be more active in certain states and counties. A patient's age being in a certain range, the state/county they are from, their total claim amount, and who their primary doctor is could, in certain cases, make them more vulnerable to fraud. These features could also help investigators differentiate between fraud and non-fraud providers.

Future work:

When I started working on this project, I came to realize that I am only scratching the surface in terms of deciphering the black box of healthcare fraud detection. The possibilities are limitless in the types of work we could do and the areas we could focus on to zero in on fraudulent providers. Given more time, some things I would love to try are:

  • Duplicate claim investigation.
  • Doctor-Hospital Network Analysis.
  • Studying patterns in beneficiaries.
  • Conducting a market basket analysis.


References

  • The US Department of Justice, 2022. Justice Department Recovers over $3 Billion from False Claims Act Cases in Fiscal Year 2019. [online] Available at: <> [Accessed 28 June 2022].
  • Federal Bureau of Investigation, 2022. Health Care Fraud. [online] Available at: <> [Accessed 28 June 2022].
  • Medicare Advantage, 2022. What Are the Most Common Types of Medicare Fraud? [online] Available at: <> [Accessed 28 June 2022].
  • Gupta, R., 2022. Healthcare Provider Fraud Detection Analysis. [online] Kaggle. Available at: <> [Accessed 28 June 2022].

About Author

Suhita Acharya

NYCDSA Bootcamp graduate with a background in Environmental Sciences; previously worked as a QA Chemist at Environmental Standards, Inc.
