HealthCare Fraud Detection

Posted on Sep 7, 2021


Healthcare fraud is a prevalent issue in America, and it has severe consequences for consumers, increasing the average cost of healthcare across the board. In a country where healthcare is already a major issue and is inaccessible to many Americans due to high costs, big data and machine learning can be critical tools for sniffing out fraudsters and mitigating the problem.

From the National Health Care Anti-Fraud Association (NHCAA):

“In 2018, $3.6 trillion was spent on health care in the United States, representing billions of health insurance claims […]

A conservative estimate is 3% of total health care expenditures, while some government and law enforcement agencies place the loss as high as 10% of our annual health outlay, which could mean more than $300 billion.”

What does HealthCare Fraud look like?

The majority of health care fraud is committed by a small number of dishonest health care providers; we will see evidence of this later in this post.

Common Types

  • Billing for services that were never rendered
  • Billing for more expensive services or procedures than were actually provided or performed (“Upcoding”)
  • Performing services solely for the purpose of generating insurance payments
  • Falsifying a patient’s diagnosis and medical record to justify unnecessary tests

Introduction to Fraud Detection Process


  • 40k inpatient records for train data (diagnosis codes and reimbursement)
  • 520k outpatient records
  • 5.4k distinct hospitals labeled as potential fraud or not potential fraud
  • 9% potential fraud rate


  • Identify highly correlated variables to potential fraud providers
  • Compare values of highly correlated variables between fraud and non-fraud providers
  • Perform supervised learning on train dataset
  • Apply the model to predict fraud in test data

Feature Engineering

A major challenge in this project was consolidating the data sets into one data frame that I could apply the supervised learning models to.

Goal: each row is a unique service provider, and each column is a consolidated feature (average, count, range, etc.).

Important new features:

  • Average Age
  • Number of Days Admitted
  • Number of Doctors (Attending/Operating/Other)
  • Number of Beneficiaries (Patients)
  • Number of Claims
  • Number of Procedures
  • Number of Unique Diagnoses
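
The consolidation described above can be sketched as a group-and-aggregate step. This is a minimal illustration using only the Python standard library, not the code used in the project; the record layout and field names (`provider`, `beneficiary`, `age`, `days_admitted`, `diagnoses`) are hypothetical, and a data frame library would typically do the same work with a group-by.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical claim-level records; field names and values are illustrative only.
claims = [
    {"provider": "PRV001", "beneficiary": "B1", "age": 70,
     "days_admitted": 3, "diagnoses": ["D10", "D20"]},
    {"provider": "PRV001", "beneficiary": "B2", "age": 80,
     "days_admitted": 5, "diagnoses": ["D10"]},
    {"provider": "PRV002", "beneficiary": "B3", "age": 65,
     "days_admitted": 0, "diagnoses": ["D30"]},
]


def aggregate_by_provider(claims):
    """Collapse claim-level rows into one feature row per provider."""
    grouped = defaultdict(list)
    for claim in claims:
        grouped[claim["provider"]].append(claim)

    features = {}
    for provider, rows in grouped.items():
        features[provider] = {
            "num_claims": len(rows),
            "num_beneficiaries": len({r["beneficiary"] for r in rows}),
            "avg_age": mean(r["age"] for r in rows),
            "days_admitted": sum(r["days_admitted"] for r in rows),
            "num_unique_diagnoses": len(
                {code for r in rows for code in r["diagnoses"]}
            ),
        }
    return features


provider_features = aggregate_by_provider(claims)
for provider, row in provider_features.items():
    print(provider, row)
```

Each key of `provider_features` corresponds to one row of the consolidated table, and each inner key to one engineered column.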

Exploratory Data Analysis

I was curious about which of the features (existing or new) might highlight patterns useful for fraud detection. To start, I identified the features most highly correlated with the variable we wished to predict, "Potential Fraud".

Highly correlated variables

  • Insurance Claim Amt Reimbursed (.5755)
  • Deductible Amount Paid (.5320)
  • Number of Days Admitted (.526)
  • Number of Procedures (.53)
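
As a rough sketch of how such a ranking can be produced, the snippet below computes the Pearson correlation between each feature and a 0/1 fraud label and sorts the features by absolute correlation. The data values and feature names are hypothetical, and this assumes the reported figures are Pearson coefficients, which the post does not state.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical provider-level values; 1 marks a potential-fraud provider.
potential_fraud = [0, 0, 0, 1, 1]
features = {
    "claim_amt_reimbursed": [1000, 2000, 1500, 9000, 12000],
    "avg_age": [72, 70, 75, 71, 73],
}

correlations = {
    name: pearson(values, potential_fraud) for name, values in features.items()
}

# Rank features by the strength of their association with the label.
ranked = sorted(correlations, key=lambda name: abs(correlations[name]), reverse=True)
for name in ranked:
    print(f"{name}: {correlations[name]:.4f}")
```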

In the following diagrams, we can see that although potential-fraud service providers make up only ~9% of all hospitals, they account for the overwhelming majority of "Days Admitted" and "Number of Claims".

Potential Fraud vs Non-Potential Fraud against Claim Count and Days Admitted


Supervised Learning

SMOTE (Synthetic Minority Oversampling Technique)

As we can see in our sample of potential fraudsters and non-fraudsters, there is a severe class imbalance, with fraudsters as the minority class.









Applying SMOTE adds synthetic fraud cases to the sample so that the majority and minority classes are balanced:

New Sample of Dependent Feature Variable
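
To make the idea concrete, here is a minimal standard-library sketch of the interpolation step at the heart of SMOTE: pick a minority sample, pick one of its nearest minority-class neighbours, and create a new point somewhere on the line segment between them. It is an illustration of the technique, not the implementation used in this project; the function name `smote_sketch`, its parameters, and the data are all hypothetical.

```python
import random


def smote_sketch(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority-class neighbours."""
    rng = random.Random(seed)

    def sq_dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest neighbours of the base point within the minority class
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sq_dist(base, p),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Hypothetical scaled feature vectors for the minority (fraud) class.
fraud = [[0.9, 0.8], [0.8, 0.9], [1.0, 0.7], [0.7, 1.0]]
non_fraud_count = 40  # hypothetical size of the majority class

new_points = smote_sketch(fraud, n_new=non_fraud_count - len(fraud))
balanced_fraud = fraud + new_points
print(len(balanced_fraud), "fraud samples after oversampling")
```

Because every synthetic point lies between two real minority samples, the new cases stay inside the region the fraud class already occupies. Oversampling of this kind is normally applied to the training split only, so that evaluation data contains no synthetic rows.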


Gradient Boosting vs Random Forest on SMOTE sample

Accuracy Score for GB: .9058

Accuracy Score for RF: .9095

Note: Random Forest also classified both non-fraud and fraud cases more accurately than Gradient Boosting.
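
Per-class performance matters here because accuracy alone can hide poor fraud detection when classes are imbalanced. The sketch below, on hypothetical labels that mirror the ~9% fraud rate of the original data, shows that a model which always predicts "non-fraud" reaches 91% accuracy while catching no fraud at all. The helper names are illustrative and are not taken from the project.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def per_class_recall(y_true, y_pred):
    """For each class, the fraction of its true members predicted correctly."""
    recall = {}
    for label in sorted(set(y_true)):
        idx = [i for i, y in enumerate(y_true) if y == label]
        correct = sum(1 for i in idx if y_pred[i] == label)
        recall[label] = correct / len(idx)
    return recall


# Hypothetical labels: 91 non-fraud (0) and 9 fraud (1) providers.
y_true = [0] * 91 + [1] * 9
always_non_fraud = [0] * 100  # a trivial model that never flags fraud

baseline_accuracy = accuracy(y_true, always_non_fraud)
baseline_recall = per_class_recall(y_true, always_non_fraud)
print("accuracy:", baseline_accuracy)
print("recall per class:", baseline_recall)
```

Comparing recall for each class, in addition to the overall accuracy score, is one way to check that a model is better on both fraud and non-fraud cases.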

Feature Importance (Ranked)

  1. Insurance Claim Amount Reimbursed (.6)
  2. Number of Procedures (.2)
  3. Number of Days Admitted (.1)
  4. Number of Claims (.04)
  5. Deductible Amount Paid (.02)
  6. Chronic Stroke (.01)
  7. Number of Patients (.01)
  8. Chronic Arthritis (.01)
  9. Number of Unique Diagnoses (.01)

Future investigations

With more time and resources, I would be interested in exploring the socioeconomic demographics of the patients who are targeted in fraudulent cases. I would also explore the regions in which fraudulent service providers operate and uncover any patterns related to those investigations.

Closing Remarks

As this is my Capstone project, I just want to thank the NYCDSA for this incredible and challenging experience. I have truly learned a lot and am excited to apply these new tools and frameworks in my career as a technologist and beyond!


