Data Analysis on Healthcare to Detect Frauds

Posted on Sep 7, 2021
The skills I demoed here can be learned through taking Data Science with Machine Learning bootcamp with NYC Data Science Academy.


HealthCare fraud is a prevalent issue in America and it is severe consequences to consumers, increasing the average cost of healthcare across the board. In a country where HealthCare is already a major issue and is inaccessible to many Americans due to high costs, using big data and machine learning can be a critical solution to sniffing out Fraudsters and mitigating the problem.

From the National Care Anti-Fraud Association (NHCAA):

β€œIn 2018, $3.6 trillion was spent on health care in the United States, representing billions health insurance claims […]

A conservative estimate is 3% of total health care expenditures, while some government and law enforcement agencies place the loss as high as 10% of our annual health outlay, which could mean more than $300 billion.”

What does HealthCare Fraud look like?

The majority of health care fraud is committed by a small number of dishonest health care providers, we will see evidence of this further along this post.

Common Types

  • Billing for services that were never rendered
  • Billing for more expensive services or procedures that were never performed (β€œUpcoding”)
  • Performing services solely for the purpose of generating insurance payments
  • Falsifying a patient’s diagnosis and medical record to justify unnecessary tests

Introduction to Fraud Detection Process


  • 40k inpatient records for train data (diagnosis codes and reimbursement)
  • 520k outpatient records
  • 5.4k distinct hospitals labeled as potential fraud or not potential fraud
  • 9% potential fraud rate


  • Identify highly correlated variables to potential fraud providers
  • Compare values of highly correlated between fraud and non-fraud
  • Perform supervised learning on train dataset
  • Apply the model to predict fraud in test data

Data on Feature Engineering

A major challenge in this project was to consolidate the data sets into one data frame I could apply the supervised learning models to.

Goal: Each row was a unique service provider, and each column was a consolidated feature (average, count, range, etc)

Important new features:

  • Average Age
  • Number of Days Admitted
  • Number of Doctors (Attending/Operating/Other)
  • Amount of Beneficiaries (Patients)
  • Number of Claims
  • Number of Procedures
  • Total Number of UniqueΒ Diagnoses

Exploratory Data Analysis

I was curious about which of the features (existing/new) might serve to highlight any patterns regarding fraud detection. To start out with, I uncovered highly correlated features against the variable we wished to predict, "Potential Fraud"

Highly correlated data variables

  • Insurance Claim Amt Reimbursed (.5755)
  • Deductible Amount Paid (.5320)
  • Number of Days Admitted (.526)
  • Number of Procedures (.53)

In the following diagrams, we can see that although Potential Fraud service providers only make up ~9% of total hospitals they represent the overwhelming majority of "Days Admitted" and "Number of Claims" into the hospital.

Potential Fraud vs Non-Potential Fraud against Claim Count and Days Admitted

Data Analysis on Healthcare to Detect Frauds

Data Analysis on Healthcare to Detect Frauds Data Analysis on Healthcare to Detect Frauds


Data on Supervised Learning

SMOTE (Synthetic Minority Oversampling Technique)

As we can see in our sample of Potential Fraudsters to Non-fraudsters there is a severe imbalance in the minority class (fraudsters)









Intuitively, we know applying SMOTE can help us increase the number of fraudsters in the sample and balance the majority and minority class:

New Sample of Dependent Feature Variable


Gradient Boosting vs Random Forest on SMOTE sample

Accuracy Score for GB: .9058

Accuracy Score for RF: .9095

Note: Random Forest was also better at classifying both non-fraud and fraud cases

Feature Importance (Ranked)

  1. Insurance Claim Amount Reimbursed (.6)
  2. Number of Procedures (.2)
  3. Number of Days Admitted (.1)
  4. Amount of Claims (.04)
  5. Deductible Amount Paid (.02)
  6. Chronic Stroke (.01)
  7. Number of Patients (.01)
  8. Chronic Arthritis (.01)
  9. Number of Unique Diagnosis (.01)

Future investigations

With more time and resources, I would be interested in exploring the socioeconomic demographics of the patients that are targeted in fraudulent cases. Also, explore the various regions fraudulent service providers operate in and uncover and patterns related to those investigations.

Closing Remarks

As this is my Capstone project, I just want to thank the NYCDSA for this incredible and challenging experience. I have truly learned a lot and am excited to apply these new tools and frameworks in my career as a technologist and beyond!



About Author

Related Articles

Leave a Comment

No comments found.

View Posts by Categories

Our Recent Popular Posts

View Posts by Tags

#python #trainwithnycdsa 2019 2020 Revenue 3-points agriculture air quality airbnb airline alcohol Alex Baransky algorithm alumni Alumni Interview Alumni Reviews Alumni Spotlight alumni story Alumnus ames dataset ames housing dataset apartment rent API Application artist aws bank loans beautiful soup Best Bootcamp Best Data Science 2019 Best Data Science Bootcamp Best Data Science Bootcamp 2020 Best Ranked Big Data Book Launch Book-Signing bootcamp Bootcamp Alumni Bootcamp Prep boston safety Bundles cake recipe California Cancer Research capstone car price Career Career Day citibike classic cars classpass clustering Coding Course Demo Course Report covid 19 credit credit card crime frequency crops D3.js data data analysis Data Analyst data analytics data for tripadvisor reviews data science Data Science Academy Data Science Bootcamp Data science jobs Data Science Reviews Data Scientist Data Scientist Jobs data visualization database Deep Learning Demo Day Discount disney dplyr drug data e-commerce economy employee employee burnout employer networking environment feature engineering Finance Financial Data Science fitness studio Flask flight delay gbm Get Hired ggplot2 googleVis H20 Hadoop hallmark holiday movie happiness healthcare frauds higgs boson Hiring hiring partner events Hiring Partners hotels housing housing data housing predictions housing price hy-vee Income Industry Experts Injuries Instructor Blog Instructor Interview insurance italki Job Job Placement Jobs Jon Krohn JP Morgan Chase Kaggle Kickstarter las vegas airport lasso regression Lead Data Scienctist Lead Data Scientist leaflet league linear regression Logistic Regression machine learning Maps market matplotlib Medical Research Meet the team meetup methal health miami beach movie music Napoli NBA netflix Networking neural network Neural networks New Courses NHL nlp NYC NYC Data Science nyc data science academy NYC Open Data nyc property NYCDSA NYCDSA Alumni Online Online Bootcamp Online Training Open Data painter pandas Part-time performance phoenix pollutants Portfolio Development precision measurement prediction Prework Programming public safety PwC python Python Data Analysis python machine learning python scrapy python web scraping python webscraping Python Workshop R R Data Analysis R language R Programming R Shiny r studio R Visualization R Workshop R-bloggers random forest Ranking recommendation recommendation system regression Remote remote data science bootcamp Scrapy scrapy visualization seaborn seafood type Selenium sentiment analysis sentiment classification Shiny Shiny Dashboard Spark Special Special Summer Sports statistics streaming Student Interview Student Showcase SVM Switchup Tableau teachers team team performance TensorFlow Testimonial tf-idf Top Data Science Bootcamp Top manufacturing companies Transfers tweets twitter videos visualization wallstreet wallstreetbets web scraping Weekend Course What to expect whiskey whiskeyadvocate wildfire word cloud word2vec XGBoost yelp youtube trending ZORI