Data Analysis on Healthcare to Detect Frauds
The skills I demonstrate here can be learned by taking the Data Science with Machine Learning bootcamp at NYC Data Science Academy.
HealthCare fraud is a prevalent issue in America, and it has severe consequences for consumers, increasing the average cost of healthcare across the board. In a country where HealthCare is already a major issue and is inaccessible to many Americans due to high costs, big data and machine learning can be critical tools for sniffing out fraudsters and mitigating the problem.
From the National Health Care Anti-Fraud Association (NHCAA):
“In 2018, $3.6 trillion was spent on health care in the United States, representing billions of health insurance claims […]
A conservative estimate is 3% of total health care expenditures, while some government and law enforcement agencies place the loss as high as 10% of our annual health outlay, which could mean more than $300 billion.”
What does HealthCare Fraud look like?
The majority of health care fraud is committed by a small number of dishonest health care providers; we will see evidence of this later in this post. Common forms include:
- Billing for services that were never rendered
- Billing for more expensive services or procedures than were actually performed (“Upcoding”)
- Performing services solely for the purpose of generating insurance payments
- Falsifying a patient’s diagnosis and medical record to justify unnecessary tests
Introduction to Fraud Detection Process
- 40k inpatient records in the training data (diagnosis codes and reimbursement)
- 520k outpatient records
- 5.4k distinct hospitals labeled as potential fraud or not potential fraud
- 9% potential fraud rate
- Identify highly correlated variables to potential fraud providers
- Compare values of the highly correlated variables between fraud and non-fraud providers
- Perform supervised learning on train dataset
- Apply the model to predict fraud in test data
Data on Feature Engineering
A major challenge in this project was to consolidate the data sets into one data frame I could apply the supervised learning models to.
Goal: each row is a unique service provider, and each column is a consolidated feature (average, count, range, etc.)
Important new features:
- Average Age
- Number of Days Admitted
- Number of Doctors (Attending/Operating/Other)
- Number of Beneficiaries (Patients)
- Number of Claims
- Number of Procedures
- Total Number of Unique Diagnoses
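The consolidation described above amounts to a group-by over claim records. Here is a minimal sketch in plain Python, not the project's actual code: the field names (`provider`, `beneficiary`, `age`, `days_admitted`) and the sample rows are hypothetical, and on a real data frame a pandas `groupby(...).agg(...)` would do the same job.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical claim records; field names and values are illustrative only.
claims = [
    {"provider": "P1", "beneficiary": "B1", "age": 70, "days_admitted": 3},
    {"provider": "P1", "beneficiary": "B2", "age": 80, "days_admitted": 5},
    {"provider": "P1", "beneficiary": "B1", "age": 70, "days_admitted": 0},
    {"provider": "P2", "beneficiary": "B3", "age": 65, "days_admitted": 1},
]

def build_provider_features(claims):
    """Collapse claim-level rows into one feature row per provider."""
    grouped = defaultdict(list)
    for claim in claims:
        grouped[claim["provider"]].append(claim)

    features = {}
    for provider, rows in grouped.items():
        features[provider] = {
            "num_claims": len(rows),
            # A set keeps each patient once, however many claims they have.
            "num_beneficiaries": len({r["beneficiary"] for r in rows}),
            # Averaged over claims here; averaging over unique patients
            # is an equally reasonable choice.
            "avg_age": mean(r["age"] for r in rows),
            "days_admitted": sum(r["days_admitted"] for r in rows),
        }
    return features

provider_features = build_provider_features(claims)
```

Each key of `provider_features` is one provider, which matches the goal of one row per service provider and one column per consolidated feature.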
Exploratory Data Analysis
I was curious about which of the features (existing or new) might highlight patterns useful for fraud detection. To start, I identified the features most highly correlated with the variable we wished to predict, "Potential Fraud".
Highly correlated data variables
- Insurance Claim Amt Reimbursed (.5755)
- Deductible Amount Paid (.5320)
- Number of Days Admitted (.526)
- Number of Procedures (.53)
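A ranking like the one above can be produced by computing the Pearson correlation between each per-provider feature and the 0/1 fraud label. The sketch below assumes that approach (the post does not say which correlation measure was used), and the feature names and values are made up for illustration.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical per-provider features (one value per provider).
features = {
    "num_claims": [10, 12, 95, 8, 110, 15],
    "avg_age": [72, 74, 73, 71, 75, 70],
}
# Hypothetical label: 1 = potential fraud, 0 = not potential fraud.
potential_fraud = [0, 0, 1, 0, 1, 0]

correlations = {
    name: pearson(values, potential_fraud) for name, values in features.items()
}
# Rank features by the strength of their correlation with the label.
ranked = sorted(correlations, key=lambda name: abs(correlations[name]), reverse=True)
```

Sorting by absolute value keeps strongly negative correlations near the top as well, since they are just as informative as positive ones.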
In the following diagrams, we can see that although Potential Fraud service providers make up only ~9% of all hospitals, they account for the overwhelming majority of days admitted and number of claims.
Potential Fraud vs Non-Potential Fraud against Claim Count and Days Admitted
Data on Supervised Learning
SMOTE (Synthetic Minority Oversampling Technique)
As we can see in our sample of potential fraudsters and non-fraudsters, the classes are severely imbalanced, with fraudsters as the minority class.
SMOTE generates synthetic minority-class examples, so applying it increases the number of fraudsters in the sample and balances the majority and minority classes:
New Sample of Dependent Feature Variable
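To make the idea concrete, here is a minimal sketch of the interpolation step at the heart of SMOTE: pick a minority point, pick one of its nearest minority neighbours, and create a new point somewhere on the line between them. This is an illustration only, not the implementation used in this project (in practice a library such as imbalanced-learn, with its `SMOTE` class, is the usual choice). The function name `smote_sketch`, its parameters, and the sample points are all hypothetical.

```python
import random
from math import dist

def smote_sketch(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # The k nearest minority neighbours of the base point (excluding itself).
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: dist(base, p),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to neighbour
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic

# Hypothetical scaled feature vectors: 8 non-fraud and 3 fraud providers.
non_fraud = [(0.1 * i, 0.2) for i in range(8)]
fraud = [(0.9, 0.8), (1.0, 0.9), (0.8, 1.0)]

# Generate enough synthetic fraud points to match the majority class.
new_fraud = smote_sketch(fraud, n_new=len(non_fraud) - len(fraud))
balanced_fraud = fraud + new_fraud
```

Because every synthetic point lies between two real fraud points, the oversampled class stays inside the region the real minority examples occupy. Oversampling is normally applied to the training split only, so that evaluation still reflects real providers.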
Gradient Boosting vs Random Forest on SMOTE sample
Accuracy Score for GB: .9058
Accuracy Score for RF: .9095
Note: Random Forest was also better at classifying both non-fraud and fraud cases.
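Accuracy on its own says little about how each class is handled, which is why the per-class comparison matters. On data with the original 9% fraud rate, a model that labels every provider as non-fraud would already score about 91% accuracy while catching no fraud at all. The sketch below shows how per-class recall exposes this; the labels are hypothetical and the helper names are my own.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, tn, fp, fn), treating 1 (fraud) as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn

def summarize(y_true, y_pred):
    """Accuracy plus per-class recall and fraud precision."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    return {
        "accuracy": (tp + tn) / len(y_true),
        "fraud_recall": tp / (tp + fn) if tp + fn else 0.0,
        "non_fraud_recall": tn / (tn + fp) if tn + fp else 0.0,
        "fraud_precision": tp / (tp + fp) if tp + fp else 0.0,
    }

# Hypothetical hold-out labels: 9 fraud providers out of 100 (a 9% fraud rate).
y_true = [1] * 9 + [0] * 91

# A trivial "model" that never flags anyone.
always_non_fraud = [0] * 100
baseline = summarize(y_true, always_non_fraud)
```

Here `baseline["accuracy"]` is 0.91 even though `baseline["fraud_recall"]` is 0.0, so comparing models on fraud recall and precision alongside accuracy gives a fairer picture.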
Feature Importance (Ranked)
- Insurance Claim Amount Reimbursed (.6)
- Number of Procedures (.2)
- Number of Days Admitted (.1)
- Number of Claims (.04)
- Deductible Amount Paid (.02)
- Chronic Stroke (.01)
- Number of Patients (.01)
- Chronic Arthritis (.01)
- Number of Unique Diagnoses (.01)
With more time and resources, I would be interested in exploring the socioeconomic demographics of the patients who are targeted in fraudulent cases. I would also like to explore the regions in which fraudulent service providers operate and uncover any patterns related to those investigations.
As this is my Capstone project, I just want to thank the NYCDSA for this incredible and challenging experience. I have truly learned a lot and am excited to apply these new tools and frameworks in my career as a technologist and beyond!