Data Analysis on Healthcare to Detect Frauds

Abhi Singh

Posted on Sep 7, 2021

The skills I demonstrated here can be learned through the Data Science with Machine Learning bootcamp at NYC Data Science Academy.

Introduction

Healthcare fraud is a prevalent issue in America, and it has severe consequences for consumers, increasing the average cost of healthcare across the board. In a country where healthcare is already a major issue and is inaccessible to many Americans because of high costs, big data and machine learning can be critical tools for sniffing out fraudsters and mitigating the problem.

From the National Health Care Anti-Fraud Association (NHCAA):

“In 2018, $3.6 trillion was spent on health care in the United States, representing billions of health insurance claims […]

A conservative estimate is 3% of total health care expenditures, while some government and law enforcement agencies place the loss as high as 10% of our annual health outlay, which could mean more than $300 billion.”
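To put those percentages in dollar terms, here is a quick back-of-the-envelope calculation using only the figures from the quote. The variable names are my own.

```python
# Figures taken from the NHCAA quote above, expressed in billions of dollars.
total_spend_billions = 3600  # $3.6 trillion spent on health care in 2018

# Conservative estimate: 3% of total expenditures lost to fraud.
low_estimate_billions = total_spend_billions * 3 // 100

# High estimate: 10% of total expenditures lost to fraud.
high_estimate_billions = total_spend_billions * 10 // 100

print(f"Conservative estimate: ${low_estimate_billions} billion")
print(f"High estimate: ${high_estimate_billions} billion")
```

The high end works out to $360 billion, which is consistent with the quote's "more than $300 billion".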

What does HealthCare Fraud look like?

The majority of health care fraud is committed by a small number of dishonest health care providers; we will see evidence of this later in this post.

Common Types

Billing for services that were never rendered
Billing for more expensive services or procedures than were actually performed (“Upcoding”)
Performing services solely for the purpose of generating insurance payments
Falsifying a patient’s diagnosis and medical record to justify unnecessary tests

Introduction to Fraud Detection Process

Dataset

40k inpatient records for train data (diagnosis codes and reimbursement)
520k outpatient records
5.4k distinct hospitals labeled as potential fraud or not potential fraud
9% potential fraud rate

Process

Identify highly correlated variables to potential fraud providers
Compare values of the highly correlated variables between fraud and non-fraud providers
Perform supervised learning on train dataset
Apply the model to predict fraud in test data

Data on Feature Engineering

A major challenge in this project was consolidating the data sets into a single data frame to which I could apply the supervised learning models.

Goal: each row is a unique service provider, and each column is a consolidated feature (average, count, range, etc.)

Important new features:

Average Age
Number of Days Admitted
Number of Doctors (Attending/Operating/Other)
Number of Beneficiaries (Patients)
Number of Claims
Number of Procedures
Total Number of Unique Diagnoses
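The consolidation described above can be sketched as a group-by over claim records. This is a minimal illustration in plain Python, not the code I used in the project; in practice pandas' groupby/agg would be the usual tool. The records and field names (provider, beneficiary, age, days_admitted, reimbursed) are hypothetical stand-ins for the dataset's real columns.

```python
from collections import defaultdict

# Hypothetical claim-level records; field names are illustrative only.
claims = [
    {"provider": "PRV001", "beneficiary": "B1", "age": 70, "days_admitted": 3, "reimbursed": 5000},
    {"provider": "PRV001", "beneficiary": "B2", "age": 80, "days_admitted": 5, "reimbursed": 7000},
    {"provider": "PRV002", "beneficiary": "B1", "age": 70, "days_admitted": 0, "reimbursed": 200},
]


def build_provider_features(claims):
    """Collapse claim-level rows into one feature row per provider."""
    grouped = defaultdict(list)
    for claim in claims:
        grouped[claim["provider"]].append(claim)

    features = {}
    for provider, rows in grouped.items():
        features[provider] = {
            "num_claims": len(rows),
            # Count distinct patients, not rows, since a patient can have many claims.
            "num_beneficiaries": len({r["beneficiary"] for r in rows}),
            "avg_age": sum(r["age"] for r in rows) / len(rows),
            "total_days_admitted": sum(r["days_admitted"] for r in rows),
            "total_reimbursed": sum(r["reimbursed"] for r in rows),
        }
    return features


provider_features = build_provider_features(claims)
for provider, row in provider_features.items():
    print(provider, row)
```

Each provider ends up with exactly one row, which is the shape a supervised model needs when the fraud label is assigned per provider rather than per claim.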

Exploratory Data Analysis

I was curious about which of the features (existing/new) might highlight patterns useful for fraud detection. To start, I identified the features most highly correlated with the variable we wished to predict, "Potential Fraud".

Highly correlated data variables

Insurance Claim Amt Reimbursed (.5755)
Deductible Amount Paid (.5320)
Number of Days Admitted (.526)
Number of Procedures (.53)
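As a rough sketch of how such a ranking can be produced, the snippet below computes the Pearson correlation between each feature and a 0/1 fraud label and sorts by absolute value. I am assuming Pearson correlation here for illustration; the feature values and labels are made up and do not come from the dataset.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical per-provider data: 1 = potential fraud, 0 = not.
potential_fraud = [0, 0, 0, 1, 0, 1]
features = {
    "total_reimbursed": [1000, 1500, 1200, 9000, 800, 7000],
    "avg_age": [72, 74, 71, 73, 75, 72],
}

# Rank features by the strength (absolute value) of their correlation with the label.
ranked = sorted(
    ((name, pearson(values, potential_fraud)) for name, values in features.items()),
    key=lambda item: abs(item[1]),
    reverse=True,
)

for name, r in ranked:
    print(f"{name}: {r:.3f}")
```

In this toy data the reimbursement total tracks the label closely while average age does not, which mirrors the kind of separation the list above reports.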

In the following diagrams, we can see that although Potential Fraud service providers make up only ~9% of total hospitals, they account for the overwhelming majority of "Days Admitted" and "Number of Claims".

Potential Fraud vs Non-Potential Fraud against Claim Count and Days Admitted


Data on Supervised Learning

SMOTE (Synthetic Minority Oversampling Technique)

As we can see in our sample of potential fraudsters versus non-fraudsters, there is a severe class imbalance: the minority class (fraudsters) is heavily underrepresented.

Intuitively, we know applying SMOTE can help us increase the number of fraudsters in the sample and balance the majority and minority class:

New Sample of Dependent Feature Variable
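To make the idea behind SMOTE concrete, here is a from-scratch sketch of its core step: pick a minority point, pick one of its nearest minority neighbors, and create a new point somewhere on the line between them. This is a simplified illustration, not the library implementation I used; the function name, parameters, and data points are all hypothetical.

```python
import math
import random


def smote_sketch(minority, n_synthetic, k=2, seed=0):
    """Create n_synthetic points by interpolating between minority-class neighbors.

    Assumes `minority` holds at least two points, each a tuple of floats.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_synthetic):
        base = rng.choice(minority)
        # The k nearest minority neighbors of the base point, excluding itself.
        neighbors = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        neighbor = rng.choice(neighbors)
        gap = rng.random()  # how far to move from base toward neighbor, in [0, 1)
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbor)))
    return synthetic


# Hypothetical 2-feature samples: 10 non-fraud points and 3 fraud points.
majority = [(float(i), float(i % 3)) for i in range(10)]
minority = [(8.0, 9.0), (9.0, 8.0), (8.5, 8.5)]

new_points = smote_sketch(minority, n_synthetic=len(majority) - len(minority))
balanced_minority = minority + new_points

print(len(majority), len(balanced_minority))
```

Because every synthetic point lies between two real minority points, the new samples stay inside the region the fraud class already occupies. Oversampling is usually applied to the training split only, so that evaluation still reflects the original class balance.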

Gradient Boosting vs Random Forest on SMOTE sample

Accuracy Score for GB: .9058

Accuracy Score for RF: .9095

Note: Random Forest was also better at classifying both non-fraud and fraud cases
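Accuracy alone can hide how a model treats each class, which is why the per-class comparison in the note matters. The sketch below uses made-up labels and predictions (not my actual model output) to show two hypothetical models with the same accuracy but very different fraud recall.

```python
def evaluate(y_true, y_pred):
    """Return accuracy plus recall for each class (1 = fraud, 0 = non-fraud).

    Assumes both classes appear at least once in y_true.
    """
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return {
        "accuracy": (tp + tn) / len(y_true),
        "fraud_recall": tp / (tp + fn),
        "non_fraud_recall": tn / (tn + fp),
    }


# Hypothetical ground truth: 8 non-fraud providers followed by 2 fraud providers.
y_true = [0] * 8 + [1] * 2

# Model A never flags fraud; model B catches both fraud cases but raises 2 false alarms.
pred_a = [0] * 10
pred_b = [0] * 6 + [1] * 2 + [1] * 2

scores_a = evaluate(y_true, pred_a)
scores_b = evaluate(y_true, pred_b)

print("Model A:", scores_a)
print("Model B:", scores_b)
```

Both hypothetical models score 0.8 on accuracy, yet only model B finds any fraud, so comparing recall per class gives a fuller picture than accuracy by itself.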

Feature Importance (Ranked)

Insurance Claim Amount Reimbursed (.6)
Number of Procedures (.2)
Number of Days Admitted (.1)
Number of Claims (.04)
Deductible Amount Paid (.02)
Chronic Stroke (.01)
Number of Patients (.01)
Chronic Arthritis (.01)
Number of Unique Diagnoses (.01)
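Taking the rounded scores in the list above at face value, a short running total shows how concentrated the importance is. The dictionary simply restates the list; the variable names are my own.

```python
# Rounded importance scores as listed above.
importances = {
    "Insurance Claim Amount Reimbursed": 0.6,
    "Number of Procedures": 0.2,
    "Number of Days Admitted": 0.1,
    "Number of Claims": 0.04,
    "Deductible Amount Paid": 0.02,
    "Chronic Stroke": 0.01,
    "Number of Patients": 0.01,
    "Chronic Arthritis": 0.01,
    "Number of Unique Diagnoses": 0.01,
}

# Sort from most to least important (ties keep their listed order).
ranked = sorted(importances.items(), key=lambda item: item[1], reverse=True)

# Running total of importance, rounded to avoid floating-point noise.
running = 0.0
cumulative = []
for name, score in ranked:
    running += score
    cumulative.append((name, round(running, 2)))

for name, total in cumulative:
    print(f"{total:.2f}  {name}")
```

With these rounded values, the top three features account for about 0.9 of the total importance.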

Future investigations

With more time and resources, I would be interested in exploring the socioeconomic demographics of the patients who are targeted in fraudulent cases. I would also explore the regions in which fraudulent service providers operate and uncover any patterns related to those investigations.

Closing Remarks

As this is my Capstone project, I just want to thank the NYCDSA for this incredible and challenging experience. I have truly learned a lot and am excited to apply these new tools and frameworks in my career as a technologist and beyond!

Cheers,

Abhi

