Machine Learning for Fraud Detection in Healthcare
Introduction
Healthcare fraud costs billions of dollars annually. Fraudulent providers overcharge insurance, submit inflated claims, or game the reimbursement system. Detecting this behavior at scale is crucial for protecting taxpayers and healthcare systems.
In this project, we built a supervised machine learning pipeline to identify high-risk providers based on claim-level data. The emphasis was not only on accuracy but also on recall, explainability, and handling class imbalance.
Objective
Our goal was simple:
Flag potentially fraudulent healthcare providers using machine learning.
The primary challenge? Fraudulent providers make up only a small fraction of the dataset. This imbalance meant we had to be cautious of models that simply predict the majority class (non-fraud).
Data Overview
We used three main datasets:
- `Train.csv`: provider IDs and a fraud label (`PotentialFraud`)
- `Inpatient.csv`: inpatient claim-level data
- `Outpatient.csv`: outpatient claim-level data
Since `Train.csv` lacked provider-level features, we merged and aggregated data from the other files. All features were constructed at the provider level.
Note: some distributions in the dataset suggest synthetic or manufactured data, a limitation to keep in mind.
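As a rough illustration of the data preparation, the sketch below loads the three files and joins the claim-level records to the provider labels. It assumes a shared `Provider` key and DataFrame names (`train_df`, `inpatient_df`, `outpatient_df`) chosen for this example; exact column names may differ.

```python
import pandas as pd

# Load the three source files described above
train_df = pd.read_csv("Train.csv")            # Provider, PotentialFraud
inpatient_df = pd.read_csv("Inpatient.csv")    # inpatient claim-level records
outpatient_df = pd.read_csv("Outpatient.csv")  # outpatient claim-level records

# Tag each claim with its setting before stacking the two claim files
inpatient_df["ClaimType"] = "Inpatient"
outpatient_df["ClaimType"] = "Outpatient"
claims_df = pd.concat([inpatient_df, outpatient_df], ignore_index=True)

# Attach the provider-level fraud label to every claim (assumes a Provider key)
claims_df = claims_df.merge(train_df, on="Provider", how="inner")
```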
Feature Engineering
We engineered 20+ features to reflect billing patterns, visit frequency, and reimbursement behavior:
- Claim metrics: `Total_Reimbursed`, `Avg_IP_Claim`, `Avg_OP_Claim`
- Visit metrics: `Total_Visits`, `Inpatient_Visits`, `Outpatient_Visits`
- Ratios: `Claim_Ratio_IP_to_OP`, `Visit_Ratio_OP_to_Total`
- Temporal patterns: monthly averages and standard deviations of claim amounts
These features aimed to distinguish honest providers from those likely gaming the system.
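A minimal sketch of how provider-level features of this kind can be built with a pandas groupby, assuming the `claims_df` table from the previous sketch and claim columns named `ClaimID` and `InscClaimAmtReimbursed` (both assumptions):

```python
# Aggregate claim-level records into one row per provider
provider_feats = (
    claims_df.groupby("Provider")
    .agg(
        Total_Reimbursed=("InscClaimAmtReimbursed", "sum"),
        Total_Visits=("ClaimID", "count"),
        Inpatient_Visits=("ClaimType", lambda s: (s == "Inpatient").sum()),
        Outpatient_Visits=("ClaimType", lambda s: (s == "Outpatient").sum()),
    )
    .reset_index()
)

# Ratio features; a small epsilon avoids division by zero
eps = 1e-9
provider_feats["Visit_Ratio_OP_to_Total"] = (
    provider_feats["Outpatient_Visits"] / (provider_feats["Total_Visits"] + eps)
)
```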
Exploratory Data Analysis (EDA)
Class Imbalance
The vast majority of providers were non-fraudulent. With this imbalance, a model can achieve high accuracy while detecting little fraud, so we focused on metrics like recall and precision rather than accuracy alone.
Total Reimbursed (Boxplot)
Fraudulent providers tend to submit much higher reimbursement claims. Several extreme outliers suggest potential over-billing, a classic fraud indicator.
Here, the median and upper quartile for fraudulent providers are notably higher than for non-fraudulent ones, a red flag for billing irregularities.
Total Visits vs. Reimbursed (Scatter Plot)
In this log-scaled scatterplot, most fraudulent providers appear in the upper-right quadrant, meaning high visit volume and high reimbursement, further supporting our hypothesis.
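Plots of this kind can be reproduced with seaborn along the following lines; `provider_df` is a placeholder name for the provider-level feature table joined with the `PotentialFraud` label.

```python
import matplotlib.pyplot as plt
import seaborn as sns

fig, axes = plt.subplots(1, 2, figsize=(12, 4))

# Boxplot: total reimbursement by fraud label (log scale tames the outliers)
sns.boxplot(data=provider_df, x="PotentialFraud", y="Total_Reimbursed", ax=axes[0])
axes[0].set_yscale("log")

# Log-log scatter: visit volume vs. total reimbursement, coloured by label
sns.scatterplot(
    data=provider_df, x="Total_Visits", y="Total_Reimbursed",
    hue="PotentialFraud", alpha=0.5, ax=axes[1],
)
axes[1].set_xscale("log")
axes[1].set_yscale("log")

plt.tight_layout()
plt.show()
```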
Model Selection
We trained and compared the following supervised models:
- Logistic Regression (baseline)
- Gradient Boosting Classifier
- XGBoost (tuned with class weights)
- LightGBM (tuned with class weights)
- Stacked Model (XGBoost + LightGBM); a stacking sketch follows this list
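A minimal sketch of the stacked ensemble using scikit-learn's `StackingClassifier`; the logistic-regression meta-learner and the hyperparameters shown are assumptions, not the exact configuration used.

```python
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier

# XGBoost and LightGBM as base learners, logistic regression as meta-learner
stack = StackingClassifier(
    estimators=[
        ("xgb", XGBClassifier(n_estimators=300, max_depth=4, eval_metric="logloss")),
        ("lgbm", LGBMClassifier(n_estimators=300, num_leaves=31)),
    ],
    final_estimator=LogisticRegression(max_iter=1000),
    stack_method="predict_proba",  # meta-learner sees out-of-fold probabilities
    cv=5,
)
# stack.fit(X_train, y_train)
```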
To handle imbalance:
- Used `class_weight='balanced'` or `scale_pos_weight` (see the sketch below)
- Tuned the decision threshold (e.g., 0.65) to balance precision against recall
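In code, the weighting looks roughly like this; training matrices `X_train`/`y_train` with 0/1 labels are assumed, and the hyperparameters are illustrative.

```python
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier

# Weight the rare fraud class by the negative-to-positive ratio of the labels
pos_weight = (y_train == 0).sum() / (y_train == 1).sum()

xgb = XGBClassifier(scale_pos_weight=pos_weight, eval_metric="logloss")
lgbm = LGBMClassifier(class_weight="balanced")

xgb.fit(X_train, y_train)
lgbm.fit(X_train, y_train)
```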
Model Comparison
| Model | Accuracy | Precision | Recall | F1 Score |
|---|---|---|---|---|
| Logistic Regression | 91.5% | n/a | <50% | n/a |
| Gradient Boosting Classifier | 92.3% | 76.0% | 56.0% | n/a |
| XGBoost (Tuned) | 92.3% | 57.6% | 67.3% | 62.1% |
| LightGBM (Tuned) | 91.7% | 54.3% | 69.3% | 60.9% |
| Stacked Model | 93.3% | 69.3% | 51.5% | 59.1% |
- Best recall: LightGBM
- Best balance: XGBoost
- Stacked Model: high precision, but low recall (not ideal for fraud detection)
Conclusion: LightGBM had the best recall, while XGBoost offered the best overall balance. The stacked model had higher precision but lower recall, making it less suitable for a fraud use case where recall is critical.
---
Threshold Tuning (XGBoost)
We adjusted the classification threshold from 0.5 to 0.65 for XGBoost:
- Precision: 64.5%
- Recall: 59.4%
- F1 Score: 61.9%
Raising the threshold traded some recall for higher precision, reducing false positives while still catching a majority of fraudulent providers.
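The threshold adjustment itself is a one-liner on top of predicted probabilities; a sketch, assuming the fitted `xgb` model from the earlier snippet and a held-out `X_test`/`y_test` split:

```python
from sklearn.metrics import precision_score, recall_score, f1_score

# Flag a provider only when the predicted fraud probability exceeds 0.65
proba = xgb.predict_proba(X_test)[:, 1]
y_pred = (proba >= 0.65).astype(int)

print("Precision:", precision_score(y_test, y_pred))
print("Recall:   ", recall_score(y_test, y_pred))
print("F1 Score: ", f1_score(y_test, y_pred))
```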
Confusion Matrices
Confusion matrices were plotted for the tuned models, the threshold-tuned XGBoost, and the stacked model.
ROC & PR Curves
ROC Curve
Both XGBoost and LightGBM demonstrate strong AUC performance, indicating good separation between fraud and non-fraud.
Precision-Recall Curve
PR curves highlight how XGBoost sustains higher precision across recall levels, which is crucial when fraudulent cases are rare.
Feature Importance
- `Total_Reimbursed` is the strongest fraud signal for both models.
- XGBoost emphasizes `Avg_OP_Claim` and `Claim_Ratio_OP_to_Total`.
- LightGBM leans toward visit ratios and inpatient claims.
These align with fraud intuition: excessive billing or unusually skewed claim distributions raise red flags.
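These rankings come straight from the fitted models; for example, XGBoost's built-in importances can be listed like this (assuming the `xgb` model from earlier and that `X_train` is a pandas DataFrame):

```python
import pandas as pd

# Rank features by the model's built-in importance scores
importances = (
    pd.Series(xgb.feature_importances_, index=X_train.columns)
    .sort_values(ascending=False)
)
print(importances.head(10))
```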
SHAP for Explainability
Why SHAP?
SHAP (SHapley Additive exPlanations) quantifies how much each feature pushes an individual prediction toward fraud or non-fraud.
Log Odds for Classification
SHAP values in classification reflect log odds:
- A positive value pushes the prediction toward fraud
- A negative value pushes it toward non-fraud
- Values are additive, starting from a base value (the average model output)
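A minimal SHAP sketch for the XGBoost model (the same pattern works for LightGBM); `xgb` and `X_test` are assumed from the earlier sketches.

```python
import shap

# TreeExplainer is exact and fast for gradient-boosted trees;
# for a binary classifier the resulting values are in log-odds space
explainer = shap.TreeExplainer(xgb)
shap_values = explainer.shap_values(X_test)

# Beeswarm summary: ranks features and shows the direction of their effect
shap.summary_plot(shap_values, X_test)
```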
SHAP Summary: XGBoost
Key drivers:
- `Total_Reimbursed`
- `Avg_OP_Claim`
- `Claim_Ratio_OP_to_Total`
SHAP Summary: LightGBM
LightGBM reveals different nuances:
- `Total_Reimbursed` still dominates
- Variability in monthly claim patterns plays a major role
Conclusion
This project shows how supervised machine learning can improve fraud detection:
- Engineered meaningful, interpretable features
- Tuned multiple models and thresholds
- Used SHAP to make model outputs explainable
XGBoost provided the best tradeoff between precision and recall, making it our recommended model for deployment.
Future Work
- Integrate autoencoders or isolation forests as anomaly filters
- Add time-series features to detect seasonal fraud trends
- Use cost-sensitive learning to penalize missed fraud cases
- Deploy models via real-time APIs or batch prediction systems
## GitHub Repository
[GitHub Link]