Machine Learning for Fraud Detection in Healthcare
Introduction
Healthcare fraud costs billions of dollars annually. Fraudulent providers overcharge insurance, submit inflated claims, or game the reimbursement system. Detecting this behavior at scale is crucial for protecting taxpayers and healthcare systems.
In this project, we built a supervised machine learning pipeline to identify high-risk providers based on claim-level data. The emphasis was not only on accuracy but also on recall, explainability, and handling class imbalance.
Objective
Our goal was simple:
Flag potentially fraudulent healthcare providers using machine learning.
The primary challenge? Fraudulent providers make up only a small fraction of the dataset. This imbalance meant we had to be cautious of models that simply predict the majority class (non-fraud).
Data Overview
We used three main datasets:
- `Train.csv`: provider IDs and a fraud label (`PotentialFraud`)
- `Inpatient.csv`: inpatient claim-level data
- `Outpatient.csv`: outpatient claim-level data
Since `Train.csv` lacked provider-level features, we merged and aggregated data from the other files. All features were constructed at the provider level.
Note: some distributions in the dataset suggest synthetic or manufactured data, a limitation to keep in mind.
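As a rough illustration of the data preparation, the sketch below loads the three files and joins the claim-level records to the provider labels. It assumes a shared `Provider` key and DataFrame names (`train_df`, `inpatient_df`, `outpatient_df`) chosen for this example; exact column names may differ.

```python
import pandas as pd

# Load the three source files described above
train_df = pd.read_csv("Train.csv")            # Provider, PotentialFraud
inpatient_df = pd.read_csv("Inpatient.csv")    # inpatient claim-level records
outpatient_df = pd.read_csv("Outpatient.csv")  # outpatient claim-level records

# Tag each claim with its setting before stacking the two claim files
inpatient_df["ClaimType"] = "Inpatient"
outpatient_df["ClaimType"] = "Outpatient"
claims_df = pd.concat([inpatient_df, outpatient_df], ignore_index=True)

# Attach the provider-level fraud label to every claim (assumes a Provider key)
claims_df = claims_df.merge(train_df, on="Provider", how="inner")
```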
Feature Engineering
We engineered 20+ features to reflect billing patterns, visit frequency, and reimbursement behavior:
- Claim metrics: `Total_Reimbursed`, `Avg_IP_Claim`, `Avg_OP_Claim`
- Visit metrics: `Total_Visits`, `Inpatient_Visits`, `Outpatient_Visits`
- Ratios: `Claim_Ratio_IP_to_OP`, `Visit_Ratio_OP_to_Total`
- Temporal patterns: monthly averages and standard deviations of claim amounts
These features aimed to distinguish honest providers from those likely gaming the system.
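A minimal sketch of how provider-level features of this kind can be built with a pandas groupby, assuming the `claims_df` table from the previous sketch and claim columns named `ClaimID` and `InscClaimAmtReimbursed` (both assumptions):

```python
# Aggregate claim-level records into one row per provider
provider_feats = (
    claims_df.groupby("Provider")
    .agg(
        Total_Reimbursed=("InscClaimAmtReimbursed", "sum"),
        Total_Visits=("ClaimID", "count"),
        Inpatient_Visits=("ClaimType", lambda s: (s == "Inpatient").sum()),
        Outpatient_Visits=("ClaimType", lambda s: (s == "Outpatient").sum()),
    )
    .reset_index()
)

# Ratio features; a small epsilon avoids division by zero
eps = 1e-9
provider_feats["Visit_Ratio_OP_to_Total"] = (
    provider_feats["Outpatient_Visits"] / (provider_feats["Total_Visits"] + eps)
)
```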
Exploratory Data Analysis (EDA)
Class Imbalance
The vast majority of providers were non-fraudulent. With this imbalance, a model can achieve high accuracy while detecting little fraud, so we focused on metrics like recall and precision rather than accuracy alone.
Total Reimbursed (Boxplot)
Fraudulent providers tend to submit much higher reimbursement claims. Several extreme outliers suggest potential over-billing, a classic fraud indicator.
Here, the median and upper quartile for fraudulent providers are notably higher than for non-fraudulent ones, a red flag for billing irregularities.
Total Visits vs. Reimbursed (Scatter Plot)
In this log-scaled scatterplot, most fraudulent providers appear in the upper-right quadrant, meaning high visit volume and high reimbursement, further supporting our hypothesis.
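Plots of this kind can be reproduced with seaborn along the following lines; `provider_df` is a placeholder name for the provider-level feature table joined with the `PotentialFraud` label.

```python
import matplotlib.pyplot as plt
import seaborn as sns

fig, axes = plt.subplots(1, 2, figsize=(12, 4))

# Boxplot: total reimbursement by fraud label (log scale tames the outliers)
sns.boxplot(data=provider_df, x="PotentialFraud", y="Total_Reimbursed", ax=axes[0])
axes[0].set_yscale("log")

# Log-log scatter: visit volume vs. total reimbursement, coloured by label
sns.scatterplot(
    data=provider_df, x="Total_Visits", y="Total_Reimbursed",
    hue="PotentialFraud", alpha=0.5, ax=axes[1],
)
axes[1].set_xscale("log")
axes[1].set_yscale("log")

plt.tight_layout()
plt.show()
```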
Model Selection
We trained and compared the following supervised models:
- Logistic Regression (baseline)
- Gradient Boosting Classifier
- XGBoost (tuned with class weights)
- LightGBM (tuned with class weights)
- Stacked Model (XGBoost + LightGBM); a stacking sketch follows this list
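A minimal sketch of the stacked ensemble using scikit-learn's `StackingClassifier`; the logistic-regression meta-learner and the hyperparameters shown are assumptions, not the exact configuration used.

```python
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier

# XGBoost and LightGBM as base learners, logistic regression as meta-learner
stack = StackingClassifier(
    estimators=[
        ("xgb", XGBClassifier(n_estimators=300, max_depth=4, eval_metric="logloss")),
        ("lgbm", LGBMClassifier(n_estimators=300, num_leaves=31)),
    ],
    final_estimator=LogisticRegression(max_iter=1000),
    stack_method="predict_proba",  # meta-learner sees out-of-fold probabilities
    cv=5,
)
# stack.fit(X_train, y_train)
```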
To handle imbalance:
- Used `class_weight='balanced'` or `scale_pos_weight` (see the sketch below)
- Tuned the decision threshold (e.g., 0.65) to balance precision against recall
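In code, the weighting looks roughly like this; training matrices `X_train`/`y_train` with 0/1 labels are assumed, and the hyperparameters are illustrative.

```python
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier

# Weight the rare fraud class by the negative-to-positive ratio of the labels
pos_weight = (y_train == 0).sum() / (y_train == 1).sum()

xgb = XGBClassifier(scale_pos_weight=pos_weight, eval_metric="logloss")
lgbm = LGBMClassifier(class_weight="balanced")

xgb.fit(X_train, y_train)
lgbm.fit(X_train, y_train)
```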
Model Comparison
| Model | Accuracy | Precision | Recall | F1 Score |
|---|---|---|---|---|
| Logistic Regression | 91.5% | n/a | <50% | n/a |
| Gradient Boosting Classifier | 92.3% | 76.0% | 56.0% | n/a |
| XGBoost (Tuned) | 92.3% | 57.6% | 67.3% | 62.1% |
| LightGBM (Tuned) | 91.7% | 54.3% | 69.3% | 60.9% |
| Stacked Model | 93.3% | 69.3% | 51.5% | 59.1% |
- Best recall: LightGBM
- Best balance: XGBoost
- Stacked Model: high precision, but low recall (not ideal for fraud detection)
Conclusion: LightGBM had the best recall, while XGBoost offered the best overall balance. The stacked model had higher precision but lower recall, making it less suitable for a fraud use case where recall is critical.
---
Threshold Tuning (XGBoost)
We adjusted the classification threshold from 0.5 to 0.65 for XGBoost:
- Precision: 64.5%
- Recall: 59.4%
- F1 Score: 61.9%
Raising the threshold traded some recall for higher precision, reducing false positives while still catching a majority of fraudulent providers.
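The threshold adjustment itself is a one-liner on top of predicted probabilities; a sketch, assuming the fitted `xgb` model from the earlier snippet and a held-out `X_test`/`y_test` split:

```python
from sklearn.metrics import precision_score, recall_score, f1_score

# Flag a provider only when the predicted fraud probability exceeds 0.65
proba = xgb.predict_proba(X_test)[:, 1]
y_pred = (proba >= 0.65).astype(int)

print("Precision:", precision_score(y_test, y_pred))
print("Recall:   ", recall_score(y_test, y_pred))
print("F1 Score: ", f1_score(y_test, y_pred))
```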
Confusion Matrices
Confusion matrices were plotted for the tuned models, the threshold-tuned XGBoost, and the stacked model.
ROC & PR Curves
ROC Curve
Both XGBoost and LightGBM demonstrate strong AUC performance, indicating good separation between fraud and non-fraud.
Precision-Recall Curve
PR curves highlight how XGBoost sustains higher precision across recall levels, which is crucial when fraudulent cases are rare.
Feature Importance
- `Total_Reimbursed` is the strongest fraud signal for both models.
- XGBoost emphasizes `Avg_OP_Claim` and `Claim_Ratio_OP_to_Total`.
- LightGBM leans toward visit ratios and inpatient claims.
These align with fraud intuition: excessive billing or unusually skewed claim distributions raise red flags.
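These rankings come straight from the fitted models; for example, XGBoost's built-in importances can be listed like this (assuming the `xgb` model from earlier and that `X_train` is a pandas DataFrame):

```python
import pandas as pd

# Rank features by the model's built-in importance scores
importances = (
    pd.Series(xgb.feature_importances_, index=X_train.columns)
    .sort_values(ascending=False)
)
print(importances.head(10))
```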
SHAP for Explainability
Why SHAP?
SHAP (SHapley Additive exPlanations) quantifies how much each feature pushes an individual prediction toward fraud or non-fraud.
Log Odds for Classification
SHAP values in classification reflect log odds:
- A positive value pushes the prediction toward fraud
- A negative value pushes it toward non-fraud
- Values are additive, starting from a base value (the average model output)
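A minimal SHAP sketch for the XGBoost model (the same pattern works for LightGBM); `xgb` and `X_test` are assumed from the earlier sketches.

```python
import shap

# TreeExplainer is exact and fast for gradient-boosted trees;
# for a binary classifier the resulting values are in log-odds space
explainer = shap.TreeExplainer(xgb)
shap_values = explainer.shap_values(X_test)

# Beeswarm summary: ranks features and shows the direction of their effect
shap.summary_plot(shap_values, X_test)
```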
SHAP Summary: XGBoost
Key drivers:
- `Total_Reimbursed`
- `Avg_OP_Claim`
- `Claim_Ratio_OP_to_Total`
SHAP Summary: LightGBM
LightGBM reveals different nuances:
- `Total_Reimbursed` still dominates
- Variability in monthly claim patterns plays a major role
Conclusion
This project shows how supervised machine learning can improve fraud detection:
- Engineered meaningful, interpretable features
- Tuned multiple models and thresholds
- Used SHAP to make model outputs explainable
XGBoost provided the best tradeoff between precision and recall, making it our recommended model for deployment.
Future Work
- Integrate autoencoders or isolation forests as anomaly filters
- Add time-series features to detect seasonal fraud trends
- Use cost-sensitive learning to penalize missed fraud cases
- Deploy models via real-time APIs or batch prediction systems
## GitHub Repository
[GitHub Link]