Healthcare Fraud Detection Using Machine Learning
Introduction
Healthcare fraud is a major concern for healthcare providers, insurers and governments worldwide. It involves illegal activities, such as billing for services that were not provided, duplicate submission of a claim for the same service, misrepresenting the service provided, charging for a more complex or expensive service than was provided or billing for a covered service when the service provided was not covered.
Detecting healthcare fraud is a challenging and resource-heavy task, particularly with large datasets. Fortunately, machine learning (ML) offers an effective solution by automating the detection process. It enhances the efficiency of identifying fraudulent activities while minimizing manual efforts. This automation leads to significant cost reductions and faster detection times. Ultimately, ML empowers healthcare organizations to better combat fraud and improve overall operational effectiveness.
In my capstone project, I leverage machine learning techniques to detect potential fraud in healthcare claims. By analyzing historical data, I aim to identify patterns that distinguish fraudulent activities from legitimate claims. The project involves comparing various models, including Logistic Regression, Decision Tree, Random Forest, XGBoost, LightGBM, and CatBoost, both with and without hyperparameter tuning. Through this approach, I strive to build an effective fraud detection system that can assist insurers in mitigating financial risks while ensuring legitimate claims are processed efficiently.
In this blog post, I’ll walk through my project objectives, data preprocessing, feature engineering, exploratory data analysis, model evaluation and key findings. Let’s dive into the fascinating world of machine learning-powered fraud detection.
Dataset:
The dataset used in this project was uploaded to Kaggle by Rohit Anand Gupta. It consists of four distinct sub-datasets: Inpatient, Outpatient, Beneficiary, and Fraud labels, each described in the graph below.
Dataset link: https://www.kaggle.com/rohitrox/healthcare-provider-fraud-detection-analysis
Project Objectives:
- Identify fraudulent activities in healthcare claims and transactions. This helps reduce financial losses, improve efficiency and ensure ethical medical practices.
- Develop predictive models that classify providers as fraudulent or non-fraudulent based on claim patterns.
- Evaluate model performance using metrics such as precision, recall, F1-score, and ROC-AUC to ensure reliable fraud detection.
Data Preprocessing:
In fraud detection models, robust data preprocessing is critical to uncovering hidden patterns and ensuring reliable predictions. The graph outlines a structured approach to preparing raw data for analysis, focusing on mitigating common challenges like missing values, inconsistent scales, and categorical data handling.
Below is a breakdown of each key step: handling missing values, label encoding, data aggregation, data scaling, and class-weight adjustments:
This preprocessing pipeline ensures data quality, addresses class imbalance, and transforms raw data into actionable insights, key steps for building accurate fraud detection systems. By streamlining provider-level analysis and standardizing inputs, models can better distinguish legitimate claims from fraudulent ones, ultimately improving operational efficiency and reducing financial losses.
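As a concrete illustration, the pipeline steps above can be sketched with pandas and scikit-learn. The tiny sample frame and column names below are illustrative stand-ins, not the project's actual schema:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, StandardScaler

# Hypothetical mini-sample of claim records (column names are illustrative)
claims = pd.DataFrame({
    "Provider": ["P1", "P1", "P2", "P2"],
    "Gender": ["M", "F", "F", None],
    "InscClaimAmtReimbursed": [1200.0, None, 300.0, 450.0],
})

# 1. Handle missing values: numeric gaps get the median, categorical a sentinel
claims["InscClaimAmtReimbursed"] = claims["InscClaimAmtReimbursed"].fillna(
    claims["InscClaimAmtReimbursed"].median())
claims["Gender"] = claims["Gender"].fillna("Unknown")

# 2. Label-encode categorical columns
claims["Gender"] = LabelEncoder().fit_transform(claims["Gender"])

# 3. Aggregate claim-level rows up to the provider level
provider = claims.groupby("Provider").agg(
    Claims_Total=("InscClaimAmtReimbursed", "count"),
    Reimbursed_Mean=("InscClaimAmtReimbursed", "mean"),
).reset_index()

# 4. Scale numeric features so large dollar amounts do not dominate
scaler = StandardScaler()
provider[["Claims_Total", "Reimbursed_Mean"]] = scaler.fit_transform(
    provider[["Claims_Total", "Reimbursed_Mean"]])
```

Class-weight adjustments happen later, at model-fitting time, so they are shown in the tuning section rather than here.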
Feature Engineering:
Feature engineering is a pivotal step in refining raw data into meaningful predictors for fraud detection. The graph below illustrates a strategic approach to creating and dropping features, ensuring the model focuses on the most relevant and reliable indicators while minimizing noise.
By prioritizing interpretable, high-quality features and eliminating unreliable or redundant ones, this process sharpens the model’s ability to detect subtle fraud signals. For example, ReimbursementPerDay might expose unusually high daily payouts, while dropping sparse features reduces overfitting risks. This curated feature set balances domain relevance with computational efficiency, laying the groundwork for a more accurate and actionable fraud detection system.
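A minimal sketch of this kind of feature creation and pruning, using hypothetical column names rather than the project's real schema:

```python
import pandas as pd

# Illustrative provider-level frame; column names are assumptions
df = pd.DataFrame({
    "TotalReimbursed": [9000.0, 2000.0],
    "DaysInHospital": [30, 10],
    "Sparse_Feature": [None, None],   # almost entirely missing
})

# Derived ratio feature: average payout per inpatient day
df["ReimbursementPerDay"] = df["TotalReimbursed"] / df["DaysInHospital"]

# Drop features that are too sparse to be reliable predictors
sparse_cols = [c for c in df.columns if df[c].isna().mean() > 0.9]
df = df.drop(columns=sparse_cols)
```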
Exploratory Data Analysis:
Class Distribution Analysis:
Understanding the balance or imbalance of classes in a dataset is crucial for building effective fraud detection models. The graph below highlights the distribution of potential fraud cases, revealing a stark disparity between legitimate and fraudulent claims. This imbalance poses unique challenges and considerations for model training and evaluation.
The graph highlights a significant class imbalance in the dataset where the majority class, "Not Fraud," comprises 4,904 instances (90.65%), while the minority class, "Fraud," consists of only 506 instances (9.35%). This imbalance underscores a common challenge in fraud detection; machine learning models may become biased toward the majority class, leading to poor performance in identifying fraudulent claims. The graph serves as a crucial reminder that raw data often requires careful balancing techniques to enhance model effectiveness. Addressing this imbalance is not merely a technical step but a fundamental prerequisite for developing reliable, high-performing fraud detection systems that can accurately identify and mitigate financial risks.
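The reported split can be checked with a quick `value_counts`; the labels below simply reproduce the counts stated above:

```python
import pandas as pd

# Reconstruct the class counts reported in the analysis
labels = pd.Series(["Not Fraud"] * 4904 + ["Fraud"] * 506)

counts = labels.value_counts()
shares = labels.value_counts(normalize=True).round(4)
```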
Inpatient Stay Duration Analysis:
The analysis reveals a striking difference in inpatient stay durations between fraudulent and non-fraudulent claims. Fraudulent claims are associated with significantly longer hospital stays, averaging around 30 days, compared to just 10 days for legitimate claims. This disparity suggests that extended stays may serve as a potential red flag for fraud, often linked to inflated billing, unnecessary treatments, or fabricated medical records. By incorporating inpatient duration as a key fraud detection metric, machine learning models can prioritize investigating claims with unusually long stays, while insurers can implement targeted audits to curb fraudulent practices. This insight not only enhances fraud detection accuracy but also reinforces accountability in healthcare billing, ensuring resources are allocated efficiently and ethically.
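A comparison like this typically comes from a per-class aggregation. A toy sketch, with made-up records chosen to mirror the reported averages:

```python
import pandas as pd

# Illustrative inpatient records (not the real data)
stays = pd.DataFrame({
    "PotentialFraud": ["Yes", "Yes", "No", "No"],
    "DaysInHospital": [28, 32, 9, 11],
})

# Average stay duration per fraud label
avg_stay = stays.groupby("PotentialFraud")["DaysInHospital"].mean()
```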
Claim Duration Analysis:
In healthcare fraud detection, the duration of claims often serves as a critical indicator of irregularities. The graph below compares the density distribution of average claim durations between fraudulent and non-fraudulent cases, revealing distinct patterns that can help identify suspicious billing practices.
The distribution of claim durations reveals key differences between fraudulent and non-fraudulent claims. Non-Fraudulent Claims (No) peak around 0–5 days and sharply decline beyond 20 days, indicating that most are resolved efficiently. In contrast, fraudulent claims exhibit a broader spread, with a secondary peak around 0–20 days and a higher density extending up to 40 days. This pattern suggests that fraudsters may manipulate claim durations through tactics that include inflating treatment timelines, delaying processing to obscure anomalies or fabricating records for unnecessary services.
While not all long-duration claims indicate fraud, combining this metric with other features, such as reimbursement amounts and patient deductible amounts, enhances detection accuracy. By integrating claim duration as a key indicator, fraud detection models can flag suspicious patterns more effectively. Insurers can then implement targeted audits, ultimately reducing financial losses and ensuring ethical billing practices.
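One way to quantify the tail difference described above is to compare upper quantiles of the two duration distributions. The sketch below uses synthetic durations shaped like the pattern in the graph, not the real data:

```python
import numpy as np

rng = np.random.default_rng(0)
# Non-fraud durations peak near 0-5 days and decay quickly
non_fraud = rng.exponential(scale=5, size=1000)
# Fraud durations mix a similar early peak with a heavier 20-40 day tail
fraud = np.concatenate([rng.exponential(scale=5, size=500),
                        rng.uniform(20, 40, size=500)])

# Compare 90th percentiles: fraudulent durations extend much further
q90_non_fraud = np.quantile(non_fraud, 0.9)
q90_fraud = np.quantile(fraud, 0.9)
```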
Reimbursed Amount Analysis:
In the fight against healthcare fraud, understanding reimbursement patterns is key to identifying suspicious activities. The graph below offers a revealing snapshot of how claim amounts vary across fraud classifications, providing actionable insights for detection strategies.
The "Reimbursed Amount Distribution by Fraud Status" graph reveals that potentially fraudulent claims tend to have significantly higher reimbursements and more extreme outliers compared to non-fraudulent claims. This suggests that fraudsters often target larger payouts, making high-value claims a key area for scrutiny. The wider distribution in the Potential Fraud category underscores the need for robust fraud detection systems to identify anomalies early and prevent financial losses in healthcare reimbursements.
Numerical Variables Analysis:
The graph below reveals strong positive relationships among inpatient claims, diagnoses, and procedures, indicating that higher inpatient claim volumes often involve multiple diagnoses and treatments.
A notable correlation between IP_Average_claim_duration and IP_Averagedaysinhospital suggests that longer hospital stays lead to extended claim durations. Additionally, IPAnnualDeductibleAmt shows a moderate correlation with claim duration, implying that higher deductibles may influence claim processing times. Outpatient claims exhibit weaker correlations overall, highlighting distinct patterns compared to inpatient claims. These insights help refine fraud detection strategies and improve healthcare claim management.
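Correlations like these are typically read off a pandas correlation matrix. A small sketch with synthetic features (two of the column names above are reused for illustration; the numbers are not the project's):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 200
days = rng.normal(10, 3, n)
df = pd.DataFrame({
    "IP_Averagedaysinhospital": days,
    "IP_Average_claim_duration": days + rng.normal(0, 1, n),  # strongly related
    "OP_Claims_Total": rng.normal(50, 10, n),                 # unrelated
})

# Pairwise Pearson correlations, as plotted in a heatmap
corr = df.corr()
```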
Classification Tasks: Model Building and Tuning:
Models were trained with and without tuning on selected features, with rigorous evaluation metrics like accuracy, precision, recall, F1-score, and ROC-AUC to balance false positives (legitimate claims misflagged) and false negatives (fraud missed). This multi-model approach ensures reliable detection, enabling healthcare systems to curb financial losses, streamline audits, and uphold ethical standards by targeting high-risk providers.
The “Tuned Model Performance Comparison” graph shows that CatBoost achieves the best fraud detection performance with a ROC-AUC score of 0.9617, making it the most effective model for distinguishing fraudulent claims. While all models show high accuracy and ROC-AUC, variations in Precision and Recall indicate trade-offs in detecting fraud cases. Tree-based models (XGBoost, Random Forest, LightGBM) perform competitively. In contrast, Logistic Regression and Decision Tree exhibit slightly lower Recall, impacting fraud sensitivity. These results emphasize the importance of hyperparameter tuning and selecting the right model to enhance fraud detection accuracy and efficiency in healthcare claims.
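The training-and-scoring loop behind a comparison like this can be sketched as follows. Synthetic imbalanced data stands in for the provider-level features, and two scikit-learn models stand in for the full lineup (CatBoost, XGBoost, and LightGBM would slot into the same dict):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic 90/10 imbalanced data as a stand-in for the real features
X, y = make_classification(n_samples=2000, n_features=20, weights=[0.9, 0.1],
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

models = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "RandomForest": RandomForestClassifier(n_estimators=100, random_state=0),
}

# Fit each model and collect the same metrics used in the comparison graph
scores = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    proba = model.predict_proba(X_te)[:, 1]
    scores[name] = {"roc_auc": roc_auc_score(y_te, proba),
                    "f1": f1_score(y_te, model.predict(X_te))}
```

Hyperparameter tuning would wrap each estimator in a `GridSearchCV` or `RandomizedSearchCV` before this loop.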
CatBoost Performance Evaluation:
One of the best-performing models was the CatBoost model, achieving the highest ROC-AUC score among all attempted models. This indicates its superior ability to distinguish fraudulent from non-fraudulent claims.
Confusion Matrix
The low recall rate highlights a critical gap: nearly half of fraudulent claims risk going undetected. To address this risk, organizations must strategically balance precision and recall by adjusting classification thresholds, refining preprocessing, and aligning priorities with operational realities. If resources are constrained, the tuned model (high precision) ensures auditors focus on high-confidence fraud cases. Conversely, if minimizing undetected fraud is paramount, especially for high-cost claims, the untuned model (higher recall) becomes vital even with more false positives. Ultimately, the choice hinges on whether the priority is precision-driven efficiency or recall-driven risk mitigation. By tailoring approaches to organizational needs, healthcare systems can optimize detection frameworks to safeguard resources while also adapting to evolving fraud challenges.
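Threshold adjustment, the main lever for trading precision against recall here, can be sketched as follows; the data and the 0.2 cutoff are illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=1)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
proba = clf.predict_proba(X_te)[:, 1]

# Default 0.5 cutoff vs. a lowered cutoff that favors recall over precision
pred_default = (proba >= 0.5).astype(int)
pred_lowered = (proba >= 0.2).astype(int)

recall_default = recall_score(y_te, pred_default)
recall_lowered = recall_score(y_te, pred_lowered)
```

Lowering the cutoff can only add positive predictions, so recall never decreases; the cost is more false positives for auditors to triage.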
Feature Importance Analysis (Tree-Based Models):
Understanding which factors drive fraud detection is crucial for refining healthcare auditing strategies. The graphs below compare feature importance scores from three machine learning models—Decision Tree, Random Forest, and XGBoost—revealing which variables most influence predictions of fraudulent claims. These insights help prioritize audit criteria and uncover systemic vulnerabilities.
The graphs above reveal that key features, such as extended hospital stays, higher deductible amounts, and claims volume, are strong indicators of potential fraud. Inpatient and outpatient claims, along with diagnoses and procedures, are consistently important across models, with IP_Claims_Total (total number of inpatient claims) emerging as the top predictor. In this context, XGBoost performs the best, highlighting the significance of financial metrics, claim duration, and medical procedures in fraud detection. To improve model efficiency, prioritizing features like IP_Claims_Total, OP_Claims_Total (total number of outpatient claims), and diagnosis-related variables is recommended. Simultaneously, addressing data quality issues will further enhance accuracy and reduce false positives.
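Tree-based importances such as these are exposed directly by scikit-learn estimators. In this sketch a RandomForest stands in for the three models, and the feature names are hypothetical labels attached to synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Hypothetical feature labels for illustration only
feature_names = ["IP_Claims_Total", "OP_Claims_Total", "DaysInHospital",
                 "DeductibleAmt", "NumDiagnoses"]
X, y = make_classification(n_samples=1000, n_features=5, n_informative=3,
                           random_state=0)
forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# Impurity-based importances, normalized to sum to 1
importances = dict(zip(feature_names, forest.feature_importances_))
ranked = sorted(importances, key=importances.get, reverse=True)
```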
Recommendations to Optimize Fraud Detection Models:
Address Class Imbalance: Use techniques like SMOTE or fraud-focused sampling to balance the data, improving the model's ability to detect fraudulent activities.
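A hedged sketch of rebalancing: imblearn's SMOTE interpolates synthetic minority samples, but the same rebalancing idea can be shown dependency-free with simple random oversampling via scikit-learn:

```python
import numpy as np
from sklearn.utils import resample

rng = np.random.default_rng(0)
# Synthetic 900/100 imbalanced feature matrices
X_majority = rng.normal(size=(900, 4))
X_minority = rng.normal(loc=2.0, size=(100, 4))

# Oversample the minority class up to the majority size
# (SMOTE would generate interpolated samples instead of duplicates)
X_min_up = resample(X_minority, n_samples=900, replace=True, random_state=0)
X_bal = np.vstack([X_majority, X_min_up])
y_bal = np.array([0] * 900 + [1] * 900)
```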
Threshold Adjustment & Optimization: Adjust the classification threshold to increase recall, prioritizing fraud detection while balancing false positives. Align the threshold with business objectives, especially when fraud investigations are resource-intensive, to ensure that potential fraud cases receive necessary attention.
Cost-Sensitive Training: Incorporate cost-sensitive learning by assigning a higher penalty to false negatives (missed fraud) than false positives. This approach ensures that the model prioritizes detecting fraudulent cases, reducing the risk of undetected fraud while maintaining an acceptable level of false positives.
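In scikit-learn, cost-sensitive training can be approximated with `class_weight`, which scales each class's contribution to the loss; the 10x penalty below is an illustrative choice, not a value from the project:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score

X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=2)

plain = LogisticRegression(max_iter=1000).fit(X, y)
# Penalize missed fraud (false negatives) 10x more than false alarms
weighted = LogisticRegression(max_iter=1000,
                              class_weight={0: 1, 1: 10}).fit(X, y)

recall_plain = recall_score(y, plain.predict(X))
recall_weighted = recall_score(y, weighted.predict(X))
```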
Hybrid Approaches: In fraud detection, combining multiple models enhances performance by leveraging their strengths. Techniques like stacking integrate models such as Logistic Regression, Decision Trees, Random Forest and XGBoost with a meta-model like LightGBM to refine predictions. Rule-based systems can also be combined with machine learning to filter obvious fraud cases while ML models detect complex patterns. While using more than two models can improve accuracy, excessive complexity may lead to diminishing returns, increased computational cost and reduced interpretability. Practitioners typically start with a few models and expand only if the performance gains justify the added complexity.
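A stacking ensemble along these lines can be sketched with scikit-learn's `StackingClassifier`; a logistic meta-model stands in for LightGBM so the sketch carries no extra dependencies:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1500, weights=[0.9, 0.1], random_state=3)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=3)

# Base learners feed out-of-fold predictions to a meta-model
stack = StackingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("dt", DecisionTreeClassifier(max_depth=5, random_state=3)),
        ("rf", RandomForestClassifier(n_estimators=100, random_state=3)),
    ],
    final_estimator=LogisticRegression(max_iter=1000),
)
stack.fit(X_tr, y_tr)
auc = roc_auc_score(y_te, stack.predict_proba(X_te)[:, 1])
```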
Monitor Business Impact: Track key metrics like investigation efficiency (precision) and fraud detection rate (recall) to ensure model performance aligns with organizational priorities and provides meaningful, actionable insights.
By following these recommendations, organizations can refine fraud detection strategies, improve recall, and optimize resource allocation for investigations.
Future Work
While this capstone project lays a robust foundation for detecting healthcare fraud through machine learning, the journey toward a fully optimized, scalable, and ethically sound system is ongoing. Future efforts will focus on the following:
Deep Learning Models: Explore the use of deep learning models, which can better handle imbalanced data and improve the detection of complex fraud patterns.
Cost-Benefit Analysis: Conduct a cost-benefit analysis of the models evaluated, using a Custom_Cost function informed by domain expertise, to assess the economic implications of fraud detection strategies.
Feature Interpretation with SHAP or LIME: Utilize SHAP (SHapley Additive Explanations) or LIME (Local Interpretable Model-Agnostic Explanations) to gain deeper insights into how individual features influence fraud detection, enhancing transparency and trust in the model.
AutoML & Hyperparameter Optimization: Leverage AutoML tools to automate and continuously optimize hyperparameters, ensuring the models maintain peak performance over time.
Integration with BI Dashboards: Develop fraud detection dashboards in Power BI, Tableau, or Streamlit, enabling real-time monitoring of fraudulent claims and streamlining decision-making processes.