Healthcare fraud detection: Capstone Project

Posted on Aug 28, 2021

Money lost due to healthcare fraud is a lot especially in America, because healthcare is not free there. According to the CDC, the United States spends over 3 trillion dollars per year on healthcare. At this scale, even minor inefficiencies can be costly. The National Health Care Anti-Fraud association estimates that at least 3% of spending is lost due to healthcare fraud. This amounts to more than $68 billion in unnecessary costs per year. These costs are passed on the the customer. On an individual basis, American citizens paid an avearge of $11,172 in healthcare related expenses in 2018, indicating a loss of $330 per person, per year. As such, healthcare fraud detection is a major concern.

Detecting Healthcare Fraud Capstone Project Overview

For my capstone project, I used machine learning to identify Medicare fraud using the Kaggle Healthcare Provider Fraud Detection Analysis dataset. The dataset includes multiple documents including 138k anonymized patient profiles, along with 40k inpatient and 500k claims records. Each provider is given a binary classification indicating if the institution was suspected of having committed fraud. My goal was to develop a high efficiency model for detecting new instances of fraud that minimizes the rate of false negatives.


Exploratory data analysis demonstrates that 9% of outpatient service providers are suspected of having committed Medicare fraud. For inpatient services, the problem is far worse (red): roughly 20% of providers are suspected of fraud.

Exploratory data analysis demonstrates that 9% of outpatient service providers are suspected of having committed Medicare fraud. For inpatient services, the problem is far worse (red): roughly 20% of providers are suspected of fraud.

Unlike the previous data sets we examined which were essentially single flat files, the Medicare fraud data is essentially a relational database. Each provider serviced many patients. As a result, I had to engineer a variety of new features to distill various patterns into a into a single data field - one per provider. As such, it was important to test variety of calculations. I first looked at the distribution of treatment costs per patient, with the expectation that fraudulent providers might overcharge. Indeed, I found that the providers suspected of fraud had a higher mean, as well as several outlier charges in the outpatient group that exceed the highest values in the non-fraudulent provider group.

I also found that the number of inpatient (left graph) and out patient (right graph) readmissions directly correlated with fraud (possibly fraudulent providers are in orange, non-suspected providers in blue), consistent with the idea that fraudulent providers might unnecessarily re-admit their patients.

Interestingly, the incidence of readmittance scaled linearly with the overall number of patients, which seemed to suggest that re-admission was not a relevant predictor or fraud. However, once the number of patients exceeds a certain threshold, it becomes almost certain that fraud will occur (demonstrated below using SVM). This could potentially be due to high throughput providers anticipating that fraudulent charges will not be discovered given their large volume of claims. Irrespective of the reason, it seemed that these features should all be useful predictors.

I also engineered several other fields including the type and frequency of chronic illnesses being treated, race, gender, and duration of hospital stay. All told, there were 39 fields that were initially used to train the models.

Model testing

I initially tested a support vector machine model, and conducted a grid search using linear, polynomial, or rbf kernel, and a range of penalty parameters. This demonstrated that a rather hard margin (C = 1000) was sufficient for category segregation, indicating that the fraudulent and non-fraudulent groups could be clearly separated for most predictors. In this example, repeated from above, the number of patients (x-axis) and hospital readmissions (y-axis) are strong predictors of fraud with a clear cutoff around the 100 patient cutoff.

SVM Decision Region Boundary

Ultimately, the SFM training and test scores were both closely correlated demonstrating a lack of overfitting, and a reasonable accuracy with R-squared values of 0.936 and 0.930, respectively. However, despite the high R-squared values, a confusion matrix showed that the minority class (Fraud) was only detected at 68% efficiency.

To see I could further improve on the SVM model, I trained a gradient boosting model. Even before optimization, the XGB model yielded a better fit with R-squared values of 0.980 for the training data and 0.946 for the test data.

However, despite the high R-squared values, a confusion matrix showed that the minority class (Fraud) was only detected at 68% efficiency.

The low efficiency of the initial model is likely due to the low relative abundance of the Fraud class, suggesting that oversampling should be used to balance the training data. Accordingly, I used SMOTE to resample the data, then retrained the model. While SMOTE didn’t improve the R-squared value of the model, it reduced the false negative rate by a remarkable 73%. Under this model, 91.4% of fraudulent charger can be accurately identified.

Over the course of optimizing the the XGB model, I tried removing apparently unnecessary feature such as chronic conditions like osteoporosis or stroke, and race-related patient information. Though only 12-16 features were of primary importance, it became clear that the low importance features are critical for reducing false negative calls. This observation offers insight into some of the more important criteria for identifying fraud.


The identification of 90% of fraudulent charges should provide a substantial savings for insurers, and by extension, customers. In this dataset, the average claim is $9605 for inpatient procedures among the providers suspected of fraud but only $3363 for non suspicious providers. Likewise, charges are $473 and $262 respectively, for outpatient procedures. Assuming that an investigation takes 5 hours of work by a single investigator making $20/hour, the cost is roughly $100 per investigation. Given the potentially low cost of an investigation and the high cost of fraud, insurers are highly incentivized to conduct investigations, especially for inpatient cases.

Fraudulent providers also tend to submit 29.5% more claims than non-fraudulent counterparts (20,623 versus 15,993, respectively). As there are relatively fewer fraudulent providers (506) relative to the non-fraudulent providers (4904), the claims per provider is 12 times higher for suspicious provider group (40.7 cases/fraudulent provider divided by 3.26). Using the assumption that 10% of the fraudulent provider’s claims are spurious, there are 2063 cases of fraud documented here, amounting to roughly $20M per year (2063 cases * $9,605/case).

By extension, applying the XGB model to detect fraud at 90% efficacy should result in a savings of $17M/year minus the cost of the investigation (2063 * $100 = $206k) which is comparatively insignificant. Even in the case of inpatient investigations where the savings per case is lower, the increased number of cases (104k) indicates a potential net savings of $3.9M after the total savings of $4.9M (10% of 104k cases * $473/case) and investigation costs of $1M.

Major insights

Our modeling suggest that insurers can fight insurance fraud by closely monitoring:

  • Cases of extended or repeated hospital stays
  • Treatment of certain chronic illnesses
  • Monthly total billing that exceeds threshold values
  • Providers with excessive patient throughput
  • Individuals who receive unusually high medical bills

Apply for the Upcoming NYC Data Science Bootcamp

The first step in becoming a data scientist is to complete your Data Science Bootcamp Application. Just click the button to apply. It's free and will only take you about 5 minutes.

The skills I demoed here can be learned through taking Data Science with Machine Learning bootcamp with NYC Data Science Academy.

About Author

Leave a Comment

No comments found.

View Posts by Categories

Our Recent Popular Posts

View Posts by Tags

#python #trainwithnycdsa 2019 2020 Revenue 3-points agriculture air quality airbnb airline alcohol Alex Baransky algorithm alumni Alumni Interview Alumni Reviews Alumni Spotlight alumni story Alumnus ames dataset ames housing dataset apartment rent API Application artist aws bank loans beautiful soup Best Bootcamp Best Data Science 2019 Best Data Science Bootcamp Best Data Science Bootcamp 2020 Best Ranked Big Data Book Launch Book-Signing bootcamp Bootcamp Alumni Bootcamp Prep boston safety Bundles cake recipe California Cancer Research capstone car price Career Career Day citibike classic cars classpass clustering Coding Course Demo Course Report covid 19 credit credit card crime frequency crops D3.js data data analysis Data Analyst data analytics data for tripadvisor reviews data science Data Science Academy Data Science Bootcamp Data science jobs Data Science Reviews Data Scientist Data Scientist Jobs data visualization database Deep Learning Demo Day Discount disney dplyr drug data e-commerce economy employee employee burnout employer networking environment feature engineering Finance Financial Data Science fitness studio Flask flight delay gbm Get Hired ggplot2 googleVis H20 Hadoop hallmark holiday movie happiness healthcare frauds higgs boson Hiring hiring partner events Hiring Partners hotels housing housing data housing predictions housing price hy-vee Income Industry Experts Injuries Instructor Blog Instructor Interview insurance italki Job Job Placement Jobs Jon Krohn JP Morgan Chase Kaggle Kickstarter las vegas airport lasso regression Lead Data Scienctist Lead Data Scientist leaflet league linear regression Logistic Regression machine learning Maps market matplotlib Medical Research Meet the team meetup methal health miami beach movie music Napoli NBA netflix Networking neural network Neural networks New Courses NHL nlp NYC NYC Data Science nyc data science academy NYC Open Data nyc property NYCDSA NYCDSA Alumni Online Online Bootcamp Online Training Open Data painter pandas Part-time performance phoenix pollutants Portfolio Development precision measurement prediction Prework Programming public safety PwC python Python Data Analysis python machine learning python scrapy python web scraping python webscraping Python Workshop R R Data Analysis R language R Programming R Shiny r studio R Visualization R Workshop R-bloggers random forest Ranking recommendation recommendation system regression Remote remote data science bootcamp Scrapy scrapy visualization seaborn seafood type Selenium sentiment analysis sentiment classification Shiny Shiny Dashboard Spark Special Special Summer Sports statistics streaming Student Interview Student Showcase SVM Switchup Tableau teachers team team performance TensorFlow Testimonial tf-idf Top Data Science Bootcamp Top manufacturing companies Transfers tweets twitter videos visualization wallstreet wallstreetbets web scraping Weekend Course What to expect whiskey whiskeyadvocate wildfire word cloud word2vec XGBoost yelp youtube trending ZORI