A Data Investigation of Healthcare Insurance Fraud

David Green
Posted on Oct 15, 2021
The skills I demonstrated here can be learned through the Data Science with Machine Learning bootcamp at NYC Data Science Academy.

Background

According to Blue Cross Blue Shield data, approximately 3-10% of US healthcare spending, or $68-$230 billion, is lost to fraudulent healthcare claims and their management. This is especially detrimental to those who rely on government assistance through programs like Medicare. Government resources are already limited, and alleviating the pressures created by fraud would allow more freedom to help those in need.

Fraudulent claims come in a variety of forms, including but not limited to: billing for services that were not provided, submitting duplicate claims for the same service, misrepresenting the service provided, charging for a more complex or expensive service than was actually provided, and even billing for a covered service when the service actually provided was not covered.

The Problem at Hand

This project is a proof of concept for using data science as a tool to improve upon existing fraud detection models, and subsequently save government programs like Medicare millions of dollars in insurance fraud management. You may find the code for this project on my GitHub.

Utilizing historical data from Medicare itself, exploratory data analysis and modeling were performed, bringing to light areas for improvement in the existing system. The Medicare data used for this analysis comes from a problem hosted on Kaggle, whose purpose is to identify problematic providers who consistently submit fraudulent claims to Medicare and those who are more likely to commit healthcare fraud. Understanding their behavioral patterns, and how those patterns relate to inpatient/outpatient care and claims, can help the healthcare system save resources and devote them to the people who need them.

Within the sample datasets, there are over 1,300 unique providers and ~140,000 unique beneficiaries, who submitted over 500,000 claims between November 2008 and December 2009. Data categories include fields like deductible paid, reimbursement amounts, provider identifiers, medical history, and other insurance-related descriptors.
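To ground the discussion, here is a minimal sketch of assembling the working table with pandas. The file and column names follow the Kaggle "Healthcare Provider Fraud Detection" datasets, but treat the exact filenames as assumptions:

```python
import pandas as pd

# Assumed filenames; the actual Kaggle files carry long numeric suffixes.
labels      = pd.read_csv("Train.csv")                 # Provider, PotentialFraud (Yes/No)
inpatient   = pd.read_csv("Train_Inpatientdata.csv")
outpatient  = pd.read_csv("Train_Outpatientdata.csv")
beneficiary = pd.read_csv("Train_Beneficiarydata.csv")

# Stack inpatient and outpatient claims, then attach the per-provider fraud
# flag and the beneficiary descriptors.
claims = pd.concat([inpatient, outpatient], ignore_index=True)
claims = claims.merge(labels, on="Provider").merge(beneficiary, on="BeneID")

print(claims["Provider"].nunique(), claims["BeneID"].nunique(), len(claims))
```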

Exploring the Data

[Figure: share of claim funds going to fraudulent vs. non-fraudulent claims]

The dataset was conveniently broken up into categories that included the amount reimbursed per claim and whether or not the claim was flagged for fraud. One of the first things that stood out once the data was cleaned was that, of the total funds destined for claims during the sample year, more than half would have gone to fraudulent claims. The difference isn't large enough to be statistically significant, but it is certainly large enough to warrant further investigation.
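The "more than half" observation comes from summing reimbursement dollars by fraud flag. A minimal sketch against the merged table, assuming its PotentialFraud and InscClaimAmtReimbursed columns:

```python
# Share of total reimbursement dollars going to flagged vs. unflagged claims.
totals = claims.groupby("PotentialFraud")["InscClaimAmtReimbursed"].sum()
print(totals / totals.sum())
```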

Differences Between Fraudulent and Non-Fraudulent Claims

[Figures: average claim amount and total claim amount, fraud vs. non-fraud]

As we continue to look at the differences between fraudulent and non-fraudulent claims, it is visually clear that the average amount of claims flagged as fraud is larger than the average non-fraud claim. This makes sense: a provider would want to make as much money as possible on a fraudulent claim, so on average you would expect those amounts to be higher.

Once again, these differences did not turn out to be statistically significant, owing to the very large variance around the means of the two classes, but the average claim amounts are used later in the analysis to evaluate the performance of this investigation. Similarly, the total amount of non-fraud claims was visually higher than that of claims flagged as fraudulent.
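The significance check here amounts to a two-sample comparison of mean claim amounts. A hedged sketch using Welch's t-test, which does not assume equal variances (the post does not name the exact test used):

```python
from scipy import stats

fraud = claims.loc[claims["PotentialFraud"] == "Yes", "InscClaimAmtReimbursed"]
valid = claims.loc[claims["PotentialFraud"] == "No",  "InscClaimAmtReimbursed"]

# Welch's t-test: robust to the very large variances noted above.
t_stat, p_value = stats.ttest_ind(fraud, valid, equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")
```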

Procedure and Diagnosis Code Data

In the figures above, the proportions of fraud and non-fraud claims are organized by procedure code (left) and diagnosis code (right). The procedure code figure shows the top 10 codes by money involved in the transactions. It is clear that for these codes, claims are flagged as fraudulent more often than not. Given that these transactions involve the most money, this makes sense: if a crime is being committed intentionally, one would want to reap the greatest reward from the transaction.

Looking at the diagnosis code figure, organized in a similar manner, non-fraud claims are more prevalent than fraudulent ones. This also makes sense: there is likely more money to be made treating a condition than diagnosing it, so fraudulent claims may be less common in that regard.
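One way to reproduce the procedure-code figure: rank codes by total dollars involved, then look at the fraud split within the top ten. The sketch below uses only the first procedure-code column (ClmProcedureCode_1) for brevity, which is an assumption:

```python
import pandas as pd

# Top 10 procedure codes by money involved, then the fraud split within each.
top_codes = (claims.groupby("ClmProcedureCode_1")["InscClaimAmtReimbursed"]
                   .sum().nlargest(10).index)

subset = claims[claims["ClmProcedureCode_1"].isin(top_codes)]
print(pd.crosstab(subset["ClmProcedureCode_1"],
                  subset["PotentialFraud"], normalize="index"))
```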

If we look at the top 20 physicians, by code, plotted against claim count (fraud vs. non-fraud), we can see that a number of physicians have high counts of claims flagged as fraudulent. It's possible this is simply due to the procedures they commonly perform, the particular field they are in, or even the practice they work for, but it does show that there are areas where fraud is clear and could theoretically be weeded out systematically.

Although there is a very large variance around the means of fraud and non-fraud claims, when claims are plotted against providers as a whole (above), we can clearly see that fraudulent claims are buried in a mountain of verifiable ones. There is no cut-and-dried way to distinguish the claim types from one another, so as investigators we have to be clever in our choice of tools.

Feature Engineering

Utilizing NetworkX, a ratio of physician connections to patient connections was established to determine whether providers exhibit a characteristic ratio that would flag them as committing fraud.

Unsupervised learning tools like NetworkX can be used to establish connections between data points that aren't readily apparent. In this case, it was used to form, for each provider, the ratio of the number of physicians associated with that provider to the number of patients associated with it. This ratio, visualized in the figures above, shows that for fraudulent claims, providers tend to be connected to more patients than physicians; however, this also holds for non-fraud claims, just to a lesser extent.
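A minimal sketch of the connection ratio, assuming the merged claims table and its Provider, AttendingPhysician, and BeneID columns. Building the graph explicitly with NetworkX mirrors the approach described, though a plain groupby would yield the same counts:

```python
import networkx as nx

# Connect each provider to the physicians and patients on its claims.
G = nx.Graph()
for _, row in claims[["Provider", "AttendingPhysician", "BeneID"]].dropna().iterrows():
    G.add_node(row["Provider"], kind="provider")
    G.add_node(row["AttendingPhysician"], kind="physician")
    G.add_node(row["BeneID"], kind="patient")
    G.add_edge(row["Provider"], row["AttendingPhysician"])
    G.add_edge(row["Provider"], row["BeneID"])

def physician_patient_ratio(provider):
    """Distinct physicians connected to a provider, divided by distinct patients."""
    kinds = [G.nodes[n]["kind"] for n in G.neighbors(provider)]
    n_phys, n_pat = kinds.count("physician"), kinds.count("patient")
    return n_phys / n_pat if n_pat else float("nan")
```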

Although this difference was not statistically significant, all the ratios were plotted as a KDE (bottom left) and by claim type (bottom right) to see whether some kind of bimodal shape would form, from which a ratio threshold for determining fraud could be established. Unfortunately, the figures only further established how hard the task at hand is: the fraudulent claims are so deeply ingrained and mixed in with the non-fraud claims that their density shapes almost completely overlap.

In addition to the NetworkX ratio, feature engineering included stratifying the target for an even class distribution across a 70/30 train/test split, and oversampling the training data with SMOTE, which improved model accuracy once the classes were balanced. Unnecessary categories were removed as well: fields like Beneficiary ID, Claim ID, operating and attending physician codes, specific diagnosis codes, and admission and discharge dates were only used for grouping, and ranked low in feature importance during the investigation.
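A sketch of the split and oversampling; `X` (the engineered feature matrix) and `y` (the fraud labels) are assumed from the steps above, and SMOTE is applied only to the training fold so the validation data stays untouched:

```python
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split

# 70/30 split, stratified so both sets keep the fraud/non-fraud proportions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=42
)

# Oversample the minority (fraud) class in the training data only.
X_train_bal, y_train_bal = SMOTE(random_state=42).fit_resample(X_train, y_train)
```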

Data Analysis: Logistic Regression & Random Forest

[Figure: logistic regression model results]

Since the target we are predicting is binary (fraud vs. not fraud), logistic regression was the first place to start, as this tool is well suited to binary classification. After fine-tuning the model to its best parameters using GridSearch, it handled the training data fairly well. Its detection of true positives and negatives was fairly high in both the training and validation sets; when it came to the F-score, however, the model did not do as well on unseen data.

This tends to happen when a model is good at finding patterns in the training data, because it has a lot to work with, but not so good when working with newer or smaller data, a phenomenon known as overfitting. This issue can sometimes be mitigated with techniques like increasing the sample size or randomly removing features programmatically, forcing the model to find new patterns in new data.
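A hedged sketch of the tuning step described above; the parameter grid is illustrative, since the post does not list the values searched:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

param_grid = {"C": [0.01, 0.1, 1, 10], "penalty": ["l1", "l2"]}  # illustrative values

grid = GridSearchCV(
    LogisticRegression(solver="liblinear", max_iter=1000),  # liblinear supports l1 and l2
    param_grid, scoring="f1", cv=5,
)
grid.fit(X_train_bal, y_train_bal)
print(grid.best_params_)
```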

Model Data Analysis and Significance

Looking at the measures of importance from the LR model, the top four features that stood out were:

  1. Per Provider Average Insurance Claim Amount Reimbursed
  2. Insurance Claim Amount Reimbursed
  3. Per Claim Diagnosis Code 2 Average Insurance Claim Amount Reimbursed
  4. Per Attending Physician Average Insurance Claim Amount Reimbursed

These all make sense for the most part: the quantities we would expect fraud to revolve around are the claim amounts tied to physicians and individual providers. The model appears to be doing its job.
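For a linear model, "measures of importance" typically means coefficient magnitudes (meaningful when the features were scaled before fitting, which is assumed here). A minimal sketch using the tuned model from above:

```python
import pandas as pd

# Rank features by absolute coefficient size from the tuned LR model.
coefs = pd.Series(grid.best_estimator_.coef_[0], index=X_train.columns)
print(coefs.abs().sort_values(ascending=False).head(4))
```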

Despite currently being overfit, the model performed well in most respects. The rate of actual fraud flagged and the number of valid claims flagged were both fair and acceptable. The main concern is the rate of false negatives, which may be a prominent source of error. Realistically, any false positives flagged by future versions of this model could be verified with providers on a case-by-case basis, although that number should come down with further tuning as well.

[Figure: Random Forest model results]

A Random Forest model was developed alongside the LR model for comparison, and it performed similarly. It did fairly well in all respects after tuning with GridSearch and yielded a slightly different but broadly similar ranking of feature importance. It too, however, suffered a lower F-score in validation, indicating that it is overfit as well.
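The Random Forest side of the comparison can be tuned the same way. Again a hedged sketch: the parameter grid is illustrative, and `feature_importances_` supplies the importance ranking compared against the LR coefficients:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

rf_grid = GridSearchCV(
    RandomForestClassifier(random_state=42),
    {"n_estimators": [200, 500], "max_depth": [None, 10, 20]},  # illustrative grid
    scoring="f1", cv=5,
)
rf_grid.fit(X_train_bal, y_train_bal)

# Impurity-based importances, for comparison with the LR coefficient ranking.
rf_importance = pd.Series(rf_grid.best_estimator_.feature_importances_,
                          index=X_train.columns)
print(rf_importance.sort_values(ascending=False).head(4))
```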

Model Evaluation and Takeaways

This investigation yielded some interesting takeaways about the current Medicare insurance fraud problem. The physician-to-patient ratio shows that fraud is thoroughly mixed in with non-fraud, which highlights the difficulty of the problem at hand. Claims are difficult to distinguish from one another, and any criminal activity appears to be well hidden among the masses of data submitted to Medicare. Some strengths and weaknesses of the model developed, all read from the confusion matrix (sketched in code after this list), were:

  • Accuracy, sensitivity, and specificity are good, but F1 is low in validation.
  • False positive and false negative counts matter because they represent monetary value, so this effectiveness score is important.
  • The false positive flag rate is higher than the false negative rate, but those cases can theoretically be resolved on a case-by-case basis.
  • The true positive flag rate is relatively high, which means the model is catching fraud properly.
  • The false negative flag rate is a concerning source of error; we don't want fraud to go unnoticed.
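All of the rates in the list above are read off the validation-set confusion matrix. A minimal sketch, assuming the tuned `grid` model and the held-out `X_test`/`y_test` from the split earlier, with labels encoded 0 = non-fraud, 1 = fraud:

```python
from sklearn.metrics import classification_report, confusion_matrix

y_pred = grid.predict(X_test)

# For binary 0/1 labels, ravel() returns counts in (tn, fp, fn, tp) order.
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
print(f"TP {tp}  FP {fp}  FN {fn}  TN {tn}")
print(classification_report(y_test, y_pred))  # includes the validation F1 noted above
```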

Cost/Benefit Data Analysis

While error rates are concerning, the bottom line for government programs that support people in need is money: how much do these errors cost, and how much can be saved? If we break down the issue and how this model handled it, this is what we can say (the arithmetic is reproduced in a short sketch after the list).

Original Data

  • Originally, Medicare found 38% of claims submitted for reimbursement to be fraudulent, with the average fraudulent claim reimbursement at ~$1,300 per claim.
  • The training data contains ~558,211 total claims.
  • 38% of total claims is ~212,120 claims.
  • At an average of $1,300 per fraudulent claim, that is $275,756,234 saved if they are all caught.
  • The other 62% accounts for ~346,090 claims; at $700 per claim, that is ~$242,263,700 paid out.
  • At those rates, the amount saved with Medicare's current fraud detection method is $275,756,234 - $242,263,700 = ~$33,492,534.
  • Under my LR machine learning model:
  • Total true negatives amount to 41.90%, or 233,890 claims, or $163,723,286 at a $700 average.
  • Adding in the false negatives (8.10%, or ~45,215 claims) adds $58,779,618 at a $1,300 average,
  • totaling $222,502,904 paid out.
  • True positives amount to 44.93%, or ~250,804 claims, or $326,045,462 saved at $1,300 each.
  • False positives account for 5.07%, or 28,301 claims, or $19,810,908 at a $700 average.
  • Adding up the amounts paid out gives ~$242,313,812.
  • The amount saved by catching fraud is ~$326,045,462 - $242,313,812 = ~$83,731,650.
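Since the bullet math is easy to mistype, here is a short sketch that reproduces the arithmetic above using only figures quoted in this post (the claim counts, the $1,300/$700 averages, and the confusion-matrix rates):

```python
# Savings arithmetic, using the figures quoted above.
TOTAL_CLAIMS = 558_211
AVG_FRAUD, AVG_VALID = 1_300, 700

# Medicare's baseline: 38% fraudulent claims caught, 62% valid claims paid out.
baseline_saved = 0.38 * TOTAL_CLAIMS * AVG_FRAUD   # ~$275.8M
baseline_paid  = 0.62 * TOTAL_CLAIMS * AVG_VALID   # ~$242.3M
print(f"Baseline net savings: ${baseline_saved - baseline_paid:,.0f}")  # ~$33.5M

# LR model rates: TN 41.90%, FN 8.10%, TP 44.93%, FP 5.07%.
paid  = (0.4190 * AVG_VALID + 0.0810 * AVG_FRAUD + 0.0507 * AVG_VALID) * TOTAL_CLAIMS
saved = 0.4493 * TOTAL_CLAIMS * AVG_FRAUD
print(f"LR model net savings: ${saved - paid:,.0f}")  # ~$83.7M
```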

About Author

David Green

Certified in data science and confident working with R, Python, Git, and SQL. Skilled in applying machine learning techniques to the analysis of large datasets, alongside traditional statistical analysis.