NYC Data Science Academy | Blog

Predicting Fraudulent Health Insurance Claims

Deborah Leong, Sam Nuzbrokh, Doug Devens and Aiko Liu
Posted on Jun 25, 2020

Introduction

Healthcare fraud is a large and pervasive problem in the US healthcare system. The National Healthcare Anti-Fraud Association estimates up to $300 billion in fraud out of a total of $3.6 trillion in insurance reimbursements throughout the US healthcare system. This fraud harms not only insurers but also individual citizens, the ratepayers for insurance companies, and the taxpayers who support the largest healthcare program in the country, Medicare. And it can take many forms.

Providers can bill for medical services that never took place, or bill in duplicate for services that were performed only once. A provider can 'unbundle' a claim with several diagnoses and procedures from a single medical visit into a set of claims spanning several office visits - to grab that sweet "facility fee" for each visit. Or they can upcode the severity of the diagnosis or procedure to charge the insurer more.

Because of these many forms and the size of the problem, many states require insurers to have anti-fraud detection units, in addition to criminal investigations by the Federal Bureau of Investigation. Under the False Claims Act, insurance companies can recover their money if fraud is shown, and the Act also allows punitive charges to be imposed in civil court.

Our project was to develop a machine learning model that could help insurance companies identify potentially fraudulent providers. Such a model would reduce the number of random audits required and focus resources on providers more likely to be fraudulent, while minimizing the number of investigations into innocent providers (false positives).

Data description and feature engineering to help detect fraud

There are four parties in any healthcare transaction. First, there is an insurance company that pays the majority of the cost. Second, there are providers (e.g. hospitals and clinics), the businesses that request reimbursement from the insurer for services performed on the patient.

Third, there are physicians who actually perform the services and who are often employees of the providers, but who in some cases may also be providers themselves if they run a solo practice. Finally, there are the patients (or beneficiaries) who receive the services and who pay premiums to insurers for coverage. They also pay a deductible to the provider, often a fraction of the total charges for service, as determined by the patient's insurance policy.

In our dataset there are 5,410 providers that were labeled as either potentially fraudulent or not, with about 10% of the providers labeled as fraudulent, making this an imbalanced classification problem. The data provided were a little over 550,000 claim records, with reimbursement amounts, procedure and diagnostic codes, and attending and operating physicians, among other information. There were also demographic and medical information records (e.g. chronic illnesses) for the patients.

We first noticed the majority of claims had patients above 65, with a sharp increase in claims after 65 years of age due to automatic enrollment in Medicare.  This enrollment trend was reflected in the average number of chronic illnesses, which spiked at 65 also. We also observed that one gender (presumably women) was over-represented at ages above 85.

Network Analysis on fraud

To understand fraud networks, we must first gauge their size and extent. The graphic below shows the number of providers operating in a given number of states. The y-axis is on a log scale, and the count of providers operating in 1-3 states reflects the obvious: most providers (clinics, hospitals) are local. They operate in one state or - if located on a border - might service one or two more.

However, as the number of states a provider operates in increases, the ratio of "No" fraud providers to "Yes" fraud providers decreases: fraudulent providers are overrepresented in large Medicare provider operations that span several states.


Plot showing the count of providers by the number of states in which they have associated beneficiaries.

We examined network relationships between the providers, physicians, and patients by creating an R Shiny tool for visualizing each state's provider-patient-physician networks at the county level. What we found was striking. Shown below are typical examples of a provider-patient network and a provider-physician network - in that order. Each edge between an actor and a provider is weighted and sized according to the total reimbursements from all claims filed between the two. The legend is provided as well:

Actor: Shape
  • Provider: Square, colored by label (Fraud, No Fraud, or Unknown (Test))
  • Patient / Beneficiary: Circle
  • Physician / Doctor: Triangle

                      Provider-Patient Network for New Jersey, County 300


                      Provider-Physician Network for New Jersey, County 300

The providers in our Medicare dataset - across states and counties - were much more likely to be linked to one another through shared beneficiaries than through shared physicians. This can be shown more clearly through a bipartite projection. A bipartite projection takes a network with two types of actors (Provider-Physician/Patient) and projects it onto a single type, establishing links through shared relationships from the full graph. Below, therefore, is a network of Providers linked together if they filed claims for the same Patient.


                     Projected Provider Network through shared Patients

The above is likely a general hospital and its links to various smaller clinics. Below offers a stark contrast: A completely isolated network of providers - since there is no claim filed with the same doctor for two different providers. 

                     Projected Provider Network through shared Physicians
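The projection described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the R/igraph code used in the project: the provider and beneficiary IDs are made up, and weighting each projected edge by the number of shared patients is an assumption made for the example.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical (provider, patient) pairs, one per claim; IDs are illustrative only.
claims = [
    ("PRV_A", "BENE_1"),
    ("PRV_B", "BENE_1"),
    ("PRV_B", "BENE_2"),
    ("PRV_C", "BENE_2"),
    ("PRV_D", "BENE_3"),
]


def project_providers(pairs):
    """Project a provider-patient bipartite graph onto providers.

    Two providers are linked if they filed claims for the same patient.
    The edge weight (an assumption of this sketch) counts shared patients.
    """
    providers_by_patient = defaultdict(set)
    for provider, patient in pairs:
        providers_by_patient[patient].add(provider)

    edge_weights = defaultdict(int)
    for providers in providers_by_patient.values():
        # Every pair of providers sharing this patient gets an edge.
        for a, b in combinations(sorted(providers), 2):
            edge_weights[(a, b)] += 1
    return dict(edge_weights)


projected = project_providers(claims)
print(projected)
```

In this toy data PRV_D shares no patient with anyone, so it has no edges in the projection, which is the kind of isolation seen in the physician-based projection.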


Network Feature Generation

Having gained fluency in manipulating network graphs with the igraph library in R, we moved on to analyzing the graphs through various metrics. There is a veritable menagerie of centrality, connectivity, and adjacency metrics that one could consider in application to network analysis. For our case, we limited it to four metrics for each type of projected bipartite graph (Provider-Patient-Provider, Provider-Physician-Provider). The four were:

  1. Degree: How many other providers is Provider X connected to?
  2. Betweenness: On how many shortest paths between other providers does Provider X lie?
  3. Average Nearest Neighbor (ANN): What is the average degree of the providers directly connected to Provider X?
  4. Eigenvector Centrality: A measure of the influence of Provider X (Google's PageRank is a variant of this metric).

In total then, we had eight new features that varied in their predictive strength in our models. 
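As a rough illustration of the two simplest metrics in the list, the sketch below computes degree and average nearest-neighbor degree on a small, made-up projected provider network using only the Python standard library. The edge list is hypothetical; betweenness and eigenvector centrality are omitted here because they are usually taken from a graph library (the project itself used igraph in R).

```python
from collections import defaultdict

# Hypothetical projected provider network as undirected edges.
edges = [
    ("PRV_A", "PRV_B"),
    ("PRV_B", "PRV_C"),
    ("PRV_B", "PRV_D"),
    ("PRV_C", "PRV_D"),
]


def build_adjacency(edge_list):
    """Return a mapping from each node to the set of its neighbors."""
    adjacency = defaultdict(set)
    for a, b in edge_list:
        adjacency[a].add(b)
        adjacency[b].add(a)
    return adjacency


def degree(adjacency):
    """Degree: how many other providers each provider is connected to."""
    return {node: len(neighbors) for node, neighbors in adjacency.items()}


def average_neighbor_degree(adjacency):
    """ANN: the mean degree of the providers directly connected to each node."""
    deg = degree(adjacency)
    return {
        node: sum(deg[n] for n in neighbors) / len(neighbors)
        for node, neighbors in adjacency.items()
    }


adjacency = build_adjacency(edges)
deg = degree(adjacency)
ann = average_neighbor_degree(adjacency)
print(deg)
print(ann)
```

Each metric yields one number per provider, so repeating the calculation on the patient-based and physician-based projections gives two columns per metric in the provider feature table.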


Duplication Networks of Fraud

As the next step in our analysis, we used networks to explore a class of fraud that has a network structure from the start: duplication of claims. Though each claim ID is a primary key in our dataset, the provider, physician, and beneficiary are not. Thus - especially in the outpatient data - we have a multitude of claims that seem to have been duplicated. To find these claims, we cast a wide net: we identified as duplicates those claims with the same beneficiary and the same three diagnosis codes.

We filtered out the many all-null entries, which likely represented claims associated with patient visits to outpatient clinics with no diagnosis or procedure performed.
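The duplicate rule above (same beneficiary, same three diagnosis codes, all-null entries dropped) might look like the following in Python. The record layout, field names, and code values are hypothetical placeholders rather than the dataset's actual columns.

```python
from collections import defaultdict

# Hypothetical claim records; field names and code values are placeholders.
claims = [
    {"claim_id": "C1", "provider": "PRV_A", "beneficiary": "BENE_1", "dx": ("DX1", "DX2", "DX3")},
    {"claim_id": "C2", "provider": "PRV_B", "beneficiary": "BENE_1", "dx": ("DX1", "DX2", "DX3")},
    {"claim_id": "C3", "provider": "PRV_A", "beneficiary": "BENE_2", "dx": (None, None, None)},
    {"claim_id": "C4", "provider": "PRV_C", "beneficiary": "BENE_2", "dx": (None, None, None)},
    {"claim_id": "C5", "provider": "PRV_C", "beneficiary": "BENE_3", "dx": ("DX4", None, None)},
]


def find_duplicate_groups(records):
    """Group claims by (beneficiary, three diagnosis codes) and keep groups of 2+."""
    groups = defaultdict(list)
    for record in records:
        if all(code is None for code in record["dx"]):
            continue  # drop all-null entries, as described in the text
        key = (record["beneficiary"], record["dx"])
        groups[key].append(record["claim_id"])
    return {key: ids for key, ids in groups.items() if len(ids) > 1}


duplicates = find_duplicate_groups(claims)
print(duplicates)
```

Here C1 and C2 form one duplicate group spanning two providers, while C3 and C4 are ignored even though they match, because all of their diagnosis codes are null.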

Network Relationship: Interpretation
  • Unary: Provider X duplicates claim C internally.
  • Binary: Provider Y receives claim C from Provider X.
  • Ternary: Multiple arrows between nodes indicate multiple claims being traded.
  • Implicative: Our unknown provider in purple is at the center of three fraudulent providers, trading duplicated claims with each of them. A measure of "guilt by association".



Largest duplication network found in our data - corresponds to NY-NJ-CT provider networks

Market Basket Analysis

We also performed a market-basket analysis to determine which frequent combinations of chronic illnesses and specialties might be associated with fraudulent providers. We found that certain sets of chronic conditions were associated with each other. For example, diabetes and ischemic heart disease were strongly associated, which mirrors reality: diabetes is recognized medically as a strong risk factor for coronary artery disease (i.e. ischemic heart disease).

We also found associations of diabetes with kidney disease, which again is observed medically due to the degradation of the vasculature by diabetes.  Interestingly, we also found an association between fraudulent providers and the proportion of their patients that had diabetes and ischemic heart disease.


Connection diagram from Market Basket Analysis showing connections among patients' chronic illnesses
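For readers unfamiliar with market-basket analysis, the strength of an association such as "diabetes implies ischemic heart disease" is usually summarized by support, confidence, and lift. The sketch below computes these on a tiny invented patient list; the condition names and counts are illustrative assumptions and do not reproduce the project's results.

```python
# Hypothetical patients, each represented as a set of chronic conditions.
patients = [
    {"diabetes", "ischemic_heart"},
    {"diabetes", "ischemic_heart", "kidney_disease"},
    {"diabetes", "kidney_disease"},
    {"ischemic_heart"},
    {"arthritis"},
]


def support(itemset, baskets):
    """Fraction of baskets that contain every item in the itemset."""
    itemset = set(itemset)
    return sum(itemset <= basket for basket in baskets) / len(baskets)


def rule_metrics(antecedent, consequent, baskets):
    """Support, confidence, and lift for the rule antecedent -> consequent.

    Assumes both sides occur at least once, so no division by zero.
    """
    joint = support(set(antecedent) | set(consequent), baskets)
    confidence = joint / support(antecedent, baskets)
    lift = confidence / support(consequent, baskets)
    return {"support": joint, "confidence": confidence, "lift": lift}


metrics = rule_metrics({"diabetes"}, {"ischemic_heart"}, patients)
print(metrics)
```

A lift above 1 means the two conditions co-occur more often than they would if they were independent.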

Another provider characteristic that was predictive of the likelihood of fraud was the number of days it took to resolve a claim (which closely tracked the length of stay for in-patient claims). One feature derived from this attribute was especially predictive: a provider's range between the maximum and minimum number of days that a claim lasted.

Fraudulent providers had a wider range of claim durations, from claims lasting only 1 day to claims lasting the maximum observed (35 days for in-patient, 21 days for out-patient). On the other hand, non-fraudulent providers were much more consistent in their claim duration.


In-patient claim duration range, with fraudulent providers in red on left

We observed the same trend in the percentage of the claim that was covered by insurance, with fraudulent providers showing wider ranges. We also observed that the average charge per claim and the average per-day charge of the claim (the total charge divided by the duration of the claim) were higher for fraudulent providers.


Per-day charges average for providers, with fraudulent providers in blue on right
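The two provider-level features discussed here, claim duration range and average per-day charge, can be derived with a simple aggregation. The sketch below uses invented claims and assumes every claim lasts at least one day; the feature names are hypothetical.

```python
from collections import defaultdict

# Hypothetical claims: (provider, duration_days, total_charge).
claims = [
    ("PRV_A", 1, 500.0),
    ("PRV_A", 35, 70000.0),
    ("PRV_A", 10, 15000.0),
    ("PRV_B", 3, 2400.0),
    ("PRV_B", 5, 4000.0),
]


def provider_features(records):
    """Aggregate claim-level data into per-provider features."""
    by_provider = defaultdict(list)
    for provider, days, charge in records:
        by_provider[provider].append((days, charge))

    features = {}
    for provider, rows in by_provider.items():
        durations = [days for days, _ in rows]
        # Per-day charge of a claim: total charge divided by its duration.
        per_day = [charge / days for days, charge in rows]
        features[provider] = {
            "duration_range": max(durations) - min(durations),
            "mean_per_day_charge": sum(per_day) / len(per_day),
        }
    return features


features = provider_features(claims)
print(features)
```

In this toy data PRV_A shows the wide duration range (1 to 35 days) that the text associates with fraudulent providers, while PRV_B is consistent.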

Cost Metrics: Model Evaluation and Selection based on Potential Fraud

If we assume that no model is perfect, we acknowledge that misclassification occurs and that each type of misclassification carries a cost.  In this project, the two possible misclassifications were failing to catch a fraudulent provider (false negative) and falsely accusing a provider of fraudulence (false positive).  The cost of not identifying a fraudulent provider is to let the theft of reimbursements continue and serve as an enticement for others to commit fraud. 

Alternatively, the cost of falsely accusing an innocent provider of fraud is certainly reputational damage, but also extra investigative costs and possibly legal costs as the provider fights the designation.  We believed this was a balancing act, and attempted to develop a model using realistic costs for investigations and legal disputes and to measure the amount of money that had been claimed by fraudulent providers by comparison. 

We attempted to maximize two quantities: the fraudulent claims identified (as a fraction of total claim dollars) and the ratio of the amount of money identified to the investigative expenses. This penalizes both false positives, since they represent extra cost, and false negatives, since they reduce the number of claims identified for recovery. Statistically, we optimized models based on F1, the harmonic mean of precision and sensitivity.
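To make the two evaluation quantities concrete, the sketch below computes F1 from confusion-matrix counts and a dollars-identified-per-investigative-dollar ratio. The flat cost per investigation and all dollar figures are hypothetical; the post does not state the actual costs used.

```python
def f1_score(tp, fp, fn):
    """F1: harmonic mean of precision and sensitivity (recall)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def recovery_ratio(flagged, cost_per_investigation):
    """Dollars identified from truly fraudulent providers per dollar spent.

    flagged: list of (is_fraud, reimbursed_dollars) for every provider the
    model flags. A flat cost per investigation is an assumption of this sketch.
    """
    identified = sum(dollars for is_fraud, dollars in flagged if is_fraud)
    spent = cost_per_investigation * len(flagged)
    return identified / spent


# Hypothetical counts: 6 true positives, 2 false positives, 4 false negatives.
f1 = f1_score(tp=6, fp=2, fn=4)

# Hypothetical flagged providers: two fraudulent, one falsely accused.
flagged = [(True, 300_000.0), (True, 200_000.0), (False, 150_000.0)]
ratio = recovery_ratio(flagged, cost_per_investigation=100_000.0)

print(round(f1, 3), round(ratio, 3))
```

A false positive adds investigation cost without adding identified dollars, and a false negative removes identified dollars, so both lower the ratio.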

Types of models

We examined multiple types of models, including unsupervised classification, logistic regression, decision tree models, and boosted decision tree models. The unsupervised classification models performed poorly: they identified a reasonable share of fraudulent providers, but flagged 3 times as many falsely accused providers as fraudulent ones. We did not pursue this further. We show a summary table of the models evaluated on our cost model below:

[Image: summary table of the models evaluated on our cost model]

Model performance on our cost metrics improves from left to right across the table. The first model, logistic regression with a lasso penalty, had the highest recall but misclassified 12 percent of non-fraudulent providers as fraudulent, taking up the majority of our investigative resources. The second contender was the Multi-Layer Perceptron (MLP) - an implementation of a simple neural network - which lowered the false positive rate at the expense of fewer true positives.

Final Model

We found the boosted tree models gave us the best results. LogitBoost - a variant of AdaBoost that maximizes the binomial log-likelihood directly - performed well on the final profit metric. Combining it with an MLP classifier through a logical AND operation boosted the results even further, driving down the resources allocated to investigating false positives to 13 percent and delivering a final profit metric of 5.93.

Our final model had a ratio approaching $6 of total reimbursements identified from fraudulent providers for every $1 in investigative costs. 
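The logical AND combination is straightforward: a provider is flagged only when both classifiers flag it. The sketch below shows the idea on made-up prediction vectors; the variable names are hypothetical and the models themselves are not reproduced.

```python
def and_ensemble(preds_a, preds_b):
    """Flag a provider only if both classifiers flag it."""
    if len(preds_a) != len(preds_b):
        raise ValueError("prediction lists must have the same length")
    return [bool(a) and bool(b) for a, b in zip(preds_a, preds_b)]


# Hypothetical per-provider fraud flags from the two models.
boost_flags = [True, True, False, True, False]
mlp_flags = [True, False, False, True, True]

combined = and_ensemble(boost_flags, mlp_flags)
print(combined)
```

Requiring agreement can only remove flags, so it trades some recall for fewer false positives, consistent with the drop in investigative resources spent on innocent providers.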

Finally, we note that while we identified only 60-70% of the fraudulent providers, we identified about 90% of the money billed by these providers, an indicator that our model was effective at identifying providers by size, with smaller providers more likely to escape.  The additional funds that could be recovered from these providers would offer diminishing returns on the further investment of investigative resources. We have bigger fish to fry. 

Conclusion

We have developed a set of models that are relatively effective at the identification of fraudulent providers, with metrics that take into account economic cost/benefit that approximate real tradeoffs. These models offer an attractive return on the investigative money invested, potentially allowing the insurer to reduce premiums for its policyholders. We found through multiple measures that fraudulent providers are likely to be larger and have broader networks, though some measures such as the proportion of deductibles as a fraction of total charges might be less dependent on provider size. 

Given these characteristics of the dataset and the models as they were developed, we were able to identify 90% of the claims made by fraudulent providers while maintaining reasonable investigative costs. 

 

About Authors

Deborah Leong

Deborah is a data scientist with 10+ years of domain expertise in Asset Management. She's a Certified Public Accountant with acute acumen for financial data analysis and an avid painter with natural intuition in pattern recognition. She believes...

Sam Nuzbrokh

Sam Nuzbrokh is a certified data scientist with a Master's in Space Engineering and a Bachelors in Theoretical Physics. He has 3+ years of data science, engineering, and research experience across satellite communication, engineering telemetry, and academic research....

Doug Devens

Doug Devens has a background in chemical engineering, with a doctorate in rheology of polymers. He has nearly 20 years of experience in medical device product development, with a dozen product launches. It is here he learned the...

Aiko Liu

Aiko was born and raised in Taiwan. After college graduation, he came to U.S. and got his Ph.D. at Harvard University, specializing in geometry. After having done research at several top research universities for years, he switched gear...