Predicting Fraudulent Health Insurance Claims


Healthcare fraud is a large and pervasive problem in the US healthcare system. The National Healthcare Anti-Fraud Association estimates up to $300 billion in fraud out of a total of $3.6 trillion in insurance reimbursements throughout the US healthcare system. This fraud harms not only insurers but also individual citizens, the ratepayers for insurance companies, and the taxpayers who support the largest healthcare program in the country, Medicare. And it can take many forms.

Providers can bill for medical services that never took place, or bill twice for services that were performed only once. A provider can ‘unbundle’ a claim with several diagnoses and procedures from a single medical visit into a set of claims spanning several office visits - to grab that sweet "facility fee" for each visit. Or they can upcode the severity of the diagnosis or procedure to charge the insurer more. Because of these many forms and the size of the problem, many states require insurers to have anti-fraud detection units, in addition to criminal investigations by the Federal Bureau of Investigation. Under the False Claims Act, however, insurance companies can recover their money if fraud is shown, and the Act also allows punitive charges to be imposed in civil court. Our project was to develop a machine learning model that could help insurance companies identify potentially fraudulent providers, reducing the number of random audits required and focusing resources on providers more likely to be fraudulent, while minimizing the number of investigations into innocent providers (false positives).

Data description and feature engineering

There are four parties in any healthcare transaction.  There is an insurance company that is paying the majority of the cost.  There are providers (e.g. hospitals and clinics) that are the businesses requesting reimbursement from the insurer for services performed on the patient. There are physicians who actually perform the services and who are often employees of the providers, but in some cases may also be providers if they are a solo practice.  Finally, there are the patients (or beneficiaries) who receive the services and who pay premiums to insurers for coverage.  They will also pay a deductible to the provider, often a fraction of the total charges for service, and determined by the patient’s insurance policy. 

In our dataset there are 5,410 providers, each labeled as either potentially fraudulent or not. About 10% of the providers are labeled as fraudulent, making this an imbalanced classification problem. The data comprised a little over 550,000 claim records, with reimbursement amounts, procedure and diagnosis codes, and attending and operating physicians, among other information. There were also demographic and medical records (e.g. chronic illnesses) for the patients.

We first noticed that the majority of claims were for patients above 65, with a sharp increase in claims after 65 years of age due to automatic enrollment in Medicare. This enrollment trend was reflected in the average number of chronic illnesses, which also spiked at 65. We also observed that one gender (presumably women) was over-represented at ages above 85.

Network Analysis

To understand fraud networks, we must first gauge their size and extent. The graphic below shows the number of providers operating in a given number of states. The y-axis is on a log scale, and the count of providers operating in 1-3 states reflects the obvious: most providers (clinics, hospitals) are local. They operate in one state or - if located on a border - might service one or two more. However, as the number of states a provider operates in increases, the ratio of “No” fraud providers to “Yes” fraud providers decreases: fraudulent providers are overrepresented in large Medicare provider operations that span several states.

Plot showing the count of providers by the number of states in which they have associated beneficiaries.

We examined network relationships among the providers, physicians, and patients by creating an R Shiny tool for visualizing each state's provider-patient-physician networks at the county level. What we found was striking. Shown below are typical examples of a provider-patient network and a provider-physician network, in that order. Each edge between an actor and a provider is weighted and sized according to the total reimbursements from all claims filed between the two. The legend is as follows:

Actor                    Shape      Color
Provider                 Square     Colored by label: Fraud, No Fraud, or Unknown (Test)
Patient / Beneficiary    Circle
Physician / Doctor       Triangle

Provider-Patient Network for New Jersey, County 300


Provider-Physician Network for New Jersey, County 300

The providers in our Medicare dataset - across states and counties - were much more likely to be linked to one another through shared beneficiaries than through shared physicians. This can be shown more clearly through a bipartite projection. A bipartite projection takes a network with two types of actors (providers on one side, physicians or patients on the other) and projects it onto a single type, establishing links through shared relationships in the full graph. Below, therefore, is a network of providers linked together if they filed claims for the same patient.
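To make the projection step concrete, here is a minimal Python sketch. It is not the igraph code used in the project; the function name `project_providers` and the sample claim pairs are hypothetical, and the only assumption is that each claim can be reduced to a (provider, patient) pair.

```python
from collections import defaultdict
from itertools import combinations


def project_providers(claims):
    """Project a provider-patient bipartite graph onto providers.

    claims: iterable of (provider_id, patient_id) pairs, one per claim.
    Returns {(provider_a, provider_b): number_of_shared_patients}.
    """
    # Collect the set of providers that filed a claim for each patient.
    providers_by_patient = defaultdict(set)
    for provider, patient in claims:
        providers_by_patient[patient].add(provider)

    # Every pair of providers sharing a patient gets a link in the projection.
    shared = defaultdict(int)
    for providers in providers_by_patient.values():
        for a, b in combinations(sorted(providers), 2):
            shared[(a, b)] += 1
    return dict(shared)


# Hypothetical claims: P1 and P2 share patients b1 and b2; P3 is isolated.
claims = [("P1", "b1"), ("P2", "b1"), ("P1", "b2"), ("P2", "b2"), ("P3", "b3")]
edges = project_providers(claims)
print(edges)
```

Swapping the patient ID for a physician ID in each pair would give the physician-based projection in the same way. A provider that shares nothing with anyone, like P3 here, simply has no edges.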


Projected Provider Network through shared Patients

The above is likely a general hospital and its links to various smaller clinics. The network below offers a stark contrast: a completely isolated set of providers, since no two providers filed a claim with the same doctor.

Projected Provider Network through shared Physicians

Network Feature Generation

Having gained fluency in manipulating network graphs with the igraph library in R, we moved on to analyzing the graphs through various metrics. There is a veritable menagerie of centrality, connectivity, and adjacency metrics that one could consider in application to network analysis. For our case, we limited it to four metrics for each type of projected bipartite graph (Provider-Patient-Provider, Provider-Physician-Provider). The four were:

  1. Degree: How many other providers is Provider X connected to?
  2. Betweenness: On how many shortest paths between other providers does Provider X lie?
  3. Average Nearest Neighbor (ANN) degree: What is the average degree (number of connections) of the providers directly connected to Provider X?
  4. Eigenvector Centrality: A measure of the influence of Provider X (Google's PageRank is a variant of this metric).

In total, then, we had eight new features, which varied in their predictive strength in our models.
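As an illustration of what three of these metrics measure, the sketch below computes degree, average nearest neighbor degree, and eigenvector centrality on a small hypothetical provider graph. It is a plain-Python approximation rather than the igraph calls used in the project. The graph, function names, and iteration settings are assumptions, and betweenness is left out for brevity.

```python
from math import sqrt


def degree(adj):
    """Number of direct neighbors of each node."""
    return {node: len(nbrs) for node, nbrs in adj.items()}


def average_neighbor_degree(adj):
    """Mean degree of each node's direct neighbors (0.0 for isolated nodes)."""
    deg = degree(adj)
    return {
        node: (sum(deg[m] for m in nbrs) / len(nbrs)) if nbrs else 0.0
        for node, nbrs in adj.items()
    }


def eigenvector_centrality(adj, iterations=200, tol=1e-10):
    """Power iteration on (A + I); the shift avoids oscillation.

    Assumes an undirected, connected graph given as {node: set_of_neighbors}.
    """
    if not adj:
        return {}
    score = {node: 1.0 for node in adj}
    for _ in range(iterations):
        new = {n: score[n] + sum(score[m] for m in adj[n]) for n in adj}
        norm = sqrt(sum(v * v for v in new.values())) or 1.0
        new = {n: v / norm for n, v in new.items()}
        if max(abs(new[n] - score[n]) for n in adj) < tol:
            return new
        score = new
    return score


# Hypothetical projected provider graph: A is a hub, D hangs off A only.
adj = {"A": {"B", "C", "D"}, "B": {"A", "C"}, "C": {"A", "B"}, "D": {"A"}}
deg = degree(adj)
ann = average_neighbor_degree(adj)
eig = eigenvector_centrality(adj)
print(deg, ann, eig)
```

In this toy graph the hub A scores highest on degree and eigenvector centrality, while the peripheral provider D has the highest neighbor degree because its only neighbor is the hub.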

Duplication Networks

As the next step in our analysis, we used networks to explore a class of fraud that has a network structure from the start: duplication of claims. Though each claim ID is a primary key in our dataset, the provider, physician, and beneficiary IDs are not unique to a claim. Thus - especially in the outpatient data - we have a multitude of claims that seem to have been duplicated. To find these claims, we cast a wide net: we flagged as duplicates any claims with the same beneficiary and the same three diagnosis codes. We filtered out the many all-null entries, which likely represented patient visits to outpatient clinics with no diagnosis or procedure performed.
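The duplicate-matching rule can be sketched as follows. This is an illustrative reconstruction, not the project's code. It assumes each claim is a dictionary with hypothetical keys `provider`, `beneficiary`, and `diagnosis_codes`, and that "the same three diagnosis codes" means the first three code fields.

```python
from collections import Counter, defaultdict
from itertools import combinations


def find_duplicate_groups(claims):
    """Group claims that share a beneficiary and the same three diagnosis codes."""
    groups = defaultdict(list)
    for claim in claims:
        codes = tuple(claim["diagnosis_codes"][:3])
        # Skip all-null entries: visits with no recorded diagnosis.
        if all(code is None for code in codes):
            continue
        groups[(claim["beneficiary"], codes)].append(claim)
    # Only groups with more than one claim count as duplicates.
    return {key: grp for key, grp in groups.items() if len(grp) > 1}


def duplication_links(groups):
    """Count provider pairs that share duplicated claims.

    A pair (X, X) is a provider duplicating a claim internally.
    """
    links = Counter()
    for group in groups.values():
        providers = sorted(claim["provider"] for claim in group)
        links.update(combinations(providers, 2))
    return links


# Hypothetical claims for illustration only.
claims = [
    {"provider": "X", "beneficiary": "b1", "diagnosis_codes": ["4019", "25000", "2724"]},
    {"provider": "Y", "beneficiary": "b1", "diagnosis_codes": ["4019", "25000", "2724"]},
    {"provider": "X", "beneficiary": "b2", "diagnosis_codes": [None, None, None]},
    {"provider": "X", "beneficiary": "b2", "diagnosis_codes": [None, None, None]},
    {"provider": "X", "beneficiary": "b3", "diagnosis_codes": ["4019", None, None]},
    {"provider": "X", "beneficiary": "b3", "diagnosis_codes": ["4019", None, None]},
]
groups = find_duplicate_groups(claims)
links = duplication_links(groups)
print(links)
```

The links here are undirected. Deciding which provider "received" a claim from which would need extra information, such as claim dates, that this sketch does not assume.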

Network Relationship    Interpretation
Unary                   Provider X duplicates claim C internally.
Binary                  Provider Y receives claim C from Provider X.
Ternary                 Multiple arrows between nodes indicate multiple claims being traded.
Implicative             Our unknown provider in purple is at the center of three fraudulent providers, trading duplicated claims with each of them. A measure of "guilt by association".

Largest duplication network found in our data - corresponds to NY-NJ-CT provider networks

Market Basket Analysis

We also performed a market-basket analysis to determine which frequent combinations of chronic illnesses and specialties might be associated with fraudulent providers. We found that certain sets of chronic conditions were associated with each other. For example, diabetes and ischemic heart disease were strongly associated, which mirrors reality: diabetes is recognized medically as a strong risk factor for coronary artery disease (i.e. ischemic heart disease). We also found associations of diabetes with kidney disease, which again is observed medically due to the degradation of the vasculature by diabetes. Interestingly, we also found an association between fraudulent providers and the proportion of their patients who had diabetes and ischemic heart disease.

Connection diagram from Market Basket Analysis showing connections among patients’ chronic illnesses
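For readers unfamiliar with market-basket analysis, the sketch below shows how the strength of an association between two conditions is usually scored, using support, confidence, and lift. It is restricted to pairs of items and is not the algorithm used in the project. The function name `pair_rules` and the patient "baskets" are hypothetical.

```python
from collections import Counter
from itertools import combinations


def pair_rules(baskets, min_support=0.0):
    """Score every pair of items by support, confidence and lift.

    baskets: list of sets, e.g. one set of chronic conditions per patient.
    Lift > 1 means the pair co-occurs more often than independence predicts.
    """
    n = len(baskets)
    item_count = Counter()
    pair_count = Counter()
    for basket in baskets:
        items = sorted(set(basket))
        item_count.update(items)
        pair_count.update(combinations(items, 2))

    rules = []
    for (a, b), count in pair_count.items():
        support = count / n
        if support < min_support:
            continue
        rules.append({
            "pair": (a, b),
            "support": support,
            "confidence_a_to_b": count / item_count[a],
            "lift": support / ((item_count[a] / n) * (item_count[b] / n)),
        })
    return sorted(rules, key=lambda rule: rule["lift"], reverse=True)


# Hypothetical patients, each described by a set of chronic conditions.
baskets = [
    {"diabetes", "ischemic_heart", "kidney"},
    {"diabetes", "ischemic_heart"},
    {"diabetes", "ischemic_heart"},
    {"arthritis"},
    {"arthritis", "kidney"},
]
rules = pair_rules(baskets)
print(rules[0])
```

With this toy data the diabetes and ischemic heart disease pair comes out on top, with a lift of about 1.67.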

Another provider characteristic that was predictive of the likelihood of fraud was the number of days it took to resolve a claim (which closely tracked the length of stay for in-patient claims). In particular, the range between a provider’s shortest and longest claim durations had predictive strength. Fraudulent providers had a wider range of claim durations, from claims lasting only 1 day to claims lasting the maximum observed (35 days for in-patient, 21 days for out-patient). Non-fraudulent providers, on the other hand, were much more consistent in their claim durations.

In-patient claim duration range, with fraudulent providers in red on left

We observed the same trend in the percentage of the claim that was covered by insurance, with fraudulent providers again showing wider ranges. The average charge per claim and the average per-day charge (the total charge divided by the duration of the claim) were also higher for fraudulent providers.

Per-day charges average for providers, with fraudulent providers in blue on right
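The two provider-level features described above, the claim duration range and the average per-day charge, can be derived from claim records roughly as follows. This is a sketch under stated assumptions: the field names (`provider`, `start`, `end`, `charge`) are hypothetical, and a claim that starts and ends on the same day is counted as lasting 1 day.

```python
from collections import defaultdict
from datetime import date


def provider_features(claims):
    """Aggregate claim-level records into per-provider features."""
    durations = defaultdict(list)
    per_day_charges = defaultdict(list)
    for claim in claims:
        # Assumption: durations are inclusive, so a same-day claim lasts 1 day.
        days = (claim["end"] - claim["start"]).days + 1
        durations[claim["provider"]].append(days)
        per_day_charges[claim["provider"]].append(claim["charge"] / days)

    return {
        provider: {
            "duration_range": max(days) - min(days),
            "avg_per_day_charge": sum(per_day_charges[provider]) / len(days),
        }
        for provider, days in durations.items()
    }


# Hypothetical claims for two providers.
claims = [
    {"provider": "A", "start": date(2009, 1, 1), "end": date(2009, 1, 1), "charge": 100.0},
    {"provider": "A", "start": date(2009, 1, 1), "end": date(2009, 1, 10), "charge": 5000.0},
    {"provider": "B", "start": date(2009, 1, 3), "end": date(2009, 1, 4), "charge": 400.0},
    {"provider": "B", "start": date(2009, 1, 5), "end": date(2009, 1, 6), "charge": 600.0},
]
features = provider_features(claims)
print(features)
```

Provider A, with one 1-day claim and one 10-day claim, gets a duration range of 9, while the more consistent provider B gets a range of 0.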

Cost Metrics: Model Evaluation and Selection

If we assume that no model is perfect, we acknowledge that misclassification occurs and that each type of misclassification carries a cost. In this project, the two possible misclassifications were failing to catch a fraudulent provider (false negative) and falsely accusing a provider of fraud (false positive). The cost of not identifying a fraudulent provider is that the theft of reimbursements continues and serves as an enticement for others to commit fraud. The cost of falsely accusing an innocent provider is certainly reputational damage, but also extra investigative costs and possibly legal costs as the provider fights the designation. We saw this as a balancing act, and attempted to develop a cost model that used realistic costs for investigations and legal disputes and compared them against the amount of money claimed by fraudulent providers. We attempted to maximize both the claims identified (as a fraction of total claim dollars) and the ratio of the money identified to the investigative expenses. This penalizes false positives, since they represent extra cost, and false negatives, since they reduce the claims identified for recovery. Statistically, we optimized models based on F1, the harmonic mean of precision and sensitivity (recall).
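The two evaluation quantities discussed above can be written down in a few lines. The sketch below is a simplified stand-in for the project's cost model, which is not spelled out in this post. It assumes a single flat cost per investigation (`cost_per_investigation` is a hypothetical parameter) and ignores legal costs.

```python
def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall (sensitivity)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def recovery_ratio(flagged, is_fraud, billed, cost_per_investigation):
    """Dollars billed by correctly flagged providers per investigative dollar.

    Every flagged provider, guilty or not, costs one investigation.
    """
    identified = sum(
        amount for flag, fraud, amount in zip(flagged, is_fraud, billed)
        if flag and fraud
    )
    spent = cost_per_investigation * sum(flagged)
    return identified / spent if spent else 0.0


# Hypothetical predictions, labels and billed amounts for five providers.
flagged = [True, True, False, True, False]
is_fraud = [True, False, True, True, False]
billed = [500_000, 80_000, 40_000, 300_000, 60_000]

tp = sum(f and y for f, y in zip(flagged, is_fraud))
fp = sum(f and not y for f, y in zip(flagged, is_fraud))
fn = sum(y and not f for f, y in zip(flagged, is_fraud))

f1 = f1_score(tp, fp, fn)
ratio = recovery_ratio(flagged, is_fraud, billed, cost_per_investigation=100_000)
print(f1, ratio)
```

A false positive raises the denominator of the ratio without adding to the numerator, and a false negative removes dollars from the numerator, so both kinds of error lower the score.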

We examined multiple types of models, including unsupervised learning, logistic regression, decision tree models, and boosted decision tree models. The unsupervised models performed poorly, with reasonable identification of fraudulent providers but with 3 times as many falsely accused providers as fraudulent providers. We did not pursue them further. The table below summarizes the models evaluated with our cost model:

Model performance according to our cost metrics improves from left to right across the table. The first model, logistic regression with a lasso penalty, had the highest recall but misclassified 12 percent of non-fraudulent providers as fraudulent - taking up the majority of our investigative resources. The second contender was the Multi-Layer Perceptron (MLP) - an implementation of a simple neural network - which lowered the false positive rate at the expense of fewer true positives. We found that the boosted tree models gave us the best results. LogitBoost - a variant of AdaBoost that maximizes the binomial log-likelihood directly - performed very well on the final profit metric. Combining it with an MLP classifier through a logical AND operation boosted the results even further - driving the share of resources allocated to investigating false positives down to 13 percent - and delivering a final profit metric of 5.93.
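The logical AND combination is simple to state in code: a provider is flagged only when both classifiers flag it. The sketch below uses made-up prediction lists in place of the actual LogitBoost and MLP outputs, so the names and numbers are purely illustrative.

```python
def and_ensemble(pred_a, pred_b):
    """Flag a provider only if both models flag it."""
    return [bool(a and b) for a, b in zip(pred_a, pred_b)]


def count_errors(pred, truth):
    """Return (true positives, false positives) for boolean predictions."""
    tp = sum(p and t for p, t in zip(pred, truth))
    fp = sum(p and not t for p, t in zip(pred, truth))
    return tp, fp


# Hypothetical outputs of two models on five providers.
boosted_pred = [True, True, True, False, True]
mlp_pred = [True, False, True, True, False]
truth = [True, False, True, True, False]

combined = and_ensemble(boosted_pred, mlp_pred)
print(count_errors(boosted_pred, truth), count_errors(combined, truth))
```

Because the AND rule can only remove flags, the combined model never has more false positives than either input model. The price is that it can also lose true positives that only one model caught.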

Our final model had a ratio approaching $6 of total reimbursements identified from fraudulent providers for every $1 in investigative costs. 

Finally, we note that while we identified only 60-70% of the fraudulent providers, we identified about 90% of the money billed by these providers, an indicator that our model was effective at identifying providers by size, with smaller providers more likely to escape.  The additional funds that could be recovered from these providers would offer diminishing returns on the further investment of investigative resources. We have bigger fish to fry. 
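The gap between the share of fraudulent providers caught and the share of fraudulent dollars caught is easy to see with a small worked example. The figures below are invented for illustration and are not taken from the project's data.

```python
def provider_and_dollar_recall(flagged, is_fraud, billed):
    """Return (fraction of fraudulent providers caught, fraction of their dollars caught)."""
    fraud = [(flag, amount) for flag, y, amount in zip(flagged, is_fraud, billed) if y]
    total = sum(amount for _, amount in fraud)
    if not fraud or total == 0:
        return 0.0, 0.0
    caught = [amount for flag, amount in fraud if flag]
    return len(caught) / len(fraud), sum(caught) / total


# Hypothetical case: the one missed fraudulent provider is also the smallest.
flagged = [True, True, False, False]
is_fraud = [True, True, True, False]
billed = [900_000, 60_000, 40_000, 500_000]

provider_recall, dollar_recall = provider_and_dollar_recall(flagged, is_fraud, billed)
print(provider_recall, dollar_recall)
```

Here two of three fraudulent providers are caught (about 67%), but because the missed provider is small, 96% of the fraudulent dollars are still identified.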


We have developed a set of models that are relatively effective at identifying fraudulent providers, with metrics that take into account economic costs and benefits and approximate real tradeoffs. These models offer an attractive return on the investigative money invested, potentially allowing the insurer to reduce premiums for its policyholders. We found through multiple measures that fraudulent providers are likely to be larger and have broader networks, though some measures, such as deductibles as a fraction of total charges, might be less dependent on provider size. Given these characteristics of the dataset and the models as they were developed, we were able to identify about 90% of the claim dollars billed by fraudulent providers while maintaining reasonable investigative costs.


About Authors


Deborah Leong

Deborah is a data scientist with 10+ years of domain expertise in Asset Management. She's a Certified Public Accountant with acute acumen for financial data analysis and an avid painter with natural intuition in pattern recognition. She believes...

Sam Nuzbrokh

Sam Nuzbrokh is a certified data scientist with a Master's in Space Engineering and a Bachelors in Theoretical Physics. He has 3+ years of data science, engineering, and research experience across satellite communication, engineering telemetry, and academic research....

Doug Devens

Doug Devens has a background in chemical engineering, with a doctorate in rheology of polymers. He has nearly 20 years of experience in medical device product development, with a dozen product launches. It is here he learned the...

Aiko Liu

Aiko was born and raised in Taiwan. After college graduation, he came to U.S. and got his Ph.D. at Harvard University, specializing in geometry. After having done research at several top research universities for years, he switched gear...