Heart Cures: Predicting Congestive Heart Failure
Introduction & Motivations
What if humans could battle disease before it even occurred?
Indeed, disease prediction is one of the foremost machine learning innovations coming out of the medical field right now. But there are several challenges along the way. Tying medical records for individuals across doctor and hospital visits in a holistic, standardized, and compliant way is difficult enough. On top of that, there is the difficulty of leveraging the technologies available for accurate record and image recognition on those medical records and tying in centuries of medical research to reach conclusive decisions.
Separately, managing predictive outcomes with patients in a practical and cost-efficient manner is no easy feat. Each of these goals could take a team or company a decade to achieve, though we consider them interrelated and inevitable, as the health and medical fields are ripe for innovation.
With this project, we took the first steps into the convoluted medical record space, extracting and aggregating patient-level data. Our goal was to create a predictive model that can accurately identify patients with the presence of Congestive Heart Failure, one of the main causes of mortality in the developed world.
It must be noted that this is just the beginning of creating a machine-learning-based engine that could accurately predict the inherited presence and/or future development of hundreds of diseases. The motivation for this work is to give doctors and healthcare professionals a low-cost tool that can spot the early onset of diseases while giving patients more proactive means to prevent or delay critical illnesses.
Our project shows that:
- There is a critical need for standardization and centralization of medical records
- Modeling for disease prediction requires expertise in medical research and data science
- Conflicts exist between current health policies and scientific research that must be resolved with collaboration and technological tools to clear obstacles for innovation in the medical field
The medical and healthcare industries are ripe for innovation. The U.S. healthcare industry was estimated to be a $14B industry in 2019 and is projected to grow to $50.5B by 2024, a CAGR of 28.3%. At the same time, patient registries and healthcare costs are rising in an increasingly aging society. Given the circumstances, medical professionals must work intelligently and efficiently to keep costs low while improving patient care.
We believe matching the domain expertise of medical professionals with the advances in machine learning can help the medical industry evolve and rapidly adapt to this reality.
Why Congestive Heart Failure?
The CDC reports that:
“About 5.7M adults in the United States have heart failure, one in nine deaths in 2009 included heart failure as contributing cause, about half of people who develop heart failure die within five years of diagnosis, and heart failure costs the nation an estimated $30.7B each year. This total includes the cost of healthcare services, medications to treat heart failure, and missed days of work.”
Moreover, congestive heart failure is notoriously difficult and expensive to diagnose as most symptoms overlap with many other common conditions, such as diabetes, coronary artery disease, stroke, and high blood pressure. The most common symptoms are difficulty breathing, general tiredness, rapid weight gain/loss, swelling stomach and limbs, and sometimes heart attack.
The main diagnostic tests are chest X-rays, electrocardiograms, and laboratory blood tests. Within laboratory blood tests, the indicators to which we had access for this project, there are 6 common blood gas, 12 blood chemistry, and 12 hematology measurements.
Our data source for this project was the MIMIC-III database hosted by Physionet. “MIMIC-III (Medical Information Mart for Intensive Care III) is a large, freely-available database comprising de-identified health-related data associated with over forty thousand patients who stayed in critical care units of the Beth Israel Deaconess Medical Center between 2001 and 2012.” The database was developed by the MIT Lab for Computational Physiology.
It consists of 26 tables holding patients’ (anonymized) demographics, vital signs, laboratory tests, medications, diagnoses, and more. Although this database is publicly accessible, gaining access requires an eight-module training course ensuring users adequately comply with HIPAA, U.S. health record privacy policies, human subject training, and specimen research training standards.
Extracting a usable data set for disease prediction required intimate knowledge of the content hosted on each table, the relationships (primary keys/foreign keys) and types of relationships (one-to-one vs. one-to-many) of entities across the database, as well as the definitions associated with medical standards such as ICD-9 codes relating to diagnosed conditions or blood test measurement types.
We pivoted and aggregated blood test measurements grouped by individual patients, joined the diagnosis tables, and merged demographic information from the admissions and patients tables. As a result, we extracted a dataset of roughly 44,000 patients, our “observations,” each with 115 attributes. Fifteen of these attributes were demographic, both numeric (age) and categorical (ethnicity, gender, language spoken, etc.).
The other hundred attributes were a reduced 25 (the most common out of about 700 possible) blood test measurements and their minimum, maximum, mean, and standard deviation numeric values by patient (25 measurements x 4 measures each = 100 features).
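The pivot-and-aggregate step described above can be sketched on a toy table. This is a minimal illustration, assuming pandas and a long-format lab table; the column names (`subject_id`, `test`, `value`) and the values are hypothetical stand-ins, not the project's actual extraction code.

```python
import pandas as pd

# Toy long-format lab table: one row per blood test result.
# Column names and values are hypothetical, for illustration only.
labs = pd.DataFrame({
    "subject_id": [1, 1, 1, 2, 2, 2],
    "test": ["glucose", "glucose", "sodium", "glucose", "sodium", "sodium"],
    "value": [90.0, 110.0, 140.0, 150.0, 138.0, 142.0],
})

# Aggregate each test per patient into min, max, mean and standard deviation.
agg = labs.groupby(["subject_id", "test"])["value"].agg(["min", "max", "mean", "std"])

# Move the test names into columns so each patient becomes a single row.
wide = agg.unstack("test")

# Flatten the (statistic, test) column pairs into names like "glucose_max".
wide.columns = [f"{test}_{stat}" for stat, test in wide.columns]
wide = wide.reset_index()
```

With 2 tests and 4 statistics the toy result has 8 feature columns per patient, the same arithmetic that yields 25 x 4 = 100 features in the full data set. A patient with a single reading for a test gets a missing standard deviation, which is one way gaps can enter the feature table.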
Once we tied the presence of congestive heart failure to these patients, we had our binary target variable, designating ‘1’ for patients having congestive heart failure and ‘0’ for those without. Altogether, 9,829 (25.5%) of the adults in our data set had CHF, giving us a baseline accuracy of 74.5% if we always predict the majority class.
One aspect of HIPAA compliance and data privatization that presented a challenge was identifying patients’ ages from the dataset. To comply with HIPAA, patients under 15 and over 85 years of age must be entirely anonymized and their age records obscured within those age ranges. This presents a notable risk to our model, and is an example of public policy and innovation being in conflict.
The database documentation mentioned that all date figures (i.e. “Date of Birth,” “Date of Death,” all admission dates, etc.) were shifted into the future to ensure proper anonymization of patient records; however, the derived age of each patient was preserved. Age could be accurately derived, calculated as “shifted date of first admission minus shifted DOB,” with the exceptions of patients under 15 (where these two dates are the same; calculated as 0) and patients over 85 (where these two dates are calculated as 300 years apart).
These exceptions are likely due to legal barriers around obtaining proper consent. We were informed that patients over 85 averaged 91.5 years old, so all such patients were imputed as ‘92’ years of age. While this allowed a meaningful method of imputation, the age ranges below 15 and above 85 still presented notable risks to our model.
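The age rule described above can be sketched as a small function. This is a minimal sketch using only the Python standard library; the function name and the 150-year cutoff used to detect obscured records are our own assumptions for illustration, while the imputed value of 92 comes from the text.

```python
from datetime import date

IMPUTED_ELDERLY_AGE = 92   # value the text imputes for patients over 85
OBSCURED_AGE_CUTOFF = 150  # hypothetical cutoff: no real age exceeds this

def derive_age(shifted_dob: date, shifted_first_admit: date) -> int:
    """Age in whole years at first admission, computed from shifted dates."""
    years = shifted_first_admit.year - shifted_dob.year
    # Subtract a year if the birthday has not yet occurred in the admission year.
    if (shifted_first_admit.month, shifted_first_admit.day) < (
        shifted_dob.month,
        shifted_dob.day,
    ):
        years -= 1
    # Obscured elderly patients appear roughly 300 years old; impute instead.
    if years > OBSCURED_AGE_CUTOFF:
        return IMPUTED_ELDERLY_AGE
    return years
```

Because both dates are shifted by the same amount for a given patient, their difference is unaffected by the shift. Records where the two dates coincide simply come out as 0.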
Once we calculated age from the date fields, we proceeded to remove variables from our dataset that introduced bias. For example, “Expiration Flag” and “Date of Death” both encode whether a patient died (an unfortunately common result of CHF), so these fields were removed to prevent this bias in our model. “Date of Birth” and “Date of Admission” were shifted to future dates, so these features were removed as they were unnecessary once patients’ ages had been extracted from them as a separate field.
Certain hematology blood measurements, such as Hematocrit (ratio of the volume of red blood cells to the total volume of blood), Mean Corpuscular Hemoglobin Concentration (MCHC: the average concentration of hemoglobin in your red blood cells), and Red Cell Distribution Width (RDW: the range in the volume and size of your red blood cells) are blood tests calculated as mean or percentage measurements.
Therefore, these measurements could not be correctly aggregated to mean or standard deviation calculations and had to be removed as features, though their valid maxima and minima aggregate measurements were retained for modeling.
The final cleaning procedure was recategorizing our demographic fields. Language, ethnicity, and religion fields, unsurprisingly, had long tails that would introduce unnecessary sparsity in our modeling phase. For that reason, categories deemed too sparse to stand on their own were collapsed into “Other.”
‘Marital Status’ had numerous categories, so we combined similar values such as “Divorced,” “Widowed,” and “Separated” into a single “Post-Marriage” category to ensure more even sampling, on the assumption that all three life events create a similarly stressful lifestyle for an individual and affect the heart both metaphorically and physically.
Missingness & Imputation
Fortunately, all demographic (age, ethnicity, marital status, etc.) data were present, so no imputation was required for those fields. All blood measurements had some degree of missing values; most with only a few hundred to a few thousand blood measurement values missing (less than 10% of patients). However, five blood measurements (20 of the 100 fields) had about 10,000 values missing (roughly 25% of patients).
Some measurement values were labeled as missing but were actually out-of-range readings recorded as text (such as ‘>10’, ‘<1’, etc.). In these cases, the value was imputed with the boundary value. Anticipating that our modeling phase would use logistic regression and gradient descent, which require complete data, as well as tree-based classification models, which can still make decisions when values are missing, we opted to split our dataset into two versions here. All cleaning decisions made up to this point were used for the tree-based models.
The dataset for logistic regression and gradient descent required these blood measurement values be imputed.
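The boundary-value rule for entries such as ‘>10’ and ‘<1’ can be sketched as a small parser. This is a minimal sketch with a hypothetical function name; how the raw lab values were actually parsed in the project is not shown in this post.

```python
import math

def parse_lab_value(raw):
    """Convert a raw lab entry to a float, mapping '>10' or '<1' to the boundary."""
    if raw is None:
        return math.nan
    text = str(raw).strip()
    # Out-of-range readings: drop the comparison sign and keep the boundary.
    if text.startswith((">", "<")):
        text = text.lstrip("><=").strip()
    try:
        return float(text)
    except ValueError:
        # Anything else that is not numeric stays missing.
        return math.nan
```

For example, `parse_lab_value(">10")` gives `10.0`, while free-text entries remain missing and are left for the imputation step (or for the tree-based models to handle).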
For our logistic/gradient descent data set, we took special care to understand why blood measurements might be missing in order to identify the most appropriate imputation method, knowing this would affect our model accuracy down the road. After careful consideration and research, we found that most doctors in the U.S. would opt to “overtest” and run analyses on any relevant measurement for a sick individual to “play it safe” and ensure that they cannot be held liable for misdiagnosis or malpractice.
In other healthcare systems, the opposite would be likely. Therefore, we concluded that missing blood measurements are missing at random (MAR), and not missing completely at random (MCAR), with the assumption that a patient is presumed to have normal, healthy values if they are not tested. This means that we relied on medical research for what defines a normal, healthy minimum, maximum, mean, and standard deviation for each blood measurement to impute missing values, instead of the hospital population’s values, which would likely skew toward those of ill individuals.
With additional research, we found that four blood measurements (Creatinine, Hematocrit, Hemoglobin, and Red Blood Cell count) even had different “normal” levels for adult males vs. adult females. We therefore imputed all missing values as those of normal, healthy individuals; imputing by respective gender-associated values where known differences occurred. Once our data set was extracted, cleaned, and imputed we were ready to model.
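The gender-aware imputation can be sketched as follows. This is a minimal illustration assuming pandas; the column names are hypothetical and the “healthy” values in the lookup table are placeholders chosen for the example, not the clinical reference values used in the project.

```python
import pandas as pd

# Placeholder "healthy" means per gender. These numbers are illustrative
# assumptions, NOT the clinical reference values used in the project.
NORMAL_MEAN = {
    "hemoglobin_mean": {"M": 15.0, "F": 13.5},  # differs by gender
    "sodium_mean": {"M": 140.0, "F": 140.0},    # same for both
}

patients = pd.DataFrame({
    "gender": ["M", "F", "F"],
    "hemoglobin_mean": [None, None, 12.1],
    "sodium_mean": [138.0, None, None],
})

imputed = patients.copy()
for column, by_gender in NORMAL_MEAN.items():
    # Look up the assumed healthy value for each patient's gender...
    healthy = imputed["gender"].map(by_gender)
    # ...and use it only where the measurement is missing.
    imputed[column] = imputed[column].fillna(healthy)
```

Observed values are left untouched; only the gaps receive the assumed healthy value, which is what distinguishes this approach from imputing with the hospital population's mean.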
Our first important consideration was that we should optimize our models toward sensitivity (or recall) over accuracy. In a confusion matrix, used for evaluating classification models, accuracy is the share of all observations, positive and negative (having or not having CHF), that are predicted correctly. Sensitivity, on the other hand, is the share of actual positive observations (patients having CHF) that the model correctly predicts as positive.
We consider that, in the field of disease prediction, it is more important to build a sensitive model because we are more concerned with catching patients who have CHF than with confirming those who do not. In other words, for both medical professional and patient, it is better to predict that someone has CHF when they do not, and run additional tests to rule it out, than to predict they do not have CHF when they really do. An undiagnosed and untreated patient will likely suffer a great deal more than one subjected to an additional test.
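The difference between the two metrics is easy to see on a toy confusion matrix. The counts below are hypothetical and chosen only to make the arithmetic clear.

```python
def accuracy(tp, fp, tn, fn):
    """Share of all patients classified correctly."""
    return (tp + tn) / (tp + fp + tn + fn)

def sensitivity(tp, fn):
    """Share of patients who truly have CHF that the model catches."""
    return tp / (tp + fn)

# Toy confusion matrix (hypothetical counts): 100 patients, 25 with CHF.
tp, fn = 10, 15   # CHF patients caught / missed
tn, fp = 70, 5    # non-CHF patients correctly cleared / falsely flagged

acc = accuracy(tp, fp, tn, fn)   # 80 of 100 correct
sens = sensitivity(tp, fn)       # 10 of 25 CHF patients caught
```

Here the model is right 80% of the time yet misses 60% of the CHF patients, which is why a respectable accuracy alone tells us little when only a quarter of patients are positive.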
Our first model was regularized logistic regression. In logistic regression, we use an S-shaped sigmoid curve to estimate the probability that an individual has CHF (‘1’) rather than not (‘0’). As in linear regression, we fit feature coefficients that represent the degree to which each blood measurement or categorical feature affects a patient’s likelihood of having CHF. In logistic regression, however, the linear combination of features and coefficients sits in the exponent in the denominator of the sigmoid function instead of being used directly as the prediction.
Passing the linear combination through the sigmoid function, the inverse of the log-odds (logit) transformation, bounds our result between ‘0’ and ‘1,’ allowing us to classify our target. We then identify the feature coefficients and model hyperparameters that minimize inaccurate predictions by minimizing “log loss” as our loss function.
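The sigmoid and log loss described above can be written out in a few lines. This is a minimal standard-library sketch with hypothetical function names, not the fitting routine used in the project.

```python
import math

def sigmoid(z):
    """Map any real number to a probability strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(features, coefs, intercept):
    """Probability of CHF: sigmoid of the linear combination of features."""
    z = intercept + sum(c * x for c, x in zip(coefs, features))
    return sigmoid(z)

def log_loss(y_true, y_prob):
    """Average negative log-likelihood; lower means better-calibrated predictions."""
    return -sum(
        y * math.log(p) + (1 - y) * math.log(1 - p)
        for y, p in zip(y_true, y_prob)
    ) / len(y_true)
```

A linear combination of zero maps to a probability of exactly 0.5, and confident correct predictions (for example 0.9 for a true ‘1’) produce a lower log loss than hesitant ones (0.6 for the same patient).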
To ensure we can do this correctly, we had two quick data preparations. First, we needed to “dummify” our categorical (non-numeric) variables (i.e. Religion, Ethnicity, etc.) which allows the model to understand the presence of a non-numeric value in numerical terms. Dummification creates a separate column for the presence of a value where ‘1’ represents ‘yes’ and ‘0’ represents ‘no.’
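Dummification can be sketched with pandas as follows. The toy table and its column names are hypothetical; we assume `pandas.get_dummies` here as one common way to do this, not necessarily the exact call used in the project.

```python
import pandas as pd

# Toy demographic table; column names and values are hypothetical.
demo = pd.DataFrame({
    "age": [64, 71, 58],
    "marital_status": ["Married", "Single", "Post-Marriage"],
})

# One 0/1 column is created per category; numeric columns pass through unchanged.
dummies = pd.get_dummies(demo, columns=["marital_status"], dtype=int)
```

The result has one `marital_status_<category>` column per category, with a ‘1’ marking the category each patient belongs to.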
Once all features were in numerical terms, we then had to scale all features. Min-max scaling takes the relative magnitude of variables out of the equation by rescaling the range of each variable to between 0 and 1 while preserving the shape of each variable’s distribution. Consider a blood measurement that has one thousand times the effect of another at predicting CHF but is measured on only one thousandth of the scale.
Without scaling, a penalized model would struggle to identify the degree to which the micro fluctuations of the first variable predict the presence of CHF, because the very large coefficient needed to capture them would be penalized heavily. Min-max scaling makes the ranges of these two variables the same, enabling our model to weigh fluctuations in any variable equally.
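The effect of min-max scaling can be seen on two toy variables whose raw scales differ by a factor of one million. This is a minimal standard-library sketch with made-up values.

```python
def min_max_scale(values):
    """Rescale a list of numbers linearly so the minimum is 0 and the maximum is 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# Two hypothetical measurements with the same pattern on very different scales.
small = [0.001, 0.002, 0.004]
large = [1000.0, 2000.0, 4000.0]

scaled_small = min_max_scale(small)
scaled_large = min_max_scale(large)
```

After scaling, both variables run from 0 to 1 with the middle value at one third of the range, so neither dominates the other purely because of its units.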
With our data ready for logistic regression, we had to tune our model for the best sensitivity. Before tuning, we ran a base model with default hyperparameters as a baseline for comparison. Its accuracy was 80.1% and its sensitivity was 42.3%. The three most important settings we had to tune were “C” (our penalization term), the scoring metric (i.e. sensitivity, accuracy, etc.), and the cut-off value (the point on the sigmoid curve at which we decide whether a patient is positive or negative).
Penalization allowed us to “shrink” our feature coefficients to prevent overfitting and to reduce the impact of multicollinearity on our model. To optimize toward sensitivity rather than overall correctness, we chose the “ROC-AUC” scoring method instead of the default “Accuracy.” ROC-AUC does not measure sensitivity directly; it rewards models that rank CHF patients above non-CHF patients across all cut-off values, which suits the later step of choosing a cut-off.
To find the optimal C value, we used a grid search with 5-fold cross validation and found that our optimal model used L1 regularization with a C of 500. This improved our accuracy by 2pp and sensitivity by 1pp. Our last optimization method was to shift the cut-off value of the sigmoid curve. Logistic regression defaults to a 50% probability as the cut-off point, which makes sense when ‘Yes’ and ‘No’ are equally likely, like a fair coin flip.
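A grid search of this kind can be sketched with scikit-learn. We assume scikit-learn here because its parameter names (`C`, L1 penalty, ROC-AUC scoring) match the ones in this post; the synthetic data and the grid of candidate values are hypothetical stand-ins for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the patient data, with about 25% positives.
X, y = make_classification(
    n_samples=400, n_features=10, weights=[0.75, 0.25], random_state=0
)

search = GridSearchCV(
    # The liblinear solver supports the L1 (Lasso) penalty.
    estimator=LogisticRegression(penalty="l1", solver="liblinear"),
    param_grid={"C": [0.1, 1, 10, 100]},  # hypothetical candidate values
    scoring="roc_auc",                    # scoring method named in the text
    cv=5,                                 # 5-fold cross validation
)
search.fit(X, y)
best_C = search.best_params_["C"]
```

Note that in scikit-learn `C` is the inverse of the regularization strength, so a large value such as 500 corresponds to a weak penalty.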
However, our dataset is imbalanced, with only about 25% of patients having CHF, so lowering the cut-off point below 50% lets the model predict ‘Yes’ more readily and raises our sensitivity. With a little help from statistics, we identified that shifting the probability threshold to 26.3% shot our sensitivity up to 75.0% (a 32.7pp increase over the 42.3% baseline) while only reducing accuracy by 0.5pp from the baseline (to 79.6%)!
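The effect of moving the cut-off can be reproduced on a handful of made-up predictions. The probabilities below are hypothetical; only the 26.3% threshold comes from the text.

```python
def classify(probabilities, threshold=0.5):
    """Label a patient positive when the predicted probability reaches the threshold."""
    return [1 if p >= threshold else 0 for p in probabilities]

def sensitivity(y_true, y_pred):
    """Share of true CHF patients that were predicted positive."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp / (tp + fn)

# Hypothetical predicted probabilities for eight patients (first four have CHF).
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_prob = [0.80, 0.45, 0.30, 0.20, 0.40, 0.25, 0.10, 0.05]

default_sens = sensitivity(y_true, classify(y_prob, 0.5))    # 1 of 4 caught
lowered_sens = sensitivity(y_true, classify(y_prob, 0.263))  # 3 of 4 caught
```

Lowering the threshold catches two more CHF patients at the cost of one extra false positive (the patient at 0.40), which mirrors the trade of a little accuracy for a lot of sensitivity.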
One important benefit of Lasso (L1) regularization is that its penalization method shrinks coefficients that do not explain significant variance in the model to 0, so the magnitudes of the remaining coefficients give us a ranking of feature importance (the most extreme negative or positive values being the strongest).
Interpreting our model results, the red blood cell count (standard deviation and max) and MCH (standard deviation) blood tests ranked consistently strongest in feature importance. This makes sense because MCH (mean corpuscular hemoglobin) measures the average amount of hemoglobin, the protein that carries oxygen from the lungs to the rest of the body, in each red blood cell. A wildly fluctuating level of hemoglobin combined with a consistently low count of red blood cells reduces the blood’s capacity to carry oxygen, which puts added strain on the heart and can keep it from pumping blood to the brain and body properly.
One last optimization we tried with our logistic regression model was rebalancing our dataset with NearMiss, which essentially under-samples the majority class (‘No’ CHF) and creates a 50/50 split between ‘Yes’ and ‘No.’ This pushed our maximum sensitivity up to 79.7% with a cut-off threshold of 45%, L1 regularization, and a C penalization of 210.
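The idea behind NearMiss under-sampling can be sketched in NumPy. This is a simplified version of the NearMiss-1 rule (keep the majority samples whose average distance to their nearest minority samples is smallest); it is not the imbalanced-learn implementation, and we do not claim it is the exact variant used in the project. The function name and toy data are hypothetical.

```python
import numpy as np

def near_miss_1(X, y, k=3):
    """Under-sample class 0 to the size of class 1 (simplified NearMiss-1)."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    minority = X[y == 1]
    majority_idx = np.flatnonzero(y == 0)
    # Pairwise Euclidean distances: majority rows vs. minority rows.
    diffs = X[majority_idx][:, None, :] - minority[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    # Mean distance from each majority sample to its k nearest minority samples.
    k = min(k, len(minority))
    mean_knn = np.sort(dists, axis=1)[:, :k].mean(axis=1)
    # Keep the majority samples that sit closest to the minority class.
    keep_major = majority_idx[np.argsort(mean_knn)[: len(minority)]]
    keep = np.sort(np.concatenate([np.flatnonzero(y == 1), keep_major]))
    return X[keep], y[keep]

# Toy one-feature data: 2 positives and 6 negatives.
X = [[0.0], [0.1], [0.2], [1.0], [2.0], [3.0], [4.0], [5.0]]
y = [1, 1, 0, 0, 0, 0, 0, 0]
X_bal, y_bal = near_miss_1(X, y, k=2)
```

The balanced result keeps both positives plus the two negatives nearest to them (0.2 and 1.0), giving a 50/50 split. Because the retained negatives are the hardest ones to tell apart from positives, the model is pushed to predict ‘Yes’ more often.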
Our next model was a linear classifier trained with Stochastic Gradient Descent (SGD), which iteratively adjusts the coefficients to descend toward the minimum of its loss function. SGD, as opposed to standard gradient descent, updates the coefficients after each randomly ordered observation instead of computing each step from the full data set. Much of the tuning is similar to logistic regression; however, penalization works more like it does in regularized linear regression (with an alpha value that increases along with the applied penalization), and a loss function must be provided.
Through grid search cross validation and an adjusted cut-off probability threshold, our optimal model achieved maximum sensitivity of 72.9% with L1 regularization, virtually no penalty, and using a ‘log’ loss which the logistic form of SGD requires. This was about 6.8pp sensitivity below our optimal logistic model.
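To show what SGD with a ‘log’ loss does under the hood, here is a minimal pure-Python sketch without any penalty term. The function name, learning rate, epoch count, and toy data are our own assumptions for illustration; the project itself used a library implementation tuned through grid search.

```python
import math
import random

def sgd_logistic(X, y, lr=0.5, epochs=500, seed=0):
    """Fit logistic-regression weights with per-sample SGD on the log loss."""
    rng = random.Random(seed)
    w = [0.0] * len(X[0])
    b = 0.0
    order = list(range(len(X)))
    for _ in range(epochs):
        rng.shuffle(order)  # visit observations in a fresh random order
        for i in order:
            z = b + sum(wj * xj for wj, xj in zip(w, X[i]))
            p = 1.0 / (1.0 + math.exp(-z))
            err = p - y[i]  # gradient of the log loss with respect to z
            # Update from this single observation only.
            w = [wj - lr * err * xj for wj, xj in zip(w, X[i])]
            b -= lr * err
    return w, b

# Toy one-feature data: larger values indicate the positive class.
X = [[0.0], [0.2], [0.4], [0.6], [0.8], [1.0]]
y = [0, 0, 0, 1, 1, 1]
w, b = sgd_logistic(X, y)
```

Each update uses one observation, so the path toward the minimum is noisy but cheap per step, which is the main contrast with gradient descent over the full data set.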
For tree-based classifier models, we first built a Random Forest Classifier, which trains many decision trees on random subsets of the observations and features and chooses each split using the ‘gini’ criterion. After the 5-fold grid search cross validation, the optimal random forest model gave us 85.5% accuracy and 85.4% sensitivity, an increase of 5.7pp over our optimal logistic model.
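A random forest grid search along these lines can be sketched with scikit-learn. The synthetic data, the parameter grid, and the choice of `recall` as the scorer are assumptions made for this example; the post does not list the grid that was actually searched.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the patient data, with about 25% positives.
X, y = make_classification(
    n_samples=300, n_features=8, weights=[0.75, 0.25], random_state=0
)

forest_search = GridSearchCV(
    RandomForestClassifier(criterion="gini", random_state=0),
    param_grid={"n_estimators": [25, 50], "max_depth": [3, None]},  # hypothetical
    scoring="recall",  # assumed scorer; recall is another name for sensitivity
    cv=5,              # 5-fold cross validation
)
forest_search.fit(X, y)
best_forest = forest_search.best_estimator_
```

The fitted `best_forest` can then be scored on held-out data for both accuracy and sensitivity.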
We then built a Gradient Boosting (GBoost) Classifier and an EXtreme Gradient Boosting (XGBoost) Classifier. Both use shallow decision trees as weak learners and build them sequentially, fitting each new tree to the residual errors of the trees before it in order to minimize the loss function. As before, we used 5-fold grid search cross validation to identify our optimal hyperparameters.
The optimal models both used deviance as the loss function and Friedman Mean Square Error as criterion. For our GBoost Classifier, the accuracy was 85.9%; the sensitivity was 85.8%, which increased by 0.4pp compared with the Random Forest model. Our XGBoost Classifier gave us 85.8% accuracy and 86.1% sensitivity, which is the highest score among all models.
The Feature Importance Score of XGBoost shows the five most important features are maximum value of glucose, maximum value of PTT (partial thromboplastin time), maximum value of urea nitrogen, maximum platelet counts, and the maximum value of PO2 (partial pressure of oxygen), which all corresponded to our medical research as extreme values that indicate CHF in a patient.
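Ranking features by importance can be sketched as follows. As a stand-in we use scikit-learn's `GradientBoostingClassifier` rather than XGBoost, on synthetic data with hypothetical feature names, so the ranking itself is meaningless; only the mechanics carry over.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic data; the feature names below are hypothetical placeholders.
X, y = make_classification(
    n_samples=300, n_features=6, n_informative=3, random_state=0
)
names = [f"feature_{i}" for i in range(X.shape[1])]

# The default loss is the logistic (deviance) loss; splits use Friedman MSE.
model = GradientBoostingClassifier(criterion="friedman_mse", random_state=0)
model.fit(X, y)

# Pair each feature with its importance score and sort from strongest to weakest.
ranked = sorted(
    zip(names, model.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)
top_five = ranked[:5]
```

The importance scores are normalized to sum to 1, so each value can be read as a feature's share of the total reduction in loss across all trees.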
Given certain assumptions we made throughout our project, we are aware that there are potential risks in our model. In the case that we create this prediction engine as a product and use it as a real tool for doctors, we would perform additional research on each of the points below, and possibly alter or split our model to take our research results into account when utilizing it with doctors:
- Imputing assumed healthy values for unknown blood measurements (logistic models)
- People with “extreme” blood measurement values (pregnant women, anemia, etc.)
- Compound conditions (e.g. diabetic and had a stroke but without CHF)
- Legal pressure on doctors for misdiagnosis or malpractice
- Age obscuration due to HIPAA compliance, which will inevitably affect the model’s predictions
Our best-performing model was XGBoost, with 85.8% accuracy and 86.1% sensitivity. An added benefit of XGBoost is its feature importance output, which lets us connect the model back to medical research and shows which patient attributes are most decisive in recognizing Congestive Heart Failure. Lastly, the decision-making methodology of tree-based models tolerates missing data and our age assumptions, which helped us avoid some of the risks inherent in our data set.
One thing we lose with XGBoost is interpretability for non-data scientists, as understanding the model’s decisions requires deeper statistical knowledge of boosting methods that most patients and medical staff do not have. This may become an important risk when rolling this product into hospitals and seeking legal compliance, as both parties would rightfully scrutinize such a product.
One success of the logistic models is the interpretability. The decision-making criteria for penalization and coefficients have a direct relationship to model results and feature importance; both of which can also have a clear connection back to medical research. A clear downside to logistic models is the imputation required and the assumptions on an individual’s unknown health status. Additionally, these models were not as accurate nor as sensitive as any of our tree-based models.
Drawing insights from all models enables us to make an important connection between medical research and machine learning that confirms the medical field is ripe for this sort of innovation. Pushing the boundaries of prediction can enable early detection of disease, which can help prevent or delay the onset of illness and extend life expectancy for individuals.
In the future, we would like to enhance our project by taking some of the assumptions/risks laid out above into account. We would seek to build potentially separate models or build other facets to our same model given the below.
- Children’s models (age)
- Old people’s models (age)
- Holistic analysis (X-rays, electrocardiograms, chart events)
- Other hospitals in the US and abroad (model on different assumptions, such as under-testing)
- Multiple-disease prediction engine
Thank you so much for reading about our capstone project! To see our project work and presentation, check out the github repository for this project here.