Understanding Glasgow Coma Scale Scores and Predicting Recovery
After suffering traumatic brain injury, patients in neurosurgery intensive care units are usually evaluated for responsiveness according to the Glasgow Coma Scale (GCS) at each hour of their stay. A GCS score consists of three components: a score for verbal response (1-5), motor response (1-6), and eye opening (1-4). These scores are recorded individually and then summed to produce an aggregate GCS score from 3 to 15. I was given the unique opportunity to analyze GCS scores taken at the Mount Sinai Neurosurgery ICU from 2012 through 2016, with the goal of assessing the GCS as a metric, understanding longitudinal patterns in patients, and determining how predictable recovery is from time series activity.
Data:
- Mount Sinai Neurosurgery Department ICU data
- 1,936,854 individual score entries
- 5,456 patients
- Provided as an Excel file with the following columns:
  - Patient visit ID
  - GCS score
  - Type of score (total, verbal, eye, motor)
  - Time of recording
Questions: The project aimed to answer the following:
- What is a GCS score telling us? How do the sub-scores relate to each other?
- How do patients do? What can we say generally about recovery paths?
- Can we use features of a patient's GCS time series beyond their current score to predict outcome?
The data required substantial cleaning and reformatting. The kNN algorithm was used to impute a very small amount of missing data. The data came in as individual sub-scores; components of the same score were not connected except by the time of their taking, and entries of the same patient were connected only by a Visit ID #. The data was rearranged predominantly using R's dplyr, tidyr, and reshape2 packages. Additionally, the data included large gaps of time during which scores were not taken, so hourly scores were inferred from last recorded scores in order to normalize the amount of time represented by any single entry.
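The reshaping and hourly fill described above were done in R; the following is only a minimal Python sketch of the same idea. It assumes timestamps have already been converted to integer hours, and the record layout, the visit ID `"V1"`, and the function names `to_wide` and `fill_hourly` are hypothetical, not taken from the original pipeline.

```python
from collections import defaultdict

# Hypothetical long-format records: (visit_id, hour, score_type, score).
records = [
    ("V1", 0, "verbal", 1), ("V1", 0, "motor", 5), ("V1", 0, "eye", 2),
    ("V1", 3, "verbal", 4), ("V1", 3, "motor", 6), ("V1", 3, "eye", 4),
]

def to_wide(records):
    """Group sub-scores recorded at the same visit and hour into one row."""
    wide = defaultdict(dict)
    for visit_id, hour, score_type, score in records:
        wide[(visit_id, hour)][score_type] = score
    for parts in wide.values():
        # Only compute a total when all three components are present.
        if all(k in parts for k in ("verbal", "motor", "eye")):
            parts["total"] = parts["verbal"] + parts["motor"] + parts["eye"]
    return dict(wide)

def fill_hourly(wide, visit_id):
    """Carry the last recorded row forward so every hour has one entry."""
    hours = sorted(h for (v, h) in wide if v == visit_id)
    filled, last = {}, None
    for hour in range(hours[0], hours[-1] + 1):
        last = wide.get((visit_id, hour), last)
        filled[hour] = last
    return filled

hourly = fill_hourly(to_wide(records), "V1")
```

Here hours 1 and 2 have no recording, so they inherit the row from hour 0, which is the "last recorded score" rule applied at hourly resolution.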
Getting Acquainted with the Data:
I. The distribution of scores is strongly left-skewed--the data is overwhelmingly 'well.' As the following bar charts show, aggregate scores are concentrated around 15, and the sub-scores are similarly concentrated at high values. The exception is verbal, which has substantial counts at both extremes.
II. The sub-scores are highly correlated. All three pairwise linear correlations are above 0.5, and the eye/motor correlation is almost 0.7.
Yet the linear dependence does not tell the full story. Verbal, which has the lowest correlation coefficients, is the strongest indicator of other high scores. As we can see in the 3-way jitter plot below (color = motor), high verbal scores are accompanied by very few low scores in the other categories: the top right corner of the graph is essentially free of red dots (low motor), and the top left corner (high verbal, low eye) is empty. Its bimodal distribution shown above, together with the absence of points in the middle layer of the graph below, indicates that verbal jumps from low to high without stopping at intermediate scores, which probably explains its lower linear correlation. At high scores, however, verbal is the best indicator of wellness.
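As a rough illustration of how the pairwise linear correlations in II can be computed, here is a minimal sketch of the Pearson coefficient. The sub-score values below are made up for demonstration and are not the Mount Sinai data; `pearson` is a hypothetical helper name.

```python
from itertools import combinations
from math import sqrt

def pearson(xs, ys):
    """Pearson linear correlation between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Made-up sub-scores, one position per hourly entry (illustration only).
scores = {
    "verbal": [1, 1, 1, 4, 5, 5],
    "motor":  [4, 5, 6, 6, 6, 6],
    "eye":    [1, 2, 3, 4, 4, 4],
}

# Correlation for each pair of sub-scores.
pairwise = {
    (a, b): pearson(scores[a], scores[b])
    for a, b in combinations(scores, 2)
}
```

Because the coefficient only measures linear association, a sub-score that jumps between extremes, as verbal does, can show a lower value even when it is highly informative.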
III. Aggregate scores have very low composition variance. While the architect of the GCS metric likely designed separate, dynamic sub-scores as a means of displaying independent features of the recovery process, in practice there is very little variation in the composition of a given aggregate score. For example, while many combinations of sub-scores could produce a 7, few of them are actually observed. The sub-score breakdown of a 7 shows that 7s consist almost exclusively of verbal 1s, high motor scores, and low eye scores.
In fact, verbal scores stay low in aggregate scores as high as 12, and only at 13 do we see substantial verbal recovery. Yet even in high aggregate scores where verbal is above its minimum of 1, composition remains nearly invariant.
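To make the "many possible combinations" point in III concrete, the sketch below enumerates every (verbal, motor, eye) triple that sums to a given aggregate, using the component ranges stated in the introduction. The `observed` list is invented purely to show how a composition breakdown could be tallied; it is not the real data.

```python
from collections import Counter
from itertools import product

VERBAL, MOTOR, EYE = range(1, 6), range(1, 7), range(1, 5)

def possible_compositions(total):
    """All (verbal, motor, eye) triples whose sum equals the aggregate score."""
    return [
        (v, m, e)
        for v, m, e in product(VERBAL, MOTOR, EYE)
        if v + m + e == total
    ]

def observed_share(observed, total):
    """Fraction of observed entries with this total, per composition."""
    counts = Counter(c for c in observed if sum(c) == total)
    n = sum(counts.values())
    return {c: k / n for c, k in counts.most_common()}

n_possible = len(possible_compositions(7))

# Invented observations (illustration only): ten entries that total 7.
observed = [(1, 5, 1)] * 8 + [(1, 4, 2)] * 2
shares = observed_share(observed, 7)
```

Under these ranges there are 14 ways to form a 7, so a breakdown dominated by one or two compositions indicates how little independent information the sub-scores add.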
IV. Principal component analysis (PCA) confirms the interdependence and redundancy of GCS scores. Parallel analysis suggests 1 principal component, but if we visualize the first two, we see that ALL three scores are similar along the first component, and the second only separates verbal.
Although verbal appears far from eye and motor (which sit literally on top of each other), the distance between them is quite small--note the scale of the x axis. Eye, motor, and verbal are redundant along the first principal component, and the ill-advised second component only distinguishes verbal. Medical practitioners have been keeping track of three scores and their total since the conception of the GCS in 1974, but have really only been looking at one--maybe one and a half--pieces of information.
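For readers who want to see what "similar along the first component" means numerically, here is a small sketch that extracts the first principal component of a correlation matrix by power iteration. The matrix entries are hypothetical values chosen only to match the description above (all correlations above 0.5, eye/motor close to 0.7); the original analysis used its own PCA and parallel analysis tooling, which this does not reproduce.

```python
from math import sqrt

# Hypothetical correlation matrix, order: verbal, motor, eye.
corr = [
    [1.00, 0.55, 0.60],
    [0.55, 1.00, 0.69],
    [0.60, 0.69, 1.00],
]

def first_component(matrix, iterations=200):
    """Power iteration: dominant eigenvalue and unit eigenvector."""
    n = len(matrix)
    vec = [1.0 / sqrt(n)] * n
    for _ in range(iterations):
        nxt = [sum(matrix[i][j] * vec[j] for j in range(n)) for i in range(n)]
        norm = sqrt(sum(x * x for x in nxt))
        vec = [x / norm for x in nxt]
    # Rayleigh quotient of the converged unit vector gives the eigenvalue.
    eigenvalue = sum(
        vec[i] * sum(matrix[i][j] * vec[j] for j in range(n)) for i in range(n)
    )
    return eigenvalue, vec

eigenvalue, loadings = first_component(corr)
# The trace of a correlation matrix equals its dimension, so this ratio
# is the share of total variance carried by the first component.
explained = eigenvalue / len(corr)
```

With these made-up correlations, all three loadings have the same sign and the first component carries roughly three quarters of the variance, which is the kind of redundancy the section describes.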
My Recommendation: One numeric coma score and a binary mark (check/no check) for verbal.
How do patients do?
...and what can we say about recovery patterns? To determine patients of interest, we took the minimum score and last score for each visit ID (patient). We found that the k-means clustering algorithm separated the patients into very intuitive recovery groups, as shown below (points are 'jittered' for visualization purposes, so that they do not sit on top of each other).
Among patients who at some point scored below 12, each responsiveness progression fits one of the following descriptions: poor to well (blue), poor to ok (green), poor to poor (red), ok to ok (light blue), and ok to well (black). Additionally, we notice a natural separation (a paucity of points) at about 12, which is one justification for choosing 12 as the recovery threshold.
For the sake of good will, I'm pleased to point out that most patients do indeed get better, as demonstrated by the densely populated row at last score = 15.
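The min-score/last-score clustering can be sketched with a plain implementation of Lloyd's k-means algorithm. Everything below is illustrative: the points are invented, three clusters are used instead of the five found in the analysis, and the starting centroids are fixed by hand so the result is deterministic. `summarize`, `dist2`, and `kmeans` are hypothetical helper names.

```python
def summarize(series):
    """Reduce one visit's score series to (minimum score, last score)."""
    return (min(series), series[-1])

def dist2(p, q):
    """Squared Euclidean distance between two 2-D points."""
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2

def kmeans(points, centroids, iterations=50):
    """Lloyd's algorithm on 2-D points, starting from the given centroids."""
    centroids = list(centroids)
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: dist2(p, centroids[i]))
            clusters[nearest].append(p)
        # Move each centroid to the mean of its cluster; keep it if empty.
        updated = [
            (sum(x for x, _ in c) / len(c), sum(y for _, y in c) / len(c))
            if c else old
            for c, old in zip(clusters, centroids)
        ]
        if updated == centroids:
            break
        centroids = updated
    labels = [
        min(range(len(centroids)), key=lambda i: dist2(p, centroids[i]))
        for p in points
    ]
    return centroids, labels

# Invented (min score, last score) pairs, illustration only.
points = [
    (3, 15), (4, 15), (3, 14),     # poor to well
    (3, 3), (4, 4), (3, 5),        # poor to poor
    (10, 15), (11, 15), (10, 14),  # ok to well
]
centroids, labels = kmeans(points, [(3, 15), (3, 3), (11, 15)])
```

In practice a library implementation with several random restarts would be the more usual choice; the hand-rolled version is shown only so the mechanics are visible.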
Longitudinal Patterns and Prediction:
The above graph of min-score, last-score clustering frames the predictive question nicely: how do we separate the red and green from the blue, and the light blue from the black? Can we use differences in their GCS series to distinguish them?
It was helpful to find that most recovery takes place within the first ~25 hours after a patient hits their minimum score. The following graph shows, for each hour since a patient hit their minimum, average GCS scores by cluster.
We can see that most curves have leveled off by around hour 25. Additionally, a distinguishing aspect of patients who make full recoveries (clusters 2 and 5) is noticeable verbal recovery. We don't see patients make the "verbal leap" in clusters that don't recover to a score close to 15.
To see these time series visualizations, as well as distributions of length of stay and time spent at low scores for a customizable subset of patients, go to: https://wbartlett.shinyapps.io/GCS_Shiny/
In order to frame a predictive question for recovery at different points in time, each entry had to be represented as an observation with a set of features describing the patient's GCS pattern up to that point. This required engineering time series features imagined to be important, most of which were not present in the earlier organization of the data. The following features were produced:
- Current Score
- Hours since Current Minimum Score
- Time Spent between 3 and 8
- Time Spent between 9 and 11
- Time Spent between 12 and 15
- Trend over the last 7 entries
- (Current Score-12) * log(Hours Since Current Minimum Score)
The last feature was engineered to give scores later in a progression more weight, with its sign changing at 12 (the threshold for recovery). As seen in the following graph, the current score approaches the discharge score approximately logarithmically with time. Thus, the more time has passed since a patient's minimum score, the more strongly the model should treat their current score as a predictor of their ultimate discharge score.
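The feature list above can be sketched for a single hourly series as follows. Several details are assumptions, since the write-up does not specify them: "hours since current minimum" is measured from the most recent hour at which the running minimum was recorded, "trend" is taken to be a least-squares slope over up to the last 7 entries, time spent in a band is counted in hourly entries, and the interaction term is set to 0 when no time has passed (where the logarithm is undefined). The names `slope` and `features_at` are hypothetical.

```python
from math import log

def slope(values):
    """Least-squares slope of values against their index (0 if < 2 points)."""
    n = len(values)
    if n < 2:
        return 0.0
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    num = sum((i - mean_x) * (v - mean_y) for i, v in enumerate(values))
    den = sum((i - mean_x) ** 2 for i in range(n))
    return num / den

def features_at(series, t):
    """Features describing an hourly GCS series up to and including hour t."""
    history = series[: t + 1]
    current = history[-1]
    lowest = min(history)
    # Assumption: measure from the most recent hour at the running minimum.
    last_min_hour = max(i for i, s in enumerate(history) if s == lowest)
    hours_since_min = t - last_min_hour
    return {
        "current": current,
        "hours_since_min": hours_since_min,
        "hours_3_8": sum(1 for s in history if 3 <= s <= 8),
        "hours_9_11": sum(1 for s in history if 9 <= s <= 11),
        "hours_12_15": sum(1 for s in history if 12 <= s <= 15),
        "trend_7": slope(history[-7:]),
        # Assumption: define the interaction as 0 when hours_since_min is 0.
        "interaction": (current - 12) * log(hours_since_min)
        if hours_since_min > 0 else 0.0,
    }

# Invented hourly series, illustration only.
series = [14, 6, 6, 9, 11, 13, 15]
f = features_at(series, 6)
```

For this series the patient is 4 hours past the minimum with a current score of 15, so the interaction term is 3 * log(4): positive because the score is above 12, and growing as more time passes.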
Based on these features, various models were tried, with a preference for tree-based algorithms (as many of the features are likely not linear, but could make helpful distinctions between subgroups). Last score was converted to a binary output variable (greater than 11 -> 1). Ultimately, XGBoost (extreme gradient boosted decision trees) achieved accuracy of just under 94% at best on unseen data. Accuracy varied substantially depending on the hour of the observation (because the current score becomes more representative of the final score over time). Accuracy by hour is shown below.
While the utility of a model with this level of accuracy depends on the user, the general success of this model (over guessing from the current score, guessing the majority outcome, or a simple linear regression of last score ~ score) shows that we can make prognostic distinctions between similar GCS scores based on their time series activity. Taking into account aspects of a patient's GCS progression can indeed increase our knowledge of their chances for recovery.
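The baselines mentioned above, and the idea of reporting accuracy by hour, can be sketched without any modeling library. This does not reproduce the XGBoost model or its results; the observations are invented, and `recovered`, `accuracy`, and `accuracy_by_hour` are hypothetical helper names.

```python
from collections import defaultdict

THRESHOLD = 12  # last score greater than 11 counts as recovered

def recovered(score):
    """Binary outcome: 1 if the score is at or above the threshold."""
    return 1 if score >= THRESHOLD else 0

def accuracy(preds, labels):
    """Fraction of predictions that match the labels."""
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

def accuracy_by_hour(observations, preds):
    """Accuracy grouped by hours since the patient's minimum score."""
    hits = defaultdict(list)
    for (hour, _, last), p in zip(observations, preds):
        hits[hour].append(p == recovered(last))
    return {hour: sum(h) / len(h) for hour, h in sorted(hits.items())}

# Invented observations: (hours_since_min, current_score, last_score).
observations = [
    (1, 6, 15), (1, 7, 5), (1, 13, 15), (1, 10, 14),
    (25, 14, 15), (25, 6, 5), (25, 15, 15), (25, 11, 14),
]
labels = [recovered(last) for _, _, last in observations]

# Baseline 1: assume the outcome matches the current score's class.
current_guess = [recovered(cur) for _, cur, _ in observations]

# Baseline 2: always predict the majority outcome.
majority = max(set(labels), key=labels.count)
majority_guess = [majority] * len(labels)

by_hour = accuracy_by_hour(observations, current_guess)
```

A model trained on the time series features is only useful to the extent that it beats both baselines at the same hour, which is why accuracy is best compared hour by hour rather than as a single number.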
Next Steps:
- Evaluate individual features for their effect on recovery chances to increase interpretability.
- Discuss features with Mount Sinai neurosurgeons to gain insight on "what might matter" from domain experts.
- Merge with clinical data.
- Fit more powerful models to increase absolute accuracy.
- Develop a tool to provide clinicians and others with prognostic insight.