Predicting the Higgs Boson: Clustered Factor Variables, Separated Data, and XGBoosting
Introduction
The Higgs boson is a crucial particle in the Standard Model, our current best theory of how the Universe operates. It was first theorized in the 1960s but was not confirmed until 2012, using data from CERN's Large Hadron Collider. This was a seminal discovery, as the Higgs was the last elementary particle of the Standard Model to be confirmed. In this project, we first examine the data that was used to determine the Higgs boson's existence. We then use the same data from CERN to predict the presence of the Higgs boson in an attempt to improve upon the current algorithms in use.
Data
The data is from CERN's Large Hadron Collider and contains 250,000 observations of 31 variables, some derived and some primitive. It contains a great deal of missingness: about 70% of the observations are incomplete, and about 20% of all values in the table are missing. However, the patterns of missingness are determined almost entirely by the variable "PRI_jet_num." Depending on whether "PRI_jet_num" was 0, 1, 2, or 3, certain columns were entirely missing or entirely complete. Only the "DER_mass_MMC" variable contained what appeared to be random missingness independent of "PRI_jet_num." Thus, the data was broken down into four datasets ("PRI_jet_num" = 0, 1, 2, and 3) so that the entirely missing columns could be removed; imputation was only performed on "DER_mass_MMC."
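The split described above can be sketched in a few lines of pandas. This is a minimal illustration, not the code used for the project: it assumes missing values are stored as the sentinel -999.0, and it fills the remaining gaps in the mass column with the subset median, since the imputation method actually used is not stated. The function name is hypothetical.

```python
import pandas as pd

MISSING = -999.0  # assumption: sentinel used for missing values in the raw file


def split_by_jet_num(df, jet_col="PRI_jet_num", impute_col="DER_mass_MMC"):
    """Return {jet_num: cleaned DataFrame}, one entry per jet count."""
    df = df.replace(MISSING, float("nan"))
    subsets = {}
    for jet_num, part in df.groupby(jet_col):
        # Drop the columns that are missing for every row of this subset.
        part = part.dropna(axis=1, how="all").copy()
        # Fill the remaining gaps in the mass column with the subset median
        # (assumption: the original imputation method is not stated).
        if impute_col in part.columns:
            part[impute_col] = part[impute_col].fillna(part[impute_col].median())
        subsets[int(jet_num)] = part
    return subsets
```

Calling `split_by_jet_num(train)` on the full training frame would return one cleaned data frame per jet count, each keeping only the columns that are defined for that count.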
Understanding the Data Through Clustering
(The following is for the dataset where PRI_jet_num = 0.)
Considering that the ultimate goal of the project was to separate the boson signal from background noise, we first decided to look at how the data clustered naturally, and how signal and background were distributed within those clusters. Though a k=2 clustering resulted in a signal-background distribution (s-b distribution) that was similar to the overall s-b distribution, it became apparent that as k increased, clusters that were more uniformly signal or background started to emerge.
The first graph shows two clusters with s-b distributions similar to the overall s-b distribution (the first bar chart in each image). However, as the number of clusters increases (k=3 and k=8 are pictured), the distributions become more and more lopsided. The most uniformly "b" cluster for each k is marked with a star, and the percent "b" of the starred bar chart increases with k.
So, as the data is broken down into more and more clusters, it starts to naturally separate into S's and B's. But how far does this effect go? The maximum percent "b" cluster for each K is shown on the following graph.
We see that the maximum within-cluster percent "b" for each K increases with K, and starts to level off just shy of 100% at k = 9.
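The quantity plotted here can be computed directly from the cluster assignments and the s/b labels. The sketch below uses only the standard library; the function names are hypothetical, and it assumes the cluster ids come from whichever k-means run is being examined.

```python
from collections import Counter, defaultdict


def cluster_label_shares(cluster_ids, labels, target="b"):
    """Share of `target` labels inside each cluster, keyed by cluster id."""
    counts = defaultdict(Counter)
    for cid, lab in zip(cluster_ids, labels):
        counts[cid][lab] += 1
    return {cid: c[target] / sum(c.values()) for cid, c in counts.items()}


def max_share(cluster_ids, labels, target="b"):
    """Largest within-cluster share of `target` across all clusters."""
    return max(cluster_label_shares(cluster_ids, labels, target).values())
```

Evaluating `max_share` once per value of k, with `target="b"` or `target="s"`, gives one point per k on a curve of this kind.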
Looking at the same type of graph for S instead of B, we see that, though it unsurprisingly takes longer (signal is only about 25% of the training data for jet_num = 0), it too increases with K, and levels off at 100%.
Because the clusters become more uniformly s or b with more clustering, we can use the cluster assignments of each observation as information regarding its s-b status. We'll want to use values of K that give us the maximum average difference between each cluster s-b distribution and the overall s-b distribution. We can see the average difference for each k in the following graph:
We see that the maximal average difference occurs at k = 3. Therefore, knowing the cluster assignment of an observation for k=3 tells us the most about that observation's s-b status.
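One way to make the "average difference" concrete is sketched below. The exact definition used for the graph is not spelled out, so this assumes an unweighted mean, over clusters, of the absolute gap between each cluster's share of "b" and the overall share of "b". Both function names are hypothetical.

```python
from collections import defaultdict


def average_difference(cluster_ids, labels, target="b"):
    """Unweighted mean of |cluster share - overall share| for `target`."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for cid, lab in zip(cluster_ids, labels):
        totals[cid] += 1
        hits[cid] += lab == target
    overall = sum(hits.values()) / sum(totals.values())
    diffs = [abs(hits[cid] / totals[cid] - overall) for cid in totals]
    return sum(diffs) / len(diffs)


def best_k(assignments_by_k, labels, target="b"):
    """Return the k whose clustering has the largest average difference."""
    return max(
        assignments_by_k,
        key=lambda k: average_difference(assignments_by_k[k], labels, target),
    )
```

Here `assignments_by_k` maps each k to the list of cluster ids from that clustering. A size-weighted mean would be an equally reasonable choice and could pick a different k.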
Logistic Regression
In order to understand the predictive power of cluster assignments, a logistic regression model was fit. The following table demonstrates the impact of adding k-means cluster assignments to the data as factor variables.
The addition of factor variables continued to improve the model according to the AIC, BIC, and model accuracy. Perhaps most significant is the continued reduction of the BIC, which favors models with fewer variables. In the last line, despite adding 14 factor variables to the model (the cluster assignments for k=2 through k=15), the BIC still improved!
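For reference, the standard definitions of the two criteria make the point about the BIC concrete. Here p is the number of fitted parameters, n the number of observations, and L-hat the maximized likelihood; the symbols are introduced for illustration and do not come from the original analysis.

```latex
\begin{align}
\mathrm{AIC} &= 2p - 2\ln\hat{L} \\
\mathrm{BIC} &= p\ln n - 2\ln\hat{L}
\end{align}
```

Each extra parameter costs ln n in the BIC versus 2 in the AIC, so for any n above e^2 (about 7.4) the BIC penalizes added variables more heavily. With dummy coding, a factor with k levels typically adds k - 1 parameters, so the BIC can only fall when the gain in log-likelihood outweighs that penalty.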
Additionally, the cluster variables compare favorably even to many of the original variables. The third and fourth rows of the following table show the logistic regression's performance with the cluster variables for k = 2 through k = 8 added, first with all original variables kept and then with all original primitive variables removed.
Removing all primitive variables hurt the model, but not nearly as much as removing all factor variables (compare the difference between rows 3 and 4 with the difference between rows 3 and 1)!
A similar analysis was performed for each of the other data splits. The following graph shows, for each split, how the average difference between the cluster s-b distributions and the overall s-b distribution behaves as k varies.
Factor variables improved a logistic regression for each split of the data similarly to how they improved it for PRI_jet_num = 0. From this insight, we can deduce that there are natural clusters of Ss and Bs in the data, and that using proximity to cluster centers from the training set could be powerful in predicting Ss and Bs in the test set.
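Carrying the clusters over to the test set amounts to assigning each test observation to its nearest training-set cluster center. The NumPy sketch below assumes Euclidean distance and assumes the test features are scaled the same way as the training features were before clustering. The function names are hypothetical.

```python
import numpy as np


def assign_to_nearest_center(X, centers):
    """Index of the closest center (Euclidean distance) for each row of X."""
    X = np.asarray(X, dtype=float)
    centers = np.asarray(centers, dtype=float)
    # Pairwise squared distances, shape (n_rows, n_centers), via broadcasting.
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)


def one_hot(cluster_ids, k):
    """Encode cluster ids 0..k-1 as k indicator columns (a factor variable)."""
    return np.eye(k, dtype=int)[np.asarray(cluster_ids)]
```

The indicator columns returned by `one_hot` can then be appended to the test features, one block per value of k. For a very large test set the distance matrix may need to be computed in chunks to limit memory use.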
Finding the Signal
After demonstrating the power of clustering our data, we turned our attention to finding the best model with which to combine our cluster information with the variables of the original dataset in order to obtain the best predictive accuracy. First, we sought to validate our decision to split the data by running more powerful algorithms on both the split and un-split data.
GBM
We narrowed our focus to two separate algorithms: gbm and xgboost. We ran the gbm test on our four separate data frames from our training set, followed by simulating predictions using the training set outcomes.
Finally, we predicted on the test data and uploaded our results to Kaggle. We applied the same method to the complete data frame and uploaded those results to Kaggle as well. With the single data frame, we scored 1.16 and placed 1639th overall. With the split data frames, we scored 2.02, a substantial improvement over the single data-frame approach.
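The scores quoted here are assumed to be the competition's approximate median significance (AMS), which the challenge defined in terms of the weighted true positives s, the weighted false positives b, and a regularization term of 10. A minimal sketch of that metric follows. The helper that computes s and b from predictions is hypothetical and expects labels coded as "s" and "b".

```python
import math


def ams(s, b, b_reg=10.0):
    """Approximate median significance for weighted signal s and background b."""
    return math.sqrt(2.0 * ((s + b + b_reg) * math.log(1.0 + s / (b + b_reg)) - s))


def ams_from_predictions(y_true, y_pred, weights):
    """Sum the weights of events predicted 's', split by their true label."""
    s = sum(w for t, p, w in zip(y_true, y_pred, weights) if p == "s" and t == "s")
    b = sum(w for t, p, w in zip(y_true, y_pred, weights) if p == "s" and t == "b")
    return ams(s, b)
```

Because only events predicted as signal contribute, the metric rewards a selection that is rich in true signal and penalizes background that leaks into it.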
XGBoost
Considering the tremendous success that other Kaggle teams have had using XGBoost, we wanted to test our method with that technique as well. Since the data was already split, we simply had to convert our data frames to matrices in order to train the model.
Next, we wanted to determine a good depth and number of rounds to run the XGBoost over. We wanted to ensure that our model was comprehensive enough to capture the effect of splitting the data, so we used the parameters from another successful Kaggle submission (see blog here). The author of the post used a depth of 9 and 3000 rounds. This required fairly substantial computing power, and it took around 45 minutes to run every time.
We then ran the same model on the complete data frame and submitted both results to Kaggle. The non-split XGBoost model scored 2.36 and finished in 1371st place. The split model scored 2.83 and finished in 1184th place, a clear advantage over the non-split model.
Next Steps
Our most important next step is to assign cluster classifications to the test set and pass them to these more powerful algorithms (GBM, XGBoost) in order to increase predictive performance.
Conclusions
Cluster analysis of the training data showed that the data does naturally cluster into Ss and Bs with increasing values of K, and that using these clusters as variables can significantly increase the accuracy of simple predictive models.
With regard to our more powerful predictive models, our experiment showed that splitting the data frame and modeling each subset separately gave a clear advantage with both the gbm and xgboost algorithms. As a good number of successful Kaggle submissions used one of these algorithms, it's possible that splitting their data frames could further improve their scores.