Predicting the Higgs Boson: Clustered Factor Variables, Separated Data, and XGBoosting

William Bartlett
Christian Holmes
Posted on Sep 6, 2016


The Higgs boson is a crucial particle in the Standard Model, our current best description of how the Universe operates. It was first theorized in the 1960s but was not confirmed until 2012, using data from CERN's Large Hadron Collider. This was a seminal discovery, as the Higgs was the last elementary particle predicted by the Standard Model to be confirmed. In this project, we first examine the data that was used to determine the Higgs boson's existence. We then use the same data from CERN to predict the presence of the Higgs boson, in an attempt to improve upon the algorithms currently in use.


The data comes from CERN's Large Hadron Collider and contains 250,000 observations of 31 variables, some derived and some primitive. It contains a great deal of missingness: about 70% of the observations are incomplete, and about 20% of all values in the table are missing. However, the patterns of missingness are determined almost entirely by the variable "PRI_jet_num." Depending on whether "PRI_jet_num" was 0, 1, 2, or 3, certain columns were either entirely missing or entirely complete. Only the "DER_mass_MMC" variable contained missingness that appeared to be random and independent of "PRI_jet_num." We therefore split the data into four datasets ("PRI_jet_num" = 0, 1, 2, and 3) so that the entirely missing columns could be removed from each; imputation was performed only on "DER_mass_MMC."
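
The split-and-drop step described above can be sketched in a few lines of plain Python. This is only an illustration, not our original code: it assumes missing values are coded with the sentinel -999.0, and the helper name and toy rows are made up.

```python
from collections import defaultdict

MISSING = -999.0  # assumption: sentinel used to mark a missing value


def split_by_jet_num(rows, key="PRI_jet_num"):
    """Group rows (dicts) by jet count, then drop columns that are
    missing in every row of a group."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)

    cleaned = {}
    for jet_num, group in groups.items():
        # Keep a column if at least one row in this group has a real value.
        keep = [col for col in group[0] if any(r[col] != MISSING for r in group)]
        cleaned[jet_num] = [{col: r[col] for col in keep} for r in group]
    return cleaned


# Tiny made-up example, not real collider data.
rows = [
    {"PRI_jet_num": 0, "DER_mass_MMC": 120.5, "PRI_jet_leading_pt": MISSING},
    {"PRI_jet_num": 0, "DER_mass_MMC": MISSING, "PRI_jet_leading_pt": MISSING},
    {"PRI_jet_num": 1, "DER_mass_MMC": 98.1, "PRI_jet_leading_pt": 46.2},
]
splits = split_by_jet_num(rows)
```

In the toy data, the jet column is dropped from the PRI_jet_num = 0 split because it is missing in every row, while the partially missing mass column is kept so that it can be imputed afterwards.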


Understanding the Data Through Clustering 

(The following analysis uses the dataset where PRI_jet_num = 0.)

Since the ultimate goal of the project was to separate the Higgs boson signal from background noise, we first looked at how the data clustered naturally and how signal and background were distributed within those clusters. A k = 2 clustering produced clusters whose signal-background distribution (s-b distribution) was similar to the overall s-b distribution, but as k increased, clusters that were more uniformly signal or background started to emerge.

[Figures: bar charts of the s-b distribution within each cluster for k = 2, k = 3, and k = 8]

The first graph shows two clusters with s-b distributions similar to the overall s-b distribution (the first bar chart in each image). However, as the number of clusters increases (k = 3 and k = 8 are pictured), the distributions become more and more lopsided. The most uniformly "b" cluster for each k is marked with a star, and the percent "b" of the starred bar chart increases with k.

So, as the data is broken into more and more clusters, it starts to separate naturally into Ss and Bs. But how far does this effect go? The maximum percent "b" found in any single cluster is shown for each k in the following graph.

[Figure: maximum within-cluster percent "b" for each k]

We see that the maximum within-cluster percent "b" for each K increases with K, and starts to level off just shy of 100% at k = 9.
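
The quantity plotted here can be computed from cluster assignments alone. The following is a minimal sketch, assuming each observation already has a cluster id and an "s" or "b" label; the function name and the six toy observations are hypothetical.

```python
from collections import Counter


def percent_b_by_cluster(cluster_ids, labels):
    """Return {cluster_id: percent of its observations labeled "b"}."""
    totals = Counter(cluster_ids)
    b_counts = Counter(c for c, y in zip(cluster_ids, labels) if y == "b")
    return {c: 100.0 * b_counts[c] / totals[c] for c in totals}


# Made-up assignments for six observations in a k = 2 clustering.
cluster_ids = [0, 0, 0, 1, 1, 1]
labels = ["b", "b", "b", "b", "s", "s"]

per_cluster = percent_b_by_cluster(cluster_ids, labels)
purest_b = max(per_cluster.values())  # the value plotted for this k
```

Repeating this for each k and keeping `purest_b` gives one point per k on the curve; swapping "b" for "s" gives the signal version.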

Looking at the same type of graph for S instead of B, we see that, though it unsurprisingly takes longer (signal is only about 25% of the training data for jet_num = 0), it too increases with K, and levels off at 100%.

[Figure: maximum within-cluster percent "s" for each k]

Because the clusters become more uniformly s or b with more clustering, we can use the cluster assignments of each observation as information regarding its s-b status.  We'll want to use values of K that give us the maximum average difference between each cluster s-b distribution and the overall s-b distribution.  We can see the average difference for each k in the following graph:

[Figure: average difference between the cluster s-b distributions and the overall s-b distribution for each k]

We see that the maximal average difference occurs at k = 3.  Therefore, knowing the cluster assignment of an observation for k=3 tells us the most about that observation's s-b status.
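
One plausible way to compute such an average difference is sketched below. It assumes the difference is the absolute gap between a cluster's signal share and the overall signal share, averaged over clusters without weighting by cluster size; the exact definition behind the graph may differ, and the data is made up.

```python
def average_difference(cluster_ids, labels):
    """Mean absolute gap between each cluster's signal share and the
    overall signal share (unweighted across clusters)."""
    overall = sum(y == "s" for y in labels) / len(labels)
    by_cluster = {}
    for c, y in zip(cluster_ids, labels):
        by_cluster.setdefault(c, []).append(y == "s")
    gaps = [abs(sum(flags) / len(flags) - overall) for flags in by_cluster.values()]
    return sum(gaps) / len(gaps)


# Eight made-up observations with a 25% overall signal share.
labels = ["s", "s", "b", "b", "b", "b", "b", "b"]
mixed = [0, 1, 0, 1, 0, 1, 0, 1]      # both clusters mirror the overall share
separated = [0, 0, 1, 1, 1, 1, 1, 1]  # cluster 0 is all signal, cluster 1 all background

uninformative = average_difference(mixed, labels)
informative = average_difference(separated, labels)
```

A clustering whose clusters all look like the overall data scores zero, while one that isolates signal from background scores higher, which is why a larger value means the cluster assignment carries more information.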


Logistic Regression

To measure the predictive power of the cluster assignments, we fit a logistic regression model. The following table shows the impact of adding k-means cluster assignments to the data as factor variables.

[Table: logistic regression AIC, BIC, and accuracy as cluster-assignment factor variables are added]

Adding factor variables continued to improve the model according to AIC, BIC, and accuracy. Perhaps most significant is the continued reduction in BIC, which penalizes each added variable and so favors smaller models. In the last row, despite the addition of 13 factor variables to the model (the cluster assignments for k=2 through k = 15), the BIC still improved!
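
To see why a falling BIC is notable, recall the standard definitions: AIC = 2p - 2 ln L and BIC = p ln(n) - 2 ln L, where p is the number of estimated parameters, n the number of observations, and L the maximized likelihood. The sketch below uses entirely hypothetical numbers, not values from our table.

```python
import math


def aic(log_likelihood, n_params):
    """Akaike information criterion: 2p - 2 ln L."""
    return 2 * n_params - 2 * log_likelihood


def bic(log_likelihood, n_params, n_obs):
    """Bayesian information criterion: p ln(n) - 2 ln L."""
    return n_params * math.log(n_obs) - 2 * log_likelihood


# Hypothetical numbers for illustration only.
n_obs = 100_000
base = {"log_likelihood": -50_000.0, "n_params": 20}
bigger = {"log_likelihood": -49_000.0, "n_params": 33}  # 13 extra parameters

bic_base = bic(base["log_likelihood"], base["n_params"], n_obs)
bic_bigger = bic(bigger["log_likelihood"], bigger["n_params"], n_obs)
```

With n = 100,000, each extra parameter costs about 11.5 BIC points versus 2 AIC points, so the BIC only falls when the added variables improve the fit by more than that penalty. Note that a factor variable with several levels contributes more than one parameter.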

Additionally, the cluster variables compare favorably even with many of the original variables. The third and fourth rows of the following table compare the logistic regression's performance with the factor variables for k = 2 through k = 8 added against the same model with all of the original primitive variables removed.

[Table: logistic regression performance with and without the primitive variables and the cluster factor variables]

Removing all primitive variables hurt the model, but not nearly as much as removing all factor variables did (compare the difference between rows 3 and 4 with the difference between rows 3 and 1)!

A similar analysis was performed for each of the other data splits. The following graph shows, for each split, how the average difference between the cluster s-b distributions and the overall s-b distribution varies with k.

[Figure: average difference from the overall s-b distribution versus k for each PRI_jet_num split]

Factor variables improved a logistic regression for each split of the data similarly to how they improved it for PRI_jet_num = 0.  From this insight, we can deduce that there are natural clusters of Ss and Bs in the data, and that using proximity to cluster centers from the training set could be powerful in predicting Ss and Bs in the test set.
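
Assigning a test observation to a cluster by proximity could look like the following sketch. It assumes the cluster centers were learned on the training set and that test features are scaled the same way as the training features; the centers and points shown are made up.

```python
import math


def nearest_center(point, centers):
    """Index of the center closest to `point` in Euclidean distance."""
    distances = [math.dist(point, c) for c in centers]
    return distances.index(min(distances))


# Hypothetical centers from a k = 3 clustering of two training features.
centers = [(0.0, 0.0), (5.0, 5.0), (10.0, 0.0)]
test_points = [(0.5, -0.2), (6.0, 4.0), (9.0, 1.0)]

test_clusters = [nearest_center(p, centers) for p in test_points]
```

The resulting cluster ids can then be attached to the test rows as factor variables, one per value of k, exactly as was done for the training rows.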

Finding the Signal

After demonstrating the power of clustering our data, we turned our attention to finding the model that best combines our cluster information with the variables of the original dataset to maximize predictive accuracy. First, we sought to validate our decision to split the data by running more powerful algorithms on both the split and un-split data.


We narrowed our focus to two algorithms: gbm and xgboost. We fit gbm to each of the four data frames from our training set, then generated predictions on the training set and checked them against the known outcomes.
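
The split-model idea, one model per data frame with each row scored by the model for its own split, can be sketched independently of the learner. In this illustration a trivial majority-class learner stands in for gbm, and all function names and data are hypothetical.

```python
from collections import Counter, defaultdict


def fit_majority(rows, labels):
    """Stand-in learner: always predicts the most common training label."""
    most_common = Counter(labels).most_common(1)[0][0]
    return lambda row: most_common


def fit_per_split(rows, labels, key="PRI_jet_num", fit=fit_majority):
    """Train one model per distinct value of `key`."""
    grouped = defaultdict(lambda: ([], []))
    for row, y in zip(rows, labels):
        grouped[row[key]][0].append(row)
        grouped[row[key]][1].append(y)
    return {value: fit(rs, ys) for value, (rs, ys) in grouped.items()}


def predict_per_split(models, rows, key="PRI_jet_num"):
    """Send each row to the model trained on its own split."""
    return [models[row[key]](row) for row in rows]


# Made-up training data: the 0-jet split is mostly "b", the 1-jet split all "s".
train_rows = [{"PRI_jet_num": j} for j in (0, 0, 0, 1, 1)]
train_labels = ["b", "b", "s", "s", "s"]

models = fit_per_split(train_rows, train_labels)
preds = predict_per_split(models, [{"PRI_jet_num": 1}, {"PRI_jet_num": 0}])
```

Any learner that takes rows and labels and returns a prediction function can be passed as `fit`, so the same routing works for both the split and un-split comparisons.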


Finally, we generated predictions for the test data and uploaded them to Kaggle. We applied the same method to the complete data frame and uploaded those results to Kaggle as well. With the single data frame, we scored 1.16 and placed 1639th overall. With the split data, we scored 2.02, a substantial improvement over the single data frame approach.
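
For context on what these scores measure: the Kaggle Higgs challenge ranked submissions by approximate median significance (AMS), so the figures quoted here are assumed to be AMS values. A minimal sketch of the metric follows, where s and b are the weighted sums of correctly and incorrectly selected signal events and the regularization term of 10 follows the challenge definition.

```python
import math


def ams(s, b, b_reg=10.0):
    """Approximate median significance.

    s: weighted sum of true signal events predicted as signal
    b: weighted sum of background events predicted as signal
    """
    return math.sqrt(2.0 * ((s + b + b_reg) * math.log(1.0 + s / (b + b_reg)) - s))


# Made-up weighted counts: selecting more true signal at the same
# background level raises the score.
low = ams(10.0, 100.0)
high = ams(20.0, 100.0)
```

Because the metric depends on event weights rather than a simple count of correct predictions, a model can gain a lot of AMS from a modest change in which events it selects.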


Considering the tremendous success that other Kaggle teams have had with XGBoost, we wanted to test our method using that technique. Since the data was already split, we only had to convert it to matrices in order to train the model.

Next, we needed to choose a good tree depth and number of rounds for XGBoost. We wanted to ensure that our model was comprehensive enough to capture the effect of splitting the data, so we used the parameters from another successful Kaggle submission (see blog here). The author of that post used a depth of 9 and 3000 rounds. This required fairly substantial computing power, and each run took around 45 minutes.
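
A configuration along these lines might look like the sketch below. Only the depth of 9 and the 3000 rounds come from the text; the objective and the use of the Python xgboost package are assumptions made for illustration, and the function name is hypothetical.

```python
# Settings quoted in the text.
MAX_DEPTH = 9
NUM_ROUNDS = 3000

params = {
    "max_depth": MAX_DEPTH,
    "objective": "binary:logistic",  # assumption: signal vs. background as a 0/1 target
}


def train_one_split(features, targets, params=params, num_rounds=NUM_ROUNDS):
    """Train a booster on one data split.

    features: 2-D numeric array for the split; targets: 0/1 labels.
    """
    # Imported here so the settings above can be inspected without xgboost installed.
    import xgboost as xgb

    dtrain = xgb.DMatrix(features, label=targets)  # the matrix conversion step
    return xgb.train(params, dtrain, num_boost_round=num_rounds)
```

Calling `train_one_split` once per PRI_jet_num data frame gives the split model, and calling it once on the full data gives the non-split model. With this depth and round count, expect long run times.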



We then ran the same model on the complete data frame and submitted both sets of results to Kaggle. The non-split XGBoost model scored 2.36 and finished in 1371st place. The split model scored 2.83 and finished in 1184th place, a clear advantage over the non-split model.

Next Steps

Our most important next step is to assign cluster classifications to the test set and pass them into these more powerful algorithms (gbm, XGBoost) in order to increase predictive performance.


Cluster analysis of the training data showed that the data does naturally cluster into Ss and Bs with increasing values of K, and that using these clusters as variables can significantly increase the accuracy of simple predictive models.

With regard to our more powerful predictive models, our experiment showed that splitting the data frame and modeling each split separately gave a clear advantage with both the gbm and xgboost algorithms. Since a good number of successful Kaggle submissions used one of these algorithms, it's possible that splitting their data frames could further improve their scores.

About Authors

William Bartlett

Will Bartlett is a History of Science and Medicine Major from Yale University who recently took a leave of absence from medical school to explore data science. As an undergraduate, he studied the role of data in medicine...
Christian Holmes

Christian Holmes is a graduate of Middlebury College with a B.A. in both Economics and Chemistry. Upon graduating, he spent two years as a data analyst at an advertising technology startup, where he became interested in predictive analytics....