Higgs Boson Machine Learning Challenge

Contributed by (Emma) Jielei Zhu, Le Wei, Shuheng Li, Shuye Han and Yunrou Gong. They are currently in the NYC Data Science Academy 12-week full time Data Science Bootcamp program taking place from July 5th to September 23rd, 2016. This post is based on their fourth project - Machine Learning (due on the 8th week of the program). 


  1.  Introduction
    • 1.1 Background Review
    • 1.2 Project Objective
  2. Methodology
  3. Pre-processing Data
    • 3.1 Subset Data by number of jets
    • 3.2 Imputation Missing Value
  4. Data Description 
    • Exploratory Data Analysis
    • Principle Component Analysis
  5. Machine Learning Approaches
    • 5.1 Random Forests
    • 5.2 Gradient Boosting Machine
    • 5.3 XGBoost
    • 5.4 Reduced Features
  6. Conclusion 

1. Introduction

1.1 Background Review

The discovery of Higgs Boson is crucial, as is the final ingredient of the Standard Model of particle physics, a field filled with physicists dedicated to reducing our complicated universe to its most basic building blocks. The discovery of Higgs Boson was announced on July 4, 2012 and was acknowledged by the 2013 Nobel prize in physics. However, discovery is only the very first step. We still need to understand how it behaves. As it turns out, it is quite difficult to measure the new particle's characteristics and determine if it fits the current model of nature. A key finding from ATLAS experiment is that the Higgs boson decays into tau tau channel, but this decay is a small signal buried in background noise.

1.2 Project Objective

The goal of this Machine Learning Challenge is to improve on the discovery significance of the tau tau decay signal. Using simulated data with features characterizing events detected by ATLAS, our task is to classify events into "tau tau decay of a Higgs boson" or "background".

The data contains a training set with signal/background labels and with weights, a test set (without labels and weights), and a formal evaluation metric representing an approximation of the median significance (AMS) of the counting test. The performance of the classification result is thus evaluated by the following formula:


In sum, the goal of this project is to maximum the AMS score to improve the discovery significance of the tau tau decay signal of Higgs Boson from background noise.


2. Methodology

We first tried unsupervised learning methods to analyze the features of training and test dataset. Secondly, we used different supervised Machine Learning approaches to build models to classify events into either the tau tau decay signal of Higgs Boson or background noise.


3. Pre-Processing Data

3.1 Subset Data by number of jets

According to the description of the features in the technical document provided by the host of this challenge,  the "PRI jet num" feature  is the number of jets taking on integer values of 0,1,2 and 3. This feature is very unique in that it directly affects missingness in other features (e.g. "DER delta eta jet jet", "DER mass jet jet",  and "DER prod eta jet jet" are undefined if the number of jet is 0 or 1). In light of this characteristic of the data, we broke the training dataset into 4 subsets according to the jet number column(i.e. 0, 1, 2 or 3) and then examined the missingness patterns for each subset (below).

missing value heat map jet0

missing value heat map train_jet1

missing value train_jer2missing value of train_jet3


From the figures above, we can see that for the 'jet=0' subset, there are 11 columns containing missing values with 10 of them being entirely missing. For the 'jet=1' subset, there are 8 columns containing missing values with 7 of them being entirely missing. Lastly, for the 'jet=2' and 'jet=3' subsets,   they have the same pattern of missingness: only the first variable "DER_mass_MMC" contains missing values.

The reason why some of the columns are entirely missing is because the values of these columns are undefined for certain jet number groups. Therefore we deemed it reasonable to drop these columns that are entirely missing. Observing that 'jet=2' and 'jet=3' subsets have the same pattern of missingness, we combined these two subsets, resulting a total of 3 subset, which were 'jet=0', 'jet=1', 'jet=2/3'.

After removing the entire missing value columns, there is only one variable––"DER_mass_MMC"––that still contains missing values. We conducted a clustering analysis on each of the subsets to check whether we should separate the subsets into two groups based on whether the value of "DER_mass_MMC" is missing or not. The result showed out that two groups were not significantly different from each other, suggesting they do not need to treated separately.

Finally, the last step we did with the number of jets feature was to separate the test data into 3 subsets in the same way as we separated the training data.

3.2  Imputation of missing values

After subsetting the training dataset into 3 subsets, only the "DER_mass_MMC" feature still contains missing values (26%, 10% and 6%, respectively). Because the missing percentages are reasonably low, we thought of dropping these observations completely. However, we saw the test dataset also contained observations with missing "DER_mass_MMC" feature, and we can't drop any observation in the test dataset, obviously. So we kept the observations with missing values in the training set, and ended up imputing those values.

To impute the missing values in the "DER_mass_MMC" column, we tried a total of 4 imputation methods:

  1. PMM(predictive mean matching)
  2. Random Sampling
  3. KNN (K nearest neighbors) with varying K's (total of 6)
  4. Random Forest

To find the best imputation method, we used observations in the training set that had no missing values and took out some of the "DER_mass_MMC" values randomly. We then had each method impute these "artificial missing values". Using cross validation with the "ground truth" labels, Random Forest gave the best result. Thus, we used Random Forest to impute the "DER_mass_MMC" values for both the training and test set (separately for training and test and separately for each subset).


4.Data Description

4.1 Graphical Data Analysis

To understand each feature of this dataset, we visualized univariate distributions for each feature. We found that the features whose names end with 'phi' have a uniform pattern as shown from the graph above (plot shown from 'PRI_tai_phi'). The distributions for each class ('signal' and 'background') are extremely alike––nearly completely on top of each other. What this means is that the univariate distribution of the variables whose name ends with 'phi' do not contain much information to help distinguish 'signal' from 'background noise'.

From the bot plot of  'PRI_tai_phi' (see below), we can further understand that the 'phi' type variable is an angle basically, range from -180 degree(-3.14) to 180 degree(3.14) . The distribution is perfectly symmetric and no outliers and no difference by the labels.

However, the identical distributions should never be the reason for excluding these 'phi' features from applying to the machine learning models later on, because they may potentially has great influence for classifying the signal when they combine with other features.

Another important finding from the correlation matrix graphs in each of the subsets is that some of the features are strongly positive correlated to other variables, which may incur multicollinearity problems,
especially for conducting logistic regression models to classify the labels. Thus, we will not perform logistic regression in the following modeling part.

Additionally, 99% of the values of  'DER_pt_tot' are the same as of 'DER_pt_h' in subset data of jet number 0 , which indicates that for further modeling part, we could remove one of the features.In our reduced model part, we dropped the variable 'DER_pt_tot'.correlation_jet0




4.2 Principle Component Analysis 

From the Principle Component Analysis(PCA) scree plot for 'jet=0' subset, we found that the first dimension explains about 22.1% of total variance, the second dimension explains about 14.3%, and the third dimension explains about 14,1%. This means the first 3 dimensions together explains only about 50% of the total variance, which is quite low. Additionally, we don't see any sharp drop off in the percentage of variance explained from this scree plot, suggesting no natural cut off point in keeping certain dimensions and discarding others.


Looking at the scree plot for 'jet=1' subset, we see a very similar pattern––low variance explained for the first few dimensions and no sharp drop off for the later dimensions.


Again, the same pattern for the 'jet=2/3' subset.


From the variables factor map-PCA for train_jet0, we can see that DER_sum_pt, Der_mass_MMC, Der_mass_vis, and PRI_lep_pt contribute almost 80% of their variance on the PCA 1 and also contributed some variance on dimension 2, but not as much as the dimension 1. For dimension 2, we can see that Der_mass_transverse_met_lep contribute almost 80% of their variance on dimension 2, and also some portion of variance on Dimension 1.

From the Variables factor map- PCA for train_jet1, the Der_sum_p contribute a lot of variance on Dim1. For dimension 2, Der_mass_MMC and Der_mass_vis contribute a lot on dim 2, and also contribute part of variance on Dimension 1.vfm-pca_jet1


From the variables Factors map-PCA for train_jet 2.3. The Der_mass_jet_jet and DER deltaeta_jet_jet have large portion of variance contributed to dimension one. The variable Der_pt_h, PRi-jet-loading_R have large variance contributed on Dim2.



From the Exploratory Data Analysis on the previous three PCA, we found out that DER_mass_MMC contributed variance a lot on both train_jet0 and train_jet1, so we may assume that this variable has large impact on the total data. But we may use later test to see if this factor is truly important.


5. Machine Learning Approaches

5.1 Random Forest

We tried Random Forest as our first algorithm. We wanted to set a benchmark score and look at which variables Random Forest thinks are important and which are not. Additionally, we wanted to compare and contrast the list of variables PCA thinks are important against the list Random Forest thinks are important. We believed that this analysis would not only let us understand the variables better, it would also help us eliminate some unnecessary variables that contribute little to the AMS score. There were a few variables that we specifically paid attention to, including “DER_mass_lep”—the only variable that contained missing values after segregating data by jet number, and variables that contained the term ’phi’— variables that explain little variance according to PCA.

Our entire modeling process with Random Forest came in 3 stages, with increasing AMS score for each stage (3.28 -> 3.42)

In the first stage, we fitted a Random Forest model for each of the 3 subsets using the ‘rf’ method in the caret package. The best models were selected based on absolute AMS scores, which happened when ‘mtry’ was 20 for each of the subsets (20 was the maximum number tested because . What that means is that the best models all wanted the maximum number of variables for consideration at each splitting point of the tree. The models looked fine, but when we plotted the Receiver-Operating Curve (ROC), we saw the predictions were 100% correct on the training set. This is a clear sign that we were probably overfitting our training data.

To solve, or at least alleviate, the overfitting problem, instead of choosing the models that outputted the highest AMS score, we went after the simplest models that were within 1 standard error of the best empirical models, which brought us to stage 2 of fitting Random Forest models. Using this new selection metric, we picked models that either had ‘mtry’ equals 1 or ‘mtry’ equals 2. When we plotted the ROC again, sadly, we still see a perfect classification rate.

We first fitted a Random Forest model with varying 'mtry' values. The ones tested initially included 1, 2, 5, 20. Out of these values, 20 gave the best AMS scores on each subset, although with marginal differences. We visualized the performance of the best models by plotting ROC and calculating the area under the curve (AUC) score. What we found was a plot that claimed the model was able to predict the training observations perfectly.


AMS = 3.28



How did we find the threshold ?


AMS = 3.42

mtry for 'jet=0' : 1

mtry for 'jet=1' : 3

mtry for 'jet=2/3' : 3


5.2 Gradient Boosting Machine



AMS = 3.35


5.3 XGBoost



ROC Curve for 'jet=2/3' subset


AMS = 3.48


 5.4 Reduced Features

feature reduced

 6. Conclusion





The following are the code for this project (all written in R)

About Authors

Yunrou Gong

Yunrou Gong worked as a Business Analyst for Sanity Lighting Co., a LED Manufacturer. Through her work as an analyst, she accumulated her interest in data-driven approach of problem solving. She holds a M.S. in Operational Research, specializing...
View all posts by Yunrou Gong >

Le Wei

Le is a data scientist enthusiast, he brought his passion for data science to this bootcamp. He was majored in marketing when he was reading bachelor's degree. He learned from college that how to make the right decision...
View all posts by Le Wei >

Jielei (Emma) Zhu

Emma (Jielei) Zhu graduated from New York University in May 2016 with a B.A. in Computer Science and Psychology and a minor in Mathematics. In school, Emma was able to explore her interests to the fullest by taking...
View all posts by Jielei (Emma) Zhu >

Shuheng Li

With the intention of becoming a great Data Scientist, Shuheng is an out-box thinker and self-motivator who focuses on Machine Learning and Statistics, and seeks for challenging projects and competitions that push his skills to an advanced level....
View all posts by Shuheng Li >

Related Articles

Leave a Comment

No comments found.

View Posts by Categories

Our Recent Popular Posts

View Posts by Tags

#python #trainwithnycdsa 2019 2020 Revenue 3-points agriculture air quality airbnb airline alcohol Alex Baransky algorithm alumni Alumni Interview Alumni Reviews Alumni Spotlight alumni story Alumnus ames dataset ames housing dataset apartment rent API Application artist aws bank loans beautiful soup Best Bootcamp Best Data Science 2019 Best Data Science Bootcamp Best Data Science Bootcamp 2020 Best Ranked Big Data Book Launch Book-Signing bootcamp Bootcamp Alumni Bootcamp Prep boston safety Bundles cake recipe California Cancer Research capstone car price Career Career Day citibike classic cars classpass clustering Coding Course Demo Course Report covid 19 credit credit card crime frequency crops D3.js data data analysis Data Analyst data analytics data for tripadvisor reviews data science Data Science Academy Data Science Bootcamp Data science jobs Data Science Reviews Data Scientist Data Scientist Jobs data visualization database Deep Learning Demo Day Discount disney dplyr drug data e-commerce economy employee employee burnout employer networking environment feature engineering Finance Financial Data Science fitness studio Flask flight delay gbm Get Hired ggplot2 googleVis Hadoop hallmark holiday movie happiness healthcare frauds higgs boson Hiring hiring partner events Hiring Partners hotels housing housing data housing predictions housing price hy-vee Income Industry Experts Injuries Instructor Blog Instructor Interview insurance italki Job Job Placement Jobs Jon Krohn JP Morgan Chase Kaggle Kickstarter las vegas airport lasso regression Lead Data Scienctist Lead Data Scientist leaflet league linear regression Logistic Regression machine learning Maps market matplotlib Medical Research Meet the team meetup methal health miami beach movie music Napoli NBA netflix Networking neural network Neural networks New Courses NHL nlp NYC NYC Data Science nyc data science academy NYC Open Data nyc property NYCDSA NYCDSA Alumni Online Online Bootcamp Online Training Open Data painter pandas Part-time performance phoenix pollutants Portfolio Development precision measurement prediction Prework Programming public safety PwC python Python Data Analysis python machine learning python scrapy python web scraping python webscraping Python Workshop R R Data Analysis R language R Programming R Shiny r studio R Visualization R Workshop R-bloggers random forest Ranking recommendation recommendation system regression Remote remote data science bootcamp Scrapy scrapy visualization seaborn seafood type Selenium sentiment analysis sentiment classification Shiny Shiny Dashboard Spark Special Special Summer Sports statistics streaming Student Interview Student Showcase SVM Switchup Tableau teachers team team performance TensorFlow Testimonial tf-idf Top Data Science Bootcamp Top manufacturing companies Transfers tweets twitter videos visualization wallstreet wallstreetbets web scraping Weekend Course What to expect whiskey whiskeyadvocate wildfire word cloud word2vec XGBoost yelp youtube trending ZORI