Higgs Boson Kaggle Machine Learning Competition

Contributed by Wanda Wang, Rob Castellano, and Ho Fai Wong. They are currently in the NYC Data Science Academy 12-week full-time Data Science Bootcamp program taking place from April 11th to July 1st, 2016. This post is based on their fourth class project, Machine Learning (due in the 8th week of the program).

I. Introduction

The ATLAS experiment and the CMS experiment recently claimed the discovery of the Higgs boson, a particle theorized almost 50 years ago to have the role of giving mass to other elementary particles. The ATLAS experiment at the Large Hadron Collider (LHC) at CERN provided simulated data used by physicists in a search for the Higgs boson for the Higgs Boson Machine Learning Challenge organized on Kaggle.

The immediate goal of the Kaggle competition was to explore the potential of advanced classification methods to improve the statistical significance of the experiment. Specifically, the Higgs boson has many different processes through which it can decay, and when it decays, it produces other particles via specific channels. The challenge consisted of developing a model to classify events corresponding to the Higgs to tau tau decay channel (signal) versus other particles and processes (background).

The Higgs Boson Kaggle dataset was used in this analysis. The data exploration and machine learning were performed in R. All the code is available here.

II. Exploratory Data Analysis


We carried out exploratory data analysis and quickly noticed that 11 of the 30 features contained missing data (initially recorded as -999). Upon closer inspection, 10 of the 11 features were related to jets (pseudo-particles) and specifically to the number of jets for a given event (recorded in the PRI_jet_num factor variable):

  • 0 jets: features related to jets were all missing data, understandably
  • 1 jet: features related to primary jets were no longer missing data, but those related to 2 or more jets were still missing
  • 2 or more jets: all features related to jets contained data

In other words, the jet number can be treated as a factor explaining the missingness in these 10 features. The 11th feature with missing data, DER_mass_MMC, is unrelated to the number of jets; its values are missing when the topology of the event is too far from the expected topology (not an error in measurement). The figure below shows the proportion of missingness as a function of jet number.


Given these observations on the missing data, we decided not to perform advanced imputation moving forward.
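The recoding step can be sketched as follows. The original analysis was done in R; this is a minimal Python sketch on a toy data frame, assuming the dataset's convention of -999.0 as the structural-missingness sentinel (the column names follow the Kaggle feature names):

```python
import numpy as np
import pandas as pd

# Toy frame mimicking the Kaggle encoding: -999.0 marks values that are
# undefined by construction (jet features for events with too few jets).
df = pd.DataFrame({
    "PRI_jet_num":           [0, 1, 2, 3],
    "PRI_jet_leading_pt":    [-999.0, 55.1, 80.2, 120.5],   # defined for >= 1 jet
    "PRI_jet_subleading_pt": [-999.0, -999.0, 40.3, 60.7],  # defined for >= 2 jets
})

# Recode the sentinel as a proper missing value so summaries and models
# treat it as NA rather than as a huge negative number.
df = df.replace(-999.0, np.nan)

# Missingness per feature, grouped by jet count, confirms the pattern is
# structural (missing by design) rather than random.
missing_by_jets = df.set_index("PRI_jet_num").isna().groupby(level=0).mean()
```

Because the missingness is fully explained by PRI_jet_num, imputing these values would invent physics that never happened, which is why we left them alone.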

Principal Component Analysis

We then carried out Principal Component Analysis (PCA) to get a better understanding of the relationships among the 30 features. When the features are plotted along the first principal component versus the second principal component (figure below), it is clear that there are several clusters of features. These clusters indicate that collinearity is present among many of the features.
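The idea behind the loading plot can be sketched in a few lines. This is an illustrative Python example on synthetic data (the original analysis used R): two latent factors generate two correlated groups of observed features, and those groups load together on the leading components, exactly the clustering pattern described above.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Two latent factors drive two clusters of three observed features each,
# a stand-in for the collinear groups among the 30 real features.
latent = rng.normal(size=(500, 2))
X = np.hstack([
    latent[:, [0]] + 0.1 * rng.normal(size=(500, 3)),  # cluster 1
    latent[:, [1]] + 0.1 * rng.normal(size=(500, 3)),  # cluster 2
])

pca = PCA(n_components=2).fit(X)
# Rows of `loadings` are features; features in the same cluster point in
# the same direction in the PC1-vs-PC2 loading plot.
loadings = pca.components_.T  # shape (n_features, 2)
```

When collinear clusters like these exist, the first few components capture nearly all of the variance, which is what the loading plot makes visible.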

One relationship that particularly stood out across the first 9 principal components was between the derived mass and the label (whether an event is signal or background), indicating that the two are very strongly related.


Importance of Mass

Looking deeper into the mass features, mass seems to be a good predictor of the presence of the Higgs boson, since the derived mass of the Higgs boson differs from that of the other (Z and W) bosons and subatomic particles. Derived mass was the differentiating feature that the CERN scientists used to determine whether a particle was a Higgs boson or not. The upper figure below is taken from the Science paper in which the CERN scientists explain their findings. The small blue area centered at 125 GeV represents signal from the Higgs boson, while the red area centered at 98 GeV represents background from Z bosons. In the lower figure below, we reproduce this graph using the simulated dataset from the Kaggle competition. The simulated dataset has a higher signal-to-background ratio than found experimentally to help with model fitting; the dataset is weighted to take this discrepancy into account.



Correlation Matrix

Plotting a correlation matrix indicates strong covariance among subsets of the features.  This was taken into consideration when we explored using predictive models in the next section.


Logistic Regression

We performed an initial logistic regression to further explore the dataset and investigate variable importance.

Initial Feature Engineering

Fourteen features with long-tailed distributions, e.g. DER_mass_jet_jet (the invariant mass of the two jets), were log transformed to reduce their positive skew toward smaller values. This kind of feature engineering, which produces more normal distributions, can help with model fitting. Although these transformations were not used in our final models, we would consider applying them in the future.
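The effect of the transform is easy to demonstrate. A minimal Python sketch on a synthetic long-tailed variable (the real analysis was in R, and the skewness helper here is our own, not from the dataset):

```python
import numpy as np

rng = np.random.default_rng(1)
# Synthetic long-tailed feature, playing the role of an invariant mass
# such as DER_mass_jet_jet.
mass = rng.lognormal(mean=4.0, sigma=1.0, size=10_000)

# log1p handles zeros safely; the transform pulls in the long right tail.
log_mass = np.log1p(mass)

def skewness(x):
    """Sample skewness: third standardized moment."""
    x = np.asarray(x)
    return float(np.mean((x - x.mean()) ** 3) / x.std() ** 3)
```

After the transform the skewness drops to near zero, which is the "more normal" shape that tree ensembles tolerate but that linear models like logistic regression benefit from directly.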



Variable Importance

The bar charts below illustrate the variable importance obtained from logistic regression for the saturated model (30 variables) and the stepwise BIC model (18 variables). Notably, 12 DER-prefixed variables were kept by the stepwise model, compared to only 6 PRI variables. Derived variables thus have more descriptive value.



Choice of AUC as model fit metric

In order to iteratively tune, improve, and combine various machine learning algorithms, a stable and reliable model fit metric was crucial. The Kaggle competition used the approximate median discovery significance (AMS) score to assess model performance, but after online research and a few simulations of our own, the AMS metric did not appear very stable. We opted instead for the area under the receiver operating characteristic curve (AUC), which balances maximizing the true positive rate against minimizing the false positive rate and, unlike AMS, is a stable, smooth, continuous function.
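The metric we tuned on can be computed in one call. A hedged Python sketch on simulated labels and scores (illustrative only; names and data are made up):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(2)
# Simulated labels (1 = signal) and classifier scores: signal events
# score higher on average, as a reasonable model's output would.
y = rng.integers(0, 2, size=2000)
scores = y + rng.normal(scale=1.0, size=2000)

# AUC equals the probability that a random signal event outranks a
# random background event; it needs no threshold, which is what makes
# it a stable quantity to tune against, unlike AMS.
auc = roc_auc_score(y, scores)
```

Because AUC is threshold-free, we could tune models on it first and only afterwards pick the classification threshold that the AMS submission required.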


Logistic Regression Analysis

The logistic regression led to the following results:

  • Saturated model: R-squared = 0.20227
  • Stepwise BIC model: R-squared = 0.20223

Comparing the deviance of the stepwise model to that of the saturated model, the p-value for the overall test of deviance is > 0.65 (high), so the simpler stepwise model fits essentially as well as the saturated one. The AUC plots are also not very different from one another.


III. Models

We decided to use 3 machine learning algorithms to build an ensemble classification model:

  • Random Forest
  • Gradient Boosting Model
  • XGBoost

Random Forest

For the Random Forest (RF) model, we used 5-fold cross-validation to tune the number of features randomly selected at each split (mtry). Random forests with a wide mtry grid (1, 2, 3, 6, 9) were first fit on a small subset (20%) of the training data; the best model had an mtry of 6. Random forests with a narrower grid (4, 5, 6, 7, 8) centered on that value were then fit on a larger subset (80%) of the training data. The best RF model in this case had an mtry of 5, and this model was used for prediction.
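The two-stage grid search above can be sketched in one stage for brevity. We used R's caret in the actual analysis; this is an equivalent Python sketch on synthetic data, where sklearn's max_features plays the role of mtry:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for a subset of the training data.
X, y = make_classification(n_samples=500, n_features=9, n_informative=5,
                           random_state=0)

# 5-fold CV over the coarse grid (mtry: 1, 2, 3, 6, 9), scored by AUC;
# in the real analysis a narrower grid around the winner was run next.
grid = GridSearchCV(
    RandomForestClassifier(n_estimators=100, random_state=0),
    param_grid={"max_features": [1, 2, 3, 6, 9]},
    cv=5, scoring="roc_auc",
)
grid.fit(X, y)
best_mtry = grid.best_params_["max_features"]
```

Starting coarse on a small subset and refining on a larger one keeps the tuning cost manageable on a 250,000-row training set.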

The training prediction accuracy was estimated by using the RF model to predict on the 20% of the training data not used for fitting. The AUC on this held-out data was 0.9071, which we considered satisfactory. The model was then applied to the test data and submitted to the Kaggle public leaderboard, where the prediction received an AMS score of 2.57949 and ranked 1,311.

The variable importance graph (below) for the RF model shows the top 20 most important features in the model. The top 3 most important features are related to mass, which is in agreement with the science behind Higgs Boson detection.


Gradient Boosting Model

For the Gradient Boosting Model (GBM), we performed a 5-fold cross-validation using a tuning grid, resulting in the following parameters:

  • shrinkage i.e. learning rate: 0.1
  • interaction.depth i.e. depth of variable interactions: 3
  • n.trees i.e. number of trees: 150
  • n.minobsinnode i.e. minimum number of observations in a terminal node: 10

In summary:

  • AUC on training data = .855
  • Kaggle rank = 1394
  • AMS = 2.30069

The GBM model performed slightly worse than Random Forest, for the chosen parameter values.

The variable importance graph for the GBM model also highlights the mass-related features, as well as the derived momentum features.



XGBoost

We then used XGBoost to try to improve on the results achieved thus far. XGBoost is a fast gradient boosting library implemented in C++ by Tianqi Chen. It supports some parallel computation and more tuning parameters, and it is generally faster and performs better than gbm, thanks both to coding efficiency and to the fact that XGBoost is not a completely greedy algorithm (unlike gbm).

We also performed a 5-fold cross-validation using a tuning grid, resulting in the following parameters:

  • nrounds i.e. number of trees: 200
  • max_depth: 5
  • colsample_bytree i.e. percent of parameters used at each split: 0.85
  • eta i.e. learning rate: 0.2

In summary:

  • AUC on training data = .9254
  • Kaggle rank = 1340
  • AMS = 2.49958

The XGBoost model performed better than GBM but still not as well as Random Forest, for the chosen parameter values.

The variable importance graph for the XGBoost model also highlights the importance of the mass related features.


The XGBoost workflow proceeded step by step:

  • Read in the data, perform data munging, and create train/test datasets
  • Tune the XGBoost parameters by iterating until the model is stable
  • Use predictions on the training dataset to plot the ROC curve and determine the best threshold
  • Predict on the test dataset and create the submission file for Kaggle


Finally, we ensembled our 3 models by majority vote in an attempt to further improve the overall model accuracy.
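The vote itself is simple. A minimal Python sketch with hypothetical per-event predictions (1 = signal) standing in for the outputs of our three tuned models:

```python
import numpy as np

# Hypothetical class predictions from the three models on five events.
rf_pred  = np.array([1, 0, 1, 1, 0])
gbm_pred = np.array([1, 0, 0, 1, 0])
xgb_pred = np.array([0, 0, 1, 1, 1])

# Majority vote: an event is labeled signal when at least 2 of the
# 3 models agree.
votes = rf_pred + gbm_pred + xgb_pred
ensemble_pred = (votes >= 2).astype(int)
```

With three models, a majority always exists, so no tie-breaking rule is needed; the vote smooths out the idiosyncratic errors of any single model.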

We ultimately achieved a slight improvement on the Random Forest model:

  • Kaggle rank = 1309
  • AMS = 2.58510

IV. Room for Improvement

As with any modeling, there is always room for improvement. In particular, feature engineering, i.e. including additional variables, could improve our model's predictive accuracy:

  • Introducing basic physics features, e.g. Cartesian coordinates of momentum variables
  • Advanced physics: e.g. CAKE variable
  • Better understand the physics of additional variables
  • Perform appropriate transformations on variables, such as the log transformations described above

In addition, using more machine learning algorithms, more sophisticated ensembling methods, and ensembling runs of the same model with different random seeds could further increase the discriminating accuracy of our final model.

In the meantime, if you have any comments, suggestions or insights, please comment below!

About Authors

Ho Fai Wong

With a diverse background in computer science and 9 years in Financial Services Technology Consulting, Ho Fai has been applying his analytical, problem-solving, relationship and team management skills at PwC, one of the Big Four consulting firms, focusing...

Wanda Wang

Wanda is excited about combining data science with compelling narratives to uncover new enterprise opportunities. With 5+ years of experience in the Investment Management field, including at both Citigroup and JPMorgan - Wanda thrives in demanding, client-driven environments...

Rob Castellano

Rob recently received his Ph.D. in Mathematics from Columbia. His training as a pure mathematician has given him strong quantitative skills and experience in using creative problem solving techniques. He has experience conveying abstract concepts to both experts...
