Forecasting the Higgs Boson Signal

Posted on Jun 8, 2016

Contributed by Denis Nguyen, Breton, and Ismael Jaime Cruz. They are currently in the NYC Data Science Academy 12-week full-time Data Science Bootcamp program taking place between April 11th and July 1st, 2016. This post is based on their fourth class project - Machine Learning (due in the 8th week of the program).


The Higgs Boson is an unstable subatomic particle that decays very quickly. Scientists study the decay products of particle collisions and work backward. To assist scientists in differentiating background noise from the signal, we apply several machine learning algorithms to better predict the Higgs Boson.


Exploratory Data Analysis

Like all data science projects, we began with some exploratory analysis of the variables. 

We first used a correlation plot to inspect the relationships between variables. As indicated by the dark blue and dark red points, there is high correlation among many of the variables. We noticed the PRI_jet variables, for example, have a lot of blue dots in relation to the variables DER_deltaeta_jet_jet, DER_mass_jet_jet, and DER_prodeta_jet_jet. This is expected since, according to the documentation for the challenge, the DER variables are derived quantities computed from the PRI, or primitive, quantities measured from the particles.


Next, we wanted to zoom in on the correlation plot and see some scatterplots of select variables.

Here, we looked at scatterplots of PRI_jet_leading_pt against the three variables PRI_jet_leading_eta, PRI_jet_leading_phi, and PRI_jet_all_pt. The orange circles represent events classified as signal, while the blue circles represent background. Some variable pairs appeared to have a linear relationship, such as PRI_jet_leading_pt and PRI_jet_all_pt, while others had no obvious form.


Looking at scatter plots amongst select DER variables, we saw that linear relationships were still present. Between DER_deltaeta_jet_jet and DER_prodeta_jet_jet, for example, there was a negative linear relationship.


Finally, we had a look at some scatter plots between a DER variable and a few PRI variables. We found it interesting to see the plot of a DER variable against the PRI variable from which it was calculated. For instance, the plot of DER_deltaeta_jet_jet against PRI_jet_subleading_eta had a somewhat v-shape, as the DER variable is derived from the absolute value of the difference between that primitive variable and another.




Classification Tree Approach

Classification tree models are great for descriptive purposes. They partition the data into regions that make the model's decision process easy to trace, although they tend to have lower prediction accuracy.

Our first basic tree without pruning produced an accuracy of 72% with 3,398 terminal nodes and an AMS score of 1.132. The right-hand chart displays the number of misclassified observations by terminal nodes, suggesting pruning to about five terminal nodes (where the errors flatten out as the terminal nodes increase). The pruned tree with five terminal nodes returned a lower AMS score of 1.038. After increasing the nodes to 10, the AMS score increased to 1.24.
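The AMS figures quoted throughout are the Approximate Median Significance, the challenge's official metric. A minimal implementation of the formula from the challenge documentation (with the regularization constant b_r = 10 that the competition uses):

```python
import math

def ams(s, b, b_r=10.0):
    """Approximate Median Significance, per the HiggsML challenge.

    s:   weighted count of true positives (signal correctly classified)
    b:   weighted count of false positives (background classified as signal)
    b_r: constant regularization term (10 in the challenge)
    """
    return math.sqrt(2.0 * ((s + b + b_r) * math.log(1.0 + s / (b + b_r)) - s))
```

With no correctly classified signal the score is 0; catching more signal at the same background level raises it, which is why thresholds are tuned against AMS rather than plain accuracy.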






Taking a look at the tree, we see the Derived Lep Eta Centrality variable at the very top, indicating its importance in determining whether an event is background noise or a Higgs Boson signal. Given a new observation, the model works as follows: beginning at the top, ask whether the value is less than 0.0005; if less, go left, and if greater, go right. Then move on to the next node, and so on until a terminal node is reached. This is very easy to interpret, but if higher predictive accuracy is the goal, it may not be the best option.
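The grow-then-prune trade-off can be sketched with scikit-learn on synthetic data (the dataset, labels, and leaf cap below are illustrative stand-ins, not the post's actual R workflow):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(4)
X = rng.normal(size=(1000, 5))
y = (X[:, 0] > 0).astype(int)
y[rng.random(1000) < 0.1] ^= 1  # label noise, so an unpruned tree overgrows

# An unpruned tree grows a large number of terminal nodes (3,398 in the post).
full = DecisionTreeClassifier(random_state=0).fit(X, y)

# Pruning caps the leaf count; the post settled on roughly 10 terminal nodes.
pruned = DecisionTreeClassifier(max_leaf_nodes=10, random_state=0).fit(X, y)
```

The pruned tree trades a little training fit for a model small enough to read split by split, which is exactly the descriptive use case discussed above.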


Random Forest Approach

For a more robust model with greater predictive power, we tried a random forest. A random forest averages a collection of trees, resulting in an ensemble with better predictions, although, unlike a single tree, you lose descriptive ability. We ran the model with all the variables; the variable importance plot is on the right-hand side. The model returned an AMS score of 1.781. Impressive: just by changing the model type, the AMS increased by over 0.5 points.







We then tried a model with only the top 14 variables, along with two indicator columns created to flag the missing values of DER_mass_transverse_met_lep and DER_deltaeta_jet_jet. We found that each of these columns individually reduced the errors the most, and together they worked well. Selecting four features at random at each split, growing 500 trees, and averaging the results gave an AMS score of 2.795 with an accuracy of 83.5%. This seems like a strong model.
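A sketch of this setup in scikit-learn, using synthetic data in place of the challenge set (the -999.0 missing-value code matches the real dataset, but the features and labels here are made up):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Synthetic stand-in for the top-14-variable data.
X = rng.normal(size=(1000, 14))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
X[rng.random(1000) < 0.3, 3] = -999.0  # inject the dataset's missing-value code

# Indicator column flagging missingness, as the post did for
# DER_mass_transverse_met_lep and DER_deltaeta_jet_jet.
missing_flag = (X[:, 3] == -999.0).astype(float).reshape(-1, 1)
X_aug = np.hstack([X, missing_flag])

# 500 trees with 4 features tried at each split, matching the post's settings.
rf = RandomForestClassifier(n_estimators=500, max_features=4, random_state=0)
rf.fit(X_aug, y)
```

The indicator columns let the forest treat "value is missing" as information in its own right instead of splitting on the arbitrary -999.0 sentinel.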





Reducing Dimensions

When people think about reducing dimensions, they may assume it means discarding information and ending up with a less precise model. This is not always the case; it can sometimes improve models. Advantages of reducing dimensions include lower time and storage requirements, easier data visualization, and the removal of multicollinearity, which can improve the model.

Because the dataset contains columns computed from other columns, we applied the least absolute shrinkage and selection operator (LASSO) to remove possibly related predictors and focus on the ones that would give us a clearer model. LASSO shrank many of the coefficients down to 0, leaving 12 variables that we used going forward.
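A sketch of LASSO-style selection via an L1-penalized logistic regression in scikit-learn (synthetic data; the correlated columns only mimic the DER/PRI relationship, and the penalty strength is illustrative):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)

# 5 base columns, 5 noisy copies of them (like DER variables computed from
# PRI variables), and 10 pure-noise columns.
X_base = rng.normal(size=(500, 5))
X = np.hstack([
    X_base,
    X_base + 0.1 * rng.normal(size=(500, 5)),
    rng.normal(size=(500, 10)),
])
y = (X_base[:, 0] - X_base[:, 1] > 0).astype(int)

# The L1 penalty shrinks coefficients of redundant predictors to exactly 0,
# which is how LASSO reduced the post's feature set to 12 variables.
lasso = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
lasso.fit(X, y)
n_selected = int(np.count_nonzero(lasso.coef_))
```

The surviving nonzero coefficients play the role of the 12 variables kept for the boosting models below.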



Gradient Boosting Model

By applying a Gradient Boosting Model (GBM) with tuned parameters, we were able to get an AUC score of 0.8525 at a threshold of 0.002, yielding an AMS score of 2.28471.


After refitting the model on the 12 variables returned by LASSO, we attained a better score: an AUC of 0.8525 at the same threshold of 0.002, giving an AMS score of 2.30291, higher than the previous model run with all the variables. This shows we obtained a better-predicting model by reducing the number of variables, likely because correlated predictors were removed.
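The GBM step can be sketched as follows (scikit-learn's GradientBoostingClassifier on synthetic data; the post's actual parameters and its 0.002 threshold are data-specific and not reproduced here):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(2)
X = rng.normal(size=(800, 12))  # stand-in for the 12 LASSO-selected variables
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=800) > 0).astype(int)

gbm = GradientBoostingClassifier(n_estimators=100, random_state=0).fit(X, y)
prob = gbm.predict_proba(X)[:, 1]
auc = roc_auc_score(y, prob)

# Events above the chosen probability threshold are labelled signal; the
# 0.5 here is a placeholder, tuned against AMS in the real workflow.
pred = (prob > 0.5).astype(int)
```

Decoupling the ranking quality (AUC) from the cut (threshold) is what lets the threshold be tuned separately to maximize AMS.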


Because the gain in AMS score was modest, we decided to move on to another machine learning method that might yield better results.


Support Vector Machines

Finally, turning to a more prediction-oriented model, we tried a support vector machine. Based on the earlier scatterplots, which showed considerable overlap between signal and background, we chose a radial kernel, since the problem did not look linearly separable.

The pros and cons of using this algorithm are as follows:

Pros:

  • Effective in high dimensional spaces
  • Flexible choice of kernel

Cons:

  • Loss of interpretability
  • Computationally inefficient when the dataset becomes too large


Because the model is computationally costly at this dataset size, we tweaked several parameters to keep the running time manageable. We trained it on 1% of the data and used 5-fold cross-validation to find the best estimates for cost and gamma. Next, we used the results from the random forest model to reduce the dataset to the 14 most significant variables. Finally, through trial and error, we found the best threshold for the model to be 0.7. Together, these choices led to an accuracy of 0.8056 and an AMS score of 2.72036.
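A sketch of the radial-kernel SVM with a 5-fold cross-validated grid over cost and gamma (synthetic, non-linearly-separable data standing in for the subsampled challenge set; the grid values are illustrative):

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(3)
# Small sample, mirroring the post's 1% subsample for tractability.
X = rng.normal(size=(300, 14))
# A circular boundary: not linearly separable, like the overlapping classes.
y = (X[:, 0] ** 2 + X[:, 1] ** 2 > 2.0).astype(int)

# 5-fold CV over cost (C) and gamma for a radial (RBF) kernel.
grid = GridSearchCV(
    SVC(kernel="rbf"),
    param_grid={"C": [0.1, 1, 10], "gamma": ["scale", 0.1, 1.0]},
    cv=5,
)
grid.fit(X, y)
best = grid.best_estimator_
```

Because SVM training scales poorly with sample size, tuning on a small subsample first, as the post did, keeps the grid search affordable.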


Considering all the models we ran for the Higgs Boson challenge, we recommend the random forest model, as it offers the best balance of predictive power and interpretability. It gave us the highest AMS score of 2.795 and could be trained on all the data. The support vector machine came second in AMS score, but we still recommend the random forest because it is less complex to train, has fewer parameters to tune, and is not as computationally expensive. If the main purpose is description, we recommend the basic classification tree with some pruning.

Next Steps

  • Apply some feature engineering to the dataset
  • Use ensembling methods to combine the different models that were used

About Authors

Denis Nguyen

With a background in biomedical engineering and health sciences, Denis has a passion for finding patterns and optimizing processes. He developed his interest for data analysis while doing research on the effects of childhood obesity on bone development...
View all posts by Denis Nguyen >


Breton

KB is a driven and determined Senior Analyst with nearly 15 years of proven data analytics expertise. Most recently focused on forecasting short-term and long-term global crude oil and product prices for PIRA Energy Group. Previously held a...
View all posts by Breton >

Ismael Jaime Cruz

Ismael’s roots are in finance and statistics. He has six years of experience in such areas as financial analysis, trading and portfolio management. He was part of the team that launched the very first exchange-traded-fund in the Philippines....
View all posts by Ismael Jaime Cruz >
