Higgs Boson Machine Learning Challenge

Yunrou Gong, Le Wei, Jielei (Emma) Zhu, Shuheng Li and Shuye Han
Posted on Aug 28, 2016

Contributed by Jielei (Emma) Zhu, Le Wei, Shuheng Li, Shuye Han and Yunrou Gong. They are currently attending the NYC Data Science Academy 12-week full-time Data Science Bootcamp taking place from July 5th to September 23rd, 2016. This post is based on their fourth project, Machine Learning, due in the 8th week of the program.

Content

  1.  Introduction
    • 1.1 Background Review
    • 1.2 Project Objective
  2. Methodology
  3. Pre-processing Data
    • 3.1 Subset Data by number of jets
    • 3.2 Imputation of Missing Values
  4. Data Description
    • 4.1 Graphical Data Analysis
    • 4.2 Principal Component Analysis
  5. Machine Learning Approaches
    • 5.1 Random Forests
    • 5.2 Gradient Boosting Machine
    • 5.3 XGBoost
    • 5.4 Reduced Features
  6. Conclusion 

1. Introduction

1.1 Background Review

The discovery of the Higgs boson is crucial, as it is the final ingredient of the Standard Model of particle physics, a field filled with physicists dedicated to reducing our complicated universe to its most basic building blocks. The discovery was announced on July 4, 2012 and was recognized by the 2013 Nobel Prize in Physics. However, discovery is only the very first step. We still need to understand how the particle behaves. As it turns out, it is quite difficult to measure the new particle's characteristics and determine whether it fits the current model of nature. A key finding from the ATLAS experiment is that the Higgs boson decays into the tau tau channel, but this decay is a small signal buried in background noise.

1.2 Project Objective

The goal of this Machine Learning Challenge is to improve on the discovery significance of the tau tau decay signal. Using simulated data with features characterizing events detected by ATLAS, our task is to classify events into "tau tau decay of a Higgs boson" or "background".

The data contains a training set with signal/background labels and with weights, a test set (without labels and weights), and a formal evaluation metric representing an approximation of the median significance (AMS) of the counting test. The performance of the classification result is thus evaluated by the following formula:

$$\mathrm{AMS} = \sqrt{2\left((s + b + b_{reg})\ln\left(1 + \frac{s}{b + b_{reg}}\right) - s\right)}$$

where $s$ and $b$ are the weighted sums of true-positive (signal) and false-positive (background) events in the selected region, and $b_{reg} = 10$ is a regularization term.

In sum, the goal of this project is to maximize the AMS score, improving the discovery significance of the tau tau decay signal of the Higgs boson over the background noise.
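For concreteness, here is a minimal R sketch of the AMS metric as defined above (function and variable names are ours):

```r
# AMS metric: s and b are the weighted sums of signal and background
# events in the selected (classified-as-signal) region; b_reg = 10.
ams <- function(s, b, b_reg = 10) {
  sqrt(2 * ((s + b + b_reg) * log(1 + s / (b + b_reg)) - s))
}

ams(s = 500, b = 10000)  # illustrative values, not taken from the data
```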

 

2. Methodology

We first applied unsupervised learning methods to analyze the features of the training and test datasets. We then used several supervised machine learning approaches to build models that classify events as either the tau tau decay signal of the Higgs boson or background noise.

 

3. Pre-Processing Data

3.1 Subset Data by number of jets

According to the description of the features in the technical documentation provided by the host of this challenge, the "PRI_jet_num" feature is the number of jets, taking integer values 0, 1, 2, and 3. This feature is unique in that it directly determines missingness in other features (e.g. "DER_deltaeta_jet_jet", "DER_mass_jet_jet", and "DER_prodeta_jet_jet" are undefined if the number of jets is 0 or 1). In light of this characteristic of the data, we broke the training dataset into 4 subsets according to the jet number column (i.e. 0, 1, 2, or 3) and then examined the missingness patterns of each subset (below).

[Figure: missing-value heat maps for the 'jet=0', 'jet=1', 'jet=2', and 'jet=3' training subsets]

 

From the figures above, we can see that for the 'jet=0' subset there are 11 columns containing missing values, 10 of them entirely missing. For the 'jet=1' subset there are 8 columns containing missing values, 7 of them entirely missing. Lastly, the 'jet=2' and 'jet=3' subsets share the same pattern of missingness: only the first variable, "DER_mass_MMC", contains missing values.

Some columns are entirely missing because their values are undefined for certain jet-number groups, so we deemed it reasonable to drop them. Observing that the 'jet=2' and 'jet=3' subsets have the same pattern of missingness, we combined these two subsets, resulting in a total of 3 subsets: 'jet=0', 'jet=1', and 'jet=2/3'.

After removing the entirely missing columns, only one variable, "DER_mass_MMC", still contains missing values. We ran a clustering analysis on each subset to check whether we should further split it into two groups based on whether "DER_mass_MMC" is missing. The two groups turned out not to be significantly different from each other, suggesting they do not need to be treated separately.

Finally, the last step with the number-of-jets feature was to separate the test data into 3 subsets in the same way as the training data; a sketch of the whole subsetting step follows.
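A minimal sketch of this step (the -999 missing-value code and column names come from the challenge data; the helper names are ours):

```r
# Split the data by jet multiplicity and, within each subset,
# drop columns whose values are entirely undefined.
prep <- function(path) {
  df <- read.csv(path)
  df[df == -999] <- NA   # the challenge codes undefined values as -999
  df
}

split_by_jets <- function(df) {
  subsets <- list(jet0  = df[df$PRI_jet_num == 0, ],
                  jet1  = df[df$PRI_jet_num == 1, ],
                  jet23 = df[df$PRI_jet_num >= 2, ])
  # within each subset, keep only columns with at least one observed value
  lapply(subsets, function(s) s[, colSums(!is.na(s)) > 0])
}

train_subsets <- split_by_jets(prep("training.csv"))
test_subsets  <- split_by_jets(prep("test.csv"))
train_jet0    <- train_subsets$jet0   # and likewise jet1, jet23
```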

3.2 Imputation of Missing Values

After subsetting the training dataset, only the "DER_mass_MMC" feature still contains missing values (26%, 10%, and 6% of observations in the three subsets, respectively). Because these percentages are reasonably low, we considered dropping the affected observations completely. However, the test dataset also contains observations with a missing "DER_mass_MMC" value, and we obviously cannot drop observations from the test dataset. So we kept the observations with missing values in the training set and imputed the values instead.

To impute the missing values in the "DER_mass_MMC" column, we tried a total of 4 imputation methods:

  1. PMM (predictive mean matching)
  2. Random Sampling
  3. KNN (K nearest neighbors) with varying K's (total of 6)
  4. Random Forest

To find the best imputation method, we took observations in the training set that had no missing values and randomly removed some of their "DER_mass_MMC" values. We then had each method impute these artificial missing values. Evaluated by cross validation against the ground-truth values, Random Forest gave the best result. Thus, we used Random Forest to impute the "DER_mass_MMC" values for both the training and test sets (separately for training and test, and separately for each subset).
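The post does not name the exact package used for the Random Forest imputation; a minimal sketch with the missForest package (one common choice) would look like this:

```r
# Impute the remaining DER_mass_MMC gaps with random forests, run
# separately on each subset; missForest iterates RF fits to convergence.
library(missForest)

impute_rf <- function(df) {
  num <- sapply(df, is.numeric)
  df[, num] <- missForest(df[, num])$ximp
  df
}

train_jet0 <- impute_rf(train_jet0)   # repeat for each train/test subset
```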

 

4. Data Description

4.1 Graphical Data Analysis

To understand each feature of this dataset, we visualized the univariate distribution of every feature. We found that the features whose names end in 'phi' follow a uniform distribution (illustrated with 'PRI_tau_phi'). The distributions for the two classes ('signal' and 'background') are extremely alike, sitting nearly on top of each other. This means the univariate distributions of the 'phi' variables do not contain much information to help distinguish 'signal' from 'background noise'.

From the box plot of 'PRI_tau_phi' (see below), we can further see that the 'phi' variables are essentially angles, ranging from -180 degrees (-3.14 radians) to 180 degrees (3.14 radians). The distribution is perfectly symmetric, with no outliers and no difference across the labels.

However, identical univariate distributions alone are not a reason to exclude the 'phi' features from the machine learning models later on, because they may still carry great influence for classifying the signal when combined with other features.
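For illustration, the overlap of the two class distributions can be drawn with a density plot like the following (a sketch; ggplot2 assumed, column names from the challenge data):

```r
# Signal vs. background densities of PRI_tau_phi; the two curves sit
# almost exactly on top of each other, as described above.
library(ggplot2)

ggplot(train_jet0, aes(x = PRI_tau_phi, fill = Label)) +
  geom_density(alpha = 0.4) +
  labs(x = "PRI_tau_phi (radians)", y = "Density")
```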

Another important finding, from the correlation-matrix graphs of each subset, is that some features are strongly positively correlated with others, which may introduce multicollinearity problems, especially for logistic regression models used to classify the labels. We therefore did not include logistic regression in the modeling below.

Additionally, in the 'jet=0' subset, 99% of the values of 'DER_pt_tot' are identical to those of 'DER_pt_h', which indicates that one of the two features could be removed for further modeling. In our reduced-features part, we dropped 'DER_pt_tot'.

[Figure: correlation matrices for the 'jet=0', 'jet=1', and 'jet=2/3' subsets]
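A sketch of how one of these correlation matrices can be produced (the corrplot package is an assumption; any heat-map tool would do):

```r
# Pairwise correlations among the numeric features of the jet=0 subset.
library(corrplot)

num  <- train_jet0[, sapply(train_jet0, is.numeric)]
corr <- cor(num, use = "pairwise.complete.obs")
corrplot(corr, method = "color", tl.cex = 0.5)
```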

 

4.2 Principal Component Analysis

The Principal Component Analysis (PCA) scree plot for the 'jet=0' subset shows that the first dimension explains about 22.1% of the total variance, the second about 14.3%, and the third about 14.1%. The first 3 dimensions together thus explain only about 50% of the total variance, which is quite low. Additionally, we see no sharp drop-off in the percentage of variance explained, suggesting no natural cut-off point for keeping some dimensions and discarding others.

 

Looking at the scree plot for the 'jet=1' subset, we see a very similar pattern: low variance explained by the first few dimensions and no sharp drop-off for the later dimensions.

 

Again, the same pattern holds for the 'jet=2/3' subset; a sketch of the PCA step follows.
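A sketch of the PCA behind these plots, for one subset (prcomp from base R; the factoextra package is assumed for the scree plot and factor map):

```r
# PCA on the standardized numeric features of the jet=0 subset.
# Identifier and weight columns (EventId, Weight) are excluded first.
library(factoextra)

feats    <- setdiff(names(train_jet0), c("EventId", "Weight", "Label"))
num      <- train_jet0[, feats][, sapply(train_jet0[, feats], is.numeric)]
pca_jet0 <- prcomp(num, center = TRUE, scale. = TRUE)

fviz_screeplot(pca_jet0, ncp = 15)            # variance explained per dimension
fviz_pca_var(pca_jet0, col.var = "contrib")   # variables factor map
```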

 

From the variables factor map (PCA) for train_jet0, we can see that DER_sum_pt, DER_mass_MMC, DER_mass_vis, and PRI_lep_pt contribute almost 80% of their variance to dimension 1, and also contribute some variance to dimension 2, though not as much as to dimension 1. For dimension 2, DER_mass_transverse_met_lep contributes almost 80% of its variance, along with some portion placed on dimension 1.

[Figure: variables factor map (PCA) with variable contributions, train_jet0]

From the variables factor map (PCA) for train_jet1, DER_sum_pt contributes a large share of its variance to dimension 1. DER_mass_MMC and DER_mass_vis contribute heavily to dimension 2, and also place part of their variance on dimension 1.

[Figure: variables factor map (PCA), train_jet1]

 

From the variables factor map (PCA) for train_jet2/3, DER_mass_jet_jet and DER_deltaeta_jet_jet place a large portion of their variance on dimension 1, while DER_pt_h and PRI_jet_leading_pt contribute large variance to dimension 2.

[Figure: variables factor map (PCA), train_jet2/3]

 

From the exploratory analysis of the three PCAs above, we found that DER_mass_MMC contributes a lot of variance in both train_jet0 and train_jet1, so we may assume this variable has a large impact on the data as a whole. We will use the later models to test whether this factor is truly important.

 

5. Machine Learning Approaches

5.1 Random Forest

We tried Random Forest as our first algorithm. We wanted to set a benchmark score and see which variables Random Forest considers important and which it does not. Additionally, we wanted to compare the list of variables PCA deems important against the list Random Forest deems important. We believed this analysis would not only let us understand the variables better, but also help us eliminate unnecessary variables that contribute little to the AMS score. There were a few variables we paid particular attention to, including "DER_mass_MMC" (the only variable that contained missing values after segregating the data by jet number) and the variables containing the term 'phi' (variables that explain little variance according to PCA).

Our modeling process with Random Forest came in 3 stages, with the AMS score increasing at each stage (3.28 -> 3.42).

In the first stage, we fitted a Random Forest model for each of the 3 subsets using the 'rf' method in the caret package, tuning over 'mtry' values of 1, 2, 5, and 20. Selecting on the absolute AMS score, the best model for every subset used 'mtry' of 20 (the maximum value tested), although the differences were marginal. In other words, the best models all wanted the maximum number of variables considered at each split of the tree. The models looked fine, but when we plotted the Receiver Operating Characteristic (ROC) curve and computed the area under the curve (AUC), the predictions were 100% correct on the training set: a clear sign that we were probably overfitting the training data.

To solve, or at least alleviate, the overfitting problem, instead of choosing the models with the highest AMS score we went after the simplest models within one standard error of the best empirical models, which brought us to stage 2 of fitting Random Forest models. Using this new selection rule, we picked models with 'mtry' equal to 1 or 2. When we plotted the ROC curve again, sadly, we still saw a perfect classification rate. A sketch of this caret setup follows.
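(We selected on the AMS score, which would require a custom summary function in caret; for brevity this sketch uses caret's built-in one-standard-error rule with its default metric.)

```r
# Random forest per subset; selectionFunction = "oneSE" implements the
# "simplest model within one standard error" rule from stage 2.
# Label must be a factor with valid level names ('s', 'b').
library(caret)

ctrl <- trainControl(method = "cv", number = 5, classProbs = TRUE,
                     selectionFunction = "oneSE")

rf_jet0 <- train(Label ~ . - EventId - Weight, data = train_jet0,
                 method = "rf", trControl = ctrl,
                 tuneGrid = expand.grid(mtry = c(1, 2, 5, 20)))
```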

[Figure: ROC curve for the Random Forest model on the 'jet=2/3' subset]

AMS = 3.28

 



[Figure: Random Forest variable importance for the 'jet=2/3' subset]

How did we find the classification threshold? The models output class probabilities, so we swept candidate cutoffs (expressed as the percentage of events classified as signal) and kept the one that maximized the AMS score, as sketched below.

[Figure: AMS score as a function of the percentage of events classified as signal]
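A sketch of that search (prob, y, and w stand for predicted signal probabilities, true labels, and event weights on a validation set; the names are ours):

```r
# Sweep candidate cutoffs and keep the one that maximizes AMS.
ams <- function(s, b, b_reg = 10) {
  sqrt(2 * ((s + b + b_reg) * log(1 + s / (b + b_reg)) - s))
}

best_cutoff <- function(prob, y, w) {
  cuts   <- quantile(prob, probs = seq(0.50, 0.99, by = 0.01))
  scores <- sapply(cuts, function(t) {
    sel <- prob > t   # events classified as signal at this cutoff
    ams(sum(w[sel & y == "s"]), sum(w[sel & y == "b"]))
  })
  cuts[which.max(scores)]
}
```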

AMS = 3.42

Final 'mtry' values per subset: 1 for 'jet=0', 3 for 'jet=1', and 3 for 'jet=2/3'.

 

5.2 Gradient Boosting Machine

 

[Figure: GBM variable importance for the 'jet=2/3' subset]

AMS = 3.35
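The post reports only the result for this model; a minimal gbm sketch under the same setup, with purely illustrative hyperparameters, might look like:

```r
# Bernoulli GBM on one subset; n.trees, depth, and shrinkage are assumptions.
library(gbm)

train_jet0$y <- as.integer(train_jet0$Label == "s")
gbm_jet0 <- gbm(y ~ . - Label - Weight - EventId, data = train_jet0,
                distribution = "bernoulli", n.trees = 1000,
                interaction.depth = 4, shrinkage = 0.05, cv.folds = 5)

best_iter <- gbm.perf(gbm_jet0, method = "cv")   # CV-optimal tree count
summary(gbm_jet0, n.trees = best_iter)           # variable importance
```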

 

5.3 XGBoost

 


[Figure: XGBoost variable importance for the 'jet=2/3' subset]

[Figure: ROC curve for the XGBoost model on the 'jet=2/3' subset]

AMS = 3.48
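Likewise, only the result is reported here; a minimal xgboost sketch with assumed hyperparameters:

```r
# XGBoost on one subset; eta, max_depth, and nrounds are assumptions.
library(xgboost)

feats  <- setdiff(names(train_jet0), c("EventId", "Label", "Weight", "y"))
dtrain <- xgb.DMatrix(as.matrix(train_jet0[, feats]),
                      label  = as.integer(train_jet0$Label == "s"),
                      weight = train_jet0$Weight)

params   <- list(objective = "binary:logistic", eta = 0.1,
                 max_depth = 6, eval_metric = "auc")
xgb_jet0 <- xgb.train(params, dtrain, nrounds = 300)
xgb.importance(model = xgb_jet0)                 # variable importance
```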

 

5.4 Reduced Features

Based on the earlier findings, we refit the models on a reduced feature set, dropping low-information variables such as 'DER_pt_tot'.

[Figure: results with the reduced feature set]

6. Conclusion

The following is the code for this project (all written in R).

About Authors

Yunrou Gong

Yunrou Gong worked as a Business Analyst for Sanity Lighting Co., an LED manufacturer. Through her work as an analyst, she developed her interest in data-driven approaches to problem solving. She holds an M.S. in Operational Research, specializing...
View all posts by Yunrou Gong >

Le Wei

Le is a data science enthusiast who brought his passion for data science to this bootcamp. He majored in marketing during his bachelor's degree, and learned in college how to make the right decision...
View all posts by Le Wei >

Jielei (Emma) Zhu

Emma (Jielei) Zhu graduated from New York University in May 2016 with a B.A. in Computer Science and Psychology and a minor in Mathematics. In school, Emma was able to explore her interests to the fullest by taking...
View all posts by Jielei (Emma) Zhu >

Shuheng Li

With the intention of becoming a great Data Scientist, Shuheng is an out-of-the-box thinker and self-motivator who focuses on Machine Learning and Statistics, and seeks challenging projects and competitions that push his skills to an advanced level....
View all posts by Shuheng Li >

Shuye Han

View all posts by Shuye Han >


