NYC Data Science Academy| Blog
Bootcamps
Lifetime Job Support Available Financing Available
Bootcamps
Data Science with Machine Learning Flagship ๐Ÿ† Data Analytics Bootcamp Artificial Intelligence Bootcamp New Release ๐ŸŽ‰
Free Lesson
Intro to Data Science New Release ๐ŸŽ‰
Find Inspiration
Find Alumni with Similar Background
Job Outlook
Occupational Outlook Graduate Outcomes Must See ๐Ÿ”ฅ
Alumni
Success Stories Testimonials Alumni Directory Alumni Exclusive Study Program
Courses
View Bundled Courses
Financing Available
Bootcamp Prep Popular ๐Ÿ”ฅ Data Science Mastery Data Science Launchpad with Python View AI Courses Generative AI for Everyone New ๐ŸŽ‰ Generative AI for Finance New ๐ŸŽ‰ Generative AI for Marketing New ๐ŸŽ‰
Bundle Up
Learn More and Save More
Combination of data science courses.
View Data Science Courses
Beginner
Introductory Python
Intermediate
Data Science Python: Data Analysis and Visualization Popular ๐Ÿ”ฅ Data Science R: Data Analysis and Visualization
Advanced
Data Science Python: Machine Learning Popular ๐Ÿ”ฅ Data Science R: Machine Learning Designing and Implementing Production MLOps New ๐ŸŽ‰ Natural Language Processing for Production (NLP) New ๐ŸŽ‰
Find Inspiration
Get Course Recommendation Must Try ๐Ÿ’Ž An Ultimate Guide to Become a Data Scientist
For Companies
For Companies
Corporate Offerings Hiring Partners Candidate Portfolio Hire Our Graduates
Students Work
Students Work
All Posts Capstone Data Visualization Machine Learning Python Projects R Projects
Tutorials
About
About
About Us Accreditation Contact Us Join Us FAQ Webinars Subscription An Ultimate Guide to
Become a Data Scientist
    Login
NYC Data Science Acedemy
Bootcamps
Courses
Students Work
About
Bootcamps
Bootcamps
Data Science with Machine Learning Flagship
Data Analytics Bootcamp
Artificial Intelligence Bootcamp New Release ๐ŸŽ‰
Free Lessons
Intro to Data Science New Release ๐ŸŽ‰
Find Inspiration
Find Alumni with Similar Background
Job Outlook
Occupational Outlook
Graduate Outcomes Must See ๐Ÿ”ฅ
Alumni
Success Stories
Testimonials
Alumni Directory
Alumni Exclusive Study Program
Courses
Bundles
financing available
View All Bundles
Bootcamp Prep
Data Science Mastery
Data Science Launchpad with Python NEW!
View AI Courses
Generative AI for Everyone
Generative AI for Finance
Generative AI for Marketing
View Data Science Courses
View All Professional Development Courses
Beginner
Introductory Python
Intermediate
Python: Data Analysis and Visualization
R: Data Analysis and Visualization
Advanced
Python: Machine Learning
R: Machine Learning
Designing and Implementing Production MLOps
Natural Language Processing for Production (NLP)
For Companies
Corporate Offerings
Hiring Partners
Candidate Portfolio
Hire Our Graduates
Students Work
All Posts
Capstone
Data Visualization
Machine Learning
Python Projects
R Projects
About
Accreditation
About Us
Contact Us
Join Us
FAQ
Webinars
Subscription
An Ultimate Guide to Become a Data Scientist
Tutorials
Data Analytics
  • Learn Pandas
  • Learn NumPy
  • Learn SciPy
  • Learn Matplotlib
Machine Learning
  • Boosting
  • Random Forest
  • Linear Regression
  • Decision Tree
  • PCA
Interview by Companies
  • JPMC
  • Google
  • Facebook
Artificial Intelligence
  • Learn Generative AI
  • Learn ChatGPT-3.5
  • Learn ChatGPT-4
  • Learn Google Bard
Coding
  • Learn Python
  • Learn SQL
  • Learn MySQL
  • Learn NoSQL
  • Learn PySpark
  • Learn PyTorch
Interview Questions
  • Python Hard
  • R Easy
  • R Hard
  • SQL Easy
  • SQL Hard
  • Python Easy
Data Science Blog > Machine Learning > Higgs Boson Kaggle Machine Learning Competition

Higgs Boson Kaggle Machine Learning Competition

Ho Fai Wong, Wanda Wang, Rob Castellano and Yannick
Posted on Jun 12, 2016

Contributed by Wanda Wang, Rob Castellano, and Ho Fai Wong. They are currently in the NYC Data Science Academy 12 week full-time Data Science Bootcamp program taking place between April 11th to July 1st, 2016. This post is based on their fourth class project - Machine Learning (due on the 8th week of the program).

I. Introduction

The ATLAS experiment and the CMS experiment recently claimed the discovery of the Higgs boson, a particle theorized almost 50 years ago to have the role of giving mass to other elementary particles. The ATLAS experiment at the Large Hadron Collider (LHC) at CERN provided simulated data used by physicists in a search for the Higgs boson for the Higgs Boson Machine Learning Challenge organized on Kaggle.

The immediate goal of the Kaggle competition was to explore the potential of advanced classification methods to improve the statistical significance of the experiment. Specifically, the Higgs boson has many different processes through which it can decay. When it decays, it produces other particles via specific channels. The challenge consisted in developing a model to classify events corresponding to a Higgs to tau tau decay channel (signal) versus other particles and processes (background).

The Higgs Boson kaggle dataset was used in this analysis. The data exploration and machine learning was performed in R. All the code is available here.

II. Exploratory Data Analysis

Missingness

We carried out exploratory data analysis and quickly noticed that 11 of the 30 features contained missing data (initially recorded as -999). Upon closer inspection, 10 of the 11 features were related to jets (pseudo-particles) and specifically to the number of jets for a given event (recorded in the PRI_jet_num factor variable):

  • 0 jets: features related to jets were all missing data, understandably
  • 1 jet: features related to primary jets were no longer missing data, but those related to 2 or more jets were still missing
  • 2 or more jets: all features related to jets contained data

In other words, the jet number can be treated as a factor for missingness in these 10 features. The 11th feature with missing data was DER_mass_MMC (unrelated to number of jets), and was due to the topology of the event being too far from the expected topology (not an error in measurement).  The figure below shows the proportion of missingness as a function of jet number.

missingnessandjets

Given these observations on the missing data, we decided not to perform advanced imputation moving forward.

Principal Component Analysis

We then carried out Principal Component Analysis (PCA) to get a better understanding of the relationship between the 30 different features. When the features are plotted along the first principal component versus the second principal component (Figure below), it is clear that there are several clusters of features. These clusters of features indicates that collinearity is present among many of the features.

PC1PC2One relationship that particularly stood out across 9 principal components was between the feature derived mass and label (whether the particle is signal or background noise) which showed that the derived mass and label have a very strong relationship.

PCAanalysis

Importance of Mass

Looking deeper into the mass features, mass seems a good predictor of presence of the Higgs Boson since the derived mass of the Higgs Boson is different from other (Z and W) Bosons and subatomic particles.  Derived mass was the differentiating feature that the CERN scientists used to determine whether a particle was a Higgs Boson particle or not. The upper figure below is taken from the Science paper where the CERN scientists explain their findings.  The small blue area centered at 125 GeV represents signal from Higgs Boson while the red area centered at 98 GeV represents background noise from Z Bosons. The lower figure below, we reproduce this graph using the simulated data set from the Kaggle competition.  The simulated data set has a higher signal to background ratio than found experimentally to help with the model fitting. The dataset is weighted to take this discrepancy into account.

sciencemass

density

Correlation Matrix

Plotting a correlation matrix indicates strong covariance among subsets of the features.  This was taken into consideration when we explored using predictive models in the next section.

corrplot

Logistic Regression

We performed an initial logistic regression to further explore the dataset and investigate variable importance.

Initial Feature Engineering

14 Features with long-tailed distributions were log transformed to reduce the positive skew towards smaller values, generating a more uniform distribution e.g. DER_mass_jet_jet, the invariant mass of the two jets. This feature engineering to create more normal distributions can help with the model fitting. Although this feature engineering was not specifically used in our models, we would consider performing log transformations of these 14 features in the future.

pre_log_transformed

log_transformed

Variable Importance

The bar charts below illustrate the variable importance obtained from logistic regression for the saturated (30 variables) and the stepwise BIC model (18 variables). To note, 12 DER prefixed variables were kept as a result of the stepwise model, compared to only 6 PRI variables which remained. Derived variables thus have more descriptive value.

Sat_Model-VarImp

Stepwise-VarImp

Choice of AUC as model fit metric

In order to iteratively tune, improve, and combine various machine learning algorithms, a stable and reliable model fit metric was crucial. The Kaggle competition used approximate median discovery significance (AMS) scores to assess model performance but after performing online research and running a few simulations, it seemed that the AMS metric wasn't very stable. We opted instead to use the receiver operating characteristic area under the curve (AUC) which maximizes the true positive rate while also minimizes the false positive rate, and produces a stable, smooth and continuous function unlike AMS.

RF_AUC_plot

Logistic Regression Analysis

The logistic regression led to the following results:

  • Saturated Model: R.Squared: 0.20227
  • Stepwise BIC model: R.Squared: 0.20223

In comparing the deviance of the stepwise model to the deviance of the saturated model, the p-value for the overall test of deviance is > 0.65 (high). The AUC plots are also not very different from one another.

AUC-logistic

III. Models

We decided to use 3 machine learning algorithms to build an ensembled classification model:

  • Random Forest
  • Gradient Boosting Model
  • XGBoost

Random Forest

For the Random Forest (RF) model, we performed a 5-fold cross-validation to tune the number of randomly selected features used in a given decision tree (mtry).  Random forests with a wide mtry grid (mtry: 1, 2, 3, 6, 9) on small subset (20%) of the training data was fit. The best model had a mtry of 6. Random forests with a more narrow mtry grid (4, 5, 6, 7, 8) centered around the previous best mtry of 6 was fit to a large subset (80%) of the training data. The best RF model in this case had a mtry of 5, and this model was used for prediction.

The training prediction accuracy was calculated by using the RF model to predict on the 20% of the training data that was not used to fit the RF model. The AUC on this data was = 0.9071, and was considered satisfactory. This model was then fit to the test data and submitted to the public leader board of kaggle, where the prediction received an AMS score of 2.57949 and was ranked 1,311 on the leader board.

The variable importance graph (below) for the RF model shows the top 20 most important features in the model. The top 3 most important features are related to mass, which is in agreement with the science behind Higgs Boson detection.

rfvarimportancecolor

Gradient Boosting Model

For the Gradient Boosting Model (GBM), we performed a 5-fold cross-validation using a tuning grid, resulting in the following parameters:

  • shrinkage i.e. learning rate: 0.1
  • interaction_depth i.e. depth of variable interactions: 3
  • n.trees i.e. number of trees: 150
  • n.minobsinnode i.e. minimum number of observations in a terminal node: 10

In summary:

  • AUC on training data = .855
  • Kaggle rank = 1394
  • AMS = 2.30069

The GBM model performed slightly worse than Random Forest, for the chosen parameter values.

The variable importance graph for the GBM model also highlights the importance of the mass related features, but also the derived momentum as well.

gbm_importance

XGBoost

We then used XGBoost to attempt to improve the results achieved thus far. XGBoost is a fast gradient boosting algorithm implementing in C++ by Tianqi Chen. It allows for some parallel computation, more tuning parameters and is generally faster and performs better than gbm; this is due to coding efficiency and the fact that xgboost is not a completely greedy algorithm (unlike gbm).

We also performed a 5-fold cross-validation using a tuning grid, resulting in the following parameters:

  • nrounds i.e. number of trees: 200
  • max_depth: 5
  • colsample_bytree i.e. percent of parameters used at each split: 0.85
  • eta i.e. learning rate: 0.2

In summary:

  • AUC on training data = .9254
  • Kaggle rank = 1340
  • AMS = 2.49958

The XGBoost model performed better than GBM but still not as well as Random Forest, for the chosen parameter values.

The variable importance graph for the XGBoost model also highlights the importance of the mass related features.

xgboost_var_importance

Step-by-step code used to run XGBoost is shown below:

  • Read in the data, perform data munging, create train/test datasets:
  • Tune XGBoost parameters by iterating until model stable (1st and 5th tune shown below):
  • Use predictions on training dataset to plot ROC-AUC and determine best threshold:
  • Predict on test dataset and create submission file for Kaggle:

Ensembling

Finally, we ensembled our 3 models by majority vote in an attempt to further improve the overall model accuracy.

We ultimately achieved a slight improvement on the Random Forest model:

  • Kaggle rank = 1309
  • AMS = 2.58510

IV. Room for Improvement

As with any modeling, there is always room for improvement. In particular, feature engineering i.e. including additional variables could improve our model's predictive accuracy:

  • Introducing basic physics features, e.g. Cartesian coordinates of momentum variables
  • Advanced physics: e.g. CAKE variable
  • Better understand the physics of additional variables
  • Perform appropriate transformations on variables, such as the log transformations described above

In addition, using more machine learning algorithms, more sophisticated ensembling methods and esembling running different random seeds for the same model could further increase the discriminating accuracy of our final model.

In the meantime, if you have any comments, suggestions or insights, please comment below!

About Authors

Ho Fai Wong

With a diverse background in computer science and 9 years in Financial Services Technology Consulting, Ho Fai has been applying his analytical, problem-solving, relationship and team management skills at PwC, one of the Big Four consulting firms, focusing...
View all posts by Ho Fai Wong >

Wanda Wang

Wanda is excited about combining data science with compelling narratives to uncover new enterprise opportunities. With 5+ years of experience in the Investment Management field, including at both Citigroup and JPMorgan - Wanda thrives in demanding, client-driven environments...
View all posts by Wanda Wang >

Rob Castellano

Rob recently received his Ph.D. in Mathematics from Columbia. His training as a pure mathematician has given him strong quantitative skills and experience in using creative problem solving techniques. He has experience conveying abstract concepts to both experts...
View all posts by Rob Castellano >

Yannick

View all posts by Yannick >

Related Articles

Machine Learning
Data Housing Price Predictions Using Advanced Regression
Big Data
House Price Prediction with Creative Feature Engineering and Advanced Regression Techniques
Machine Learning
Allstate Auto Insurance Claims Severity Prediction
Machine Learning
Zillow Prize: Competing to Improve the Zestimate
Machine Learning
Machine Learning Kaggle Competition-Mission Zillow

Leave a Comment

Cancel reply

You must be logged in to post a comment.

Google August 31, 2021
Google Wonderful story, reckoned we could combine some unrelated information, nevertheless actually really worth taking a look, whoa did one discover about Mid East has got extra problerms too.
Google March 24, 2021
Google We came across a cool web page that you simply could get pleasure from. Take a look if you want.
Google September 15, 2020
Google Wonderful story, reckoned we could combine several unrelated data, nonetheless truly worth taking a look, whoa did 1 learn about Mid East has got much more problerms also.
Isaac Burbank November 29, 2017
It's actually very difficult in this full of activity life to listen news on TV, thus I only use web for that reason, and take the newest information.
Doug September 10, 2016
Thanks for the excellent advice, it actually is useful.
Lords Mobile Hack September 7, 2016
Great line up. We'll be linking to this great post on our site. Keep up the great writing.
http://www.porthacks.com/ September 7, 2016
Saved as a favorite, I actually like your site!
Magnolia August 14, 2016
Thanks for the great info, it really is useful.

View Posts by Categories

All Posts 2399 posts
AI 7 posts
AI Agent 2 posts
AI-based hotel recommendation 1 posts
AIForGood 1 posts
Alumni 60 posts
Animated Maps 1 posts
APIs 41 posts
Artificial Intelligence 2 posts
Artificial Intelligence 2 posts
AWS 13 posts
Banking 1 posts
Big Data 50 posts
Branch Analysis 1 posts
Capstone 206 posts
Career Education 7 posts
CLIP 1 posts
Community 72 posts
Congestion Zone 1 posts
Content Recommendation 1 posts
Cosine SImilarity 1 posts
Data Analysis 5 posts
Data Engineering 1 posts
Data Engineering 3 posts
Data Science 7 posts
Data Science News and Sharing 73 posts
Data Visualization 324 posts
Events 5 posts
Featured 37 posts
Function calling 1 posts
FutureTech 1 posts
Generative AI 5 posts
Hadoop 13 posts
Image Classification 1 posts
Innovation 2 posts
Kmeans Cluster 1 posts
LLM 6 posts
Machine Learning 364 posts
Marketing 1 posts
Meetup 144 posts
MLOPs 1 posts
Model Deployment 1 posts
Nagamas69 1 posts
NLP 1 posts
OpenAI 5 posts
OpenNYC Data 1 posts
pySpark 1 posts
Python 16 posts
Python 458 posts
Python data analysis 4 posts
Python Shiny 2 posts
R 404 posts
R Data Analysis 1 posts
R Shiny 560 posts
R Visualization 445 posts
RAG 1 posts
RoBERTa 1 posts
semantic rearch 2 posts
Spark 17 posts
SQL 1 posts
Streamlit 2 posts
Student Works 1687 posts
Tableau 12 posts
TensorFlow 3 posts
Traffic 1 posts
User Preference Modeling 1 posts
Vector database 2 posts
Web Scraping 483 posts
wukong138 1 posts

Our Recent Popular Posts

AI 4 AI: ChatGPT Unifies My Blog Posts
by Vinod Chugani
Dec 18, 2022
Meet Your Machine Learning Mentors: Kyle Gallatin
by Vivian Zhang
Nov 4, 2020
NICU Admissions and CCHD: Predicting Based on Data Analysis
by Paul Lee, Aron Berke, Bee Kim, Bettina Meier and Ira Villar
Jan 7, 2020

View Posts by Tags

#python #trainwithnycdsa 2019 2020 Revenue 3-points agriculture air quality airbnb airline alcohol Alex Baransky algorithm alumni Alumni Interview Alumni Reviews Alumni Spotlight alumni story Alumnus ames dataset ames housing dataset apartment rent API Application artist aws bank loans beautiful soup Best Bootcamp Best Data Science 2019 Best Data Science Bootcamp Best Data Science Bootcamp 2020 Best Ranked Big Data Book Launch Book-Signing bootcamp Bootcamp Alumni Bootcamp Prep boston safety Bundles cake recipe California Cancer Research capstone car price Career Career Day ChatGPT citibike classic cars classpass clustering Coding Course Demo Course Report covid 19 credit credit card crime frequency crops D3.js data data analysis Data Analyst data analytics data for tripadvisor reviews data science Data Science Academy Data Science Bootcamp Data science jobs Data Science Reviews Data Scientist Data Scientist Jobs data visualization database Deep Learning Demo Day Discount disney dplyr drug data e-commerce economy employee employee burnout employer networking environment feature engineering Finance Financial Data Science fitness studio Flask flight delay football gbm Get Hired ggplot2 googleVis H20 Hadoop hallmark holiday movie happiness healthcare frauds higgs boson Hiring hiring partner events Hiring Partners hotels housing housing data housing predictions housing price hy-vee Income industry Industry Experts Injuries Instructor Blog Instructor Interview insurance italki Job Job Placement Jobs Jon Krohn JP Morgan Chase Kaggle Kickstarter las vegas airport lasso regression Lead Data Scienctist Lead Data Scientist leaflet league linear regression Logistic Regression machine learning Maps market matplotlib Medical Research Meet the team meetup methal health miami beach movie music Napoli NBA netflix Networking neural network Neural networks New Courses NHL nlp NYC NYC Data Science nyc data science academy NYC Open Data nyc property NYCDSA NYCDSA Alumni Online Online Bootcamp Online Training Open Data painter pandas Part-time performance phoenix pollutants Portfolio Development precision measurement prediction Prework Programming public safety PwC python Python Data Analysis python machine learning python scrapy python web scraping python webscraping Python Workshop R R Data Analysis R language R Programming R Shiny r studio R Visualization R Workshop R-bloggers random forest Ranking recommendation recommendation system regression Remote remote data science bootcamp Scrapy scrapy visualization seaborn seafood type Selenium sentiment analysis sentiment classification Shiny Shiny Dashboard Spark Special Special Summer Sports statistics streaming Student Interview Student Showcase SVM Switchup Tableau teachers team team performance TensorFlow Testimonial tf-idf Top Data Science Bootcamp Top manufacturing companies Transfers tweets twitter videos visualization wallstreet wallstreetbets web scraping Weekend Course What to expect whiskey whiskeyadvocate wildfire word cloud word2vec XGBoost yelp youtube trending ZORI

NYC Data Science Academy

NYC Data Science Academy teaches data science, trains companies and their employees to better profit from data, excels at big data project consulting, and connects trained Data Scientists to our industry.

NYC Data Science Academy is licensed by New York State Education Department.

Get detailed curriculum information about our
amazing bootcamp!

Please enter a valid email address
Sign up completed. Thank you!

Offerings

  • HOME
  • DATA SCIENCE BOOTCAMP
  • ONLINE DATA SCIENCE BOOTCAMP
  • Professional Development Courses
  • CORPORATE OFFERINGS
  • HIRING PARTNERS
  • About

  • About Us
  • Alumni
  • Blog
  • FAQ
  • Contact Us
  • Refund Policy
  • Join Us
  • SOCIAL MEDIA

    ยฉ 2025 NYC Data Science Academy
    All rights reserved. | Site Map
    Privacy Policy | Terms of Service
    Bootcamp Application