
Kaggle Higgs Boson Machine Learning Challenge

Deepak Khurana
Posted on Sep 1, 2016

This blog presents a comprehensive exploratory data analysis of the Higgs Boson Machine Learning Challenge, with a particular focus on feature engineering and selection. The response variable is binary: each event is either a Higgs signal or background. A second goal is to determine whether certain properties of the data can help us decide which models are more appropriate for this problem and how to choose their parameters. I apply the insights learned to achieve an AMS score of 3.53929.

Contents

  1. Data
  2. Mystery of  "-999.0"
  3. AMS Metric
  4. Data Preprocessing
  5. Response
  6. Features
  7. Feature Reduction
  8. Resultant Data for Predictive Modeling
  9. A sample xgboost model
  10. Conclusions
  11. Future Work

1  Data 

A description of the data is available from Kaggle's website for the competition. The data consists mainly of a training file and a test file on which to make predictions and submit for evaluation. The snapshot below shows this information as it is displayed on the website.

[Figure: training and test data descriptions from the Kaggle competition page]

2 Mystery of "-999.0"

One peculiar aspect of both the training and test data is that many values are -999.0. The data description states that these entries are meaningless or could not be computed, and -999.0 is used as a placeholder to indicate that.

[Figure: excerpt from the data description explaining the -999.0 placeholder]

 

One can analyze the nature and extent of these -999.0 entries by replacing them with NAs and performing a missing data analysis. We start by plotting the fraction of the training data that is -999.0 for each feature, along with the combinations of affected features that occur in the dataset.
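A minimal pandas sketch of this missing-data check (the file name and column prefixes follow the competition's data, but the exact code here is illustrative):

```python
import numpy as np
import pandas as pd

# Hypothetical file name; the competition's training file provides the
# DER_* / PRI_* feature columns plus Label and Weight.
train = pd.read_csv("training.csv")

# Treat the -999.0 placeholders as missing values
feature_cols = [c for c in train.columns if c.startswith(("DER_", "PRI_"))]
na = train[feature_cols].replace(-999.0, np.nan).isna()

# Fraction of -999.0 entries per feature
print(na.mean().sort_values(ascending=False).head(15))

# Distinct combinations of features that are -999.0 together
patterns = na.apply(lambda row: tuple(row.index[row]), axis=1).value_counts()
print(len(patterns), "distinct missingness patterns")
```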

The plot shows that there are 11 columns containing -999.0 values, falling into three subgroups of 1, 3, and 7 columns. The combinations of features in the figure indicate that there are 6 such combinations. Doing the same analysis on the submission data gives exactly the same plot. This indicates that the original data can be subdivided into 6 groups according to which features contain -999.0.

 

[Figure: pattern of -999.0 entries across features and feature combinations]

Investigating the names of the features containing -999.0 shows that the affected columns are: ["DER_mass_MMC"], ["PRI_jet_leading_pt", "PRI_jet_leading_eta", "PRI_jet_leading_phi"], and ["DER_deltaeta_jet_jet", "DER_mass_jet_jet", "DER_prodeta_jet_jet", "DER_lep_eta_centrality", "PRI_jet_subleading_pt", "PRI_jet_subleading_eta", "PRI_jet_subleading_phi"].

Diving further into the technical documentation, it is evident that the group of 7 features is associated with two-jet events and is undefined for events with one or no jets. Similarly, the group of 3 features is associated with at least one jet and is undefined for events with no jets. There are also observations where the Higgs mass is undefined, and this is not jet dependent. In conclusion, the original data can be subdivided into six groups (2 x 3) according to whether or not the Higgs mass is defined and whether the event produced one jet, more than one jet, or no jets at all. We add two new features to incorporate this information.
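A small sketch of how these two indicator columns might be added, continuing with the hypothetical train DataFrame from above (PRI_jet_num is the jet count provided in the data):

```python
import numpy as np

# DER_mass_MMC == -999.0 marks events where the Higgs mass could not be estimated;
# PRI_jet_num is the number of jets in the event (0, 1, 2, 3).
train["higgs_defined"] = (train["DER_mass_MMC"] != -999.0).astype(int)
train["jet_group"] = np.minimum(train["PRI_jet_num"], 2)  # 0 jets, 1 jet, 2 or more jets
```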

 

3 AMS Metric

AMS (Approximate Median Significance) is the evaluation metric for the competition. It depends on the weighted signal and background counts in a peculiar way, as shown below.

 

[Figure: AMS metric definition]
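For reference, a minimal Python version of the AMS as defined in the competition documentation, where s and b are the weighted sums of true and false positives and the regularization constant is 10:

```python
import numpy as np

def ams(s, b, b_reg=10.0):
    """Approximate Median Significance.

    s     -- weighted sum of true positives (signal events classified as signal)
    b     -- weighted sum of false positives (background events classified as signal)
    b_reg -- constant regularization term, set to 10 in the competition
    """
    return np.sqrt(2.0 * ((s + b + b_reg) * np.log(1.0 + s / (b + b_reg)) - s))
```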

A plot of the AMS over the full range of s and b for the training data is shown below. The AMS color is saturated for scores above 4, since the top leaderboard score is below 4 and we want to concentrate on what makes those models good.

 

[Figure: AMS over the range of true positive and false positive rates]

The red region corresponds to high AMS scores, which are linked to low false positive and high true positive rates. That is expected, but the peculiar aspect of the figure above is that a whole range of models can achieve a top-ranked score of 4: one such model would identify just 25% of the Higgs signal correctly while keeping the false positive rate at 3%, and still obtain an AMS score of 4.

Next we investigate how much the AMS score on the training data is influenced by performance on the 6 data subgroups identified in the last section. The red line in the previous plot is produced by a perfect prediction of the training data; the blue line below indicates the AMS if every prediction is said to be a signal. For each data subgroup we calculate the AMS score by assigning either signal or background to all the events in that subgroup while using correct predictions for the rest of the data.

 

[Figure: AMS when each subgroup is assigned entirely to signal or background]

The figure above shows that performance on the data where the Higgs mass is not defined has almost no effect on the AMS score if everything in those subgroups is classified as background. This behavior occurs because there is very little signal in the data where the Higgs mass is not defined; moreover, the weights of those signals are low while the background carries higher weights. Together, these factors make classifying everything as background nearly as good as classifying everything correctly.

Thus, classifying all events where the Higgs mass is not defined as background performs equally well in terms of the AMS score. This reduces the number of data subgroups for predictive modeling to 3: the Higgs mass is defined and the event produced one jet, more than one jet, or no jets at all.

4 Data Preprocessing

We prepare our four subgroups of data for both training and test, remove all features that contain NAs or have become redundant, and scale the data. We are left with only 60% of the original data that is useful for predictive modeling. Doing the same for the submission data also leaves 60%, further cementing the idea of dropping NAs, setting aside the data where the Higgs mass is undefined, and splitting the remaining data into three subgroups.
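A rough sketch of this preprocessing, reusing the hypothetical feature_cols, higgs_defined, and jet_group names from the earlier sketches:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

subgroups = {}
for g in (0, 1, 2):
    # Events where the Higgs mass is defined and the jet group matches
    mask = (train["higgs_defined"] == 1) & (train["jet_group"] == g)
    sub = train.loc[mask, feature_cols].replace(-999.0, np.nan)

    # Drop columns that are entirely undefined in this subgroup,
    # plus PRI_jet_num, which the grouping has made redundant
    sub = sub.dropna(axis=1, how="all").drop(columns=["PRI_jet_num"], errors="ignore")

    X = StandardScaler().fit_transform(sub)
    y = (train.loc[mask, "Label"] == "s").astype(int)
    w = train.loc[mask, "Weight"]
    subgroups[g] = (X, y, w, list(sub.columns))
```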

5 Response

We explore the response variable of the Higgs Kaggle data, which is named Label in the dataset. Plotting histograms of the weights associated with the Labels indicates that Label and weight are dependent on each other: signals are exclusively assigned lower weights than background, with no overlap between the two Labels. This is a pretty good reason for Kaggle not to provide weights for the test set!

[Figure: histograms of weights for signal and background]

A density plot of the background weights shows some distinction between the three subgroups, but nothing obvious, so we leave it at that and come back if needed.

[Figure: density plot of background weights by subgroup]

A histogram of the signal weights shows that there are three weights, or channels, in which the Higgs is being sought. The "no jet" events carry higher weights, while the "two or more jets" events are larger in number but have lower weights.

[Figure: histogram of signal weights]

6 Features

Correlation 

We first look at the correlations among features for the three data subgroups. The order here is {2, 1, 0}. The upper triangle, which corresponds to the DER (derived) features, shows correlation in all three subgroups. These are not observations but features engineered by the CERN group using particle physics. The lower right triangles, which contain the PRI (primitive) features, show little correlation. It could be a good idea to drop the correlated, engineered DER features.
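A minimal sketch of one such correlation heat map, assuming df2 is a DataFrame holding the subgroup-2 feature columns:

```python
import seaborn as sns
import matplotlib.pyplot as plt

# df2: hypothetical DataFrame holding the subgroup-2 feature columns
cols = sorted(df2.columns, key=lambda c: not c.startswith("DER_"))  # DER_* first, PRI_* after
corr = df2[cols].corr()

plt.figure(figsize=(10, 8))
sns.heatmap(corr, cmap="RdBu_r", center=0)
plt.title("Feature correlations, subgroup 2")
plt.tight_layout()
plt.show()
```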

 

[Figure: correlation heat maps for subgroups 2, 1, and 0]
Principal Component Analysis  (PCA)

Examining the variance explained by the principal components indicates that there is room for feature reduction in all three subgroups. For example, for subgroup 2, fifteen components out of 30 explain 90% of the variance in the data, and 24 components explain 99%.
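A short sketch of this explained-variance check with scikit-learn, assuming X2 is the scaled feature matrix for subgroup 2:

```python
import numpy as np
from sklearn.decomposition import PCA

# X2: hypothetical name for the scaled subgroup-2 feature matrix (e.g. subgroups[2][0] above)
pca = PCA().fit(X2)
cum_var = np.cumsum(pca.explained_variance_ratio_)
for target in (0.90, 0.95, 0.99):
    n = int(np.searchsorted(cum_var, target)) + 1
    print(f"{target:.0%} of the variance is explained by {n} components")
```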

[Figure: variance explained by principal components, subgroup 2]

Subgroup 1: {90, 95, 99}% of variance = {11, 13, 15} components        Subgroup 0: {90, 95, 99}% of variance = {9, 10, 13} components

[Figure: variance explained by principal components, subgroups 1 and 0]

PCA Eigenvectors

Our next challenge is to identify which original features have little or no influence in explaining the data. We start by multiplying the PCA eigenvalues by the corresponding PCA eigenvectors and plotting the projections on a heat map. This product represents the transformed variance along the PCA directions. We then sort the heat map along the horizontal axis by the PCA eigenvalues, from lowest on the left to highest on the right. As is evident from the figure below, the transformed variance for the low eigenvalues (starting from F30 and going towards F1) is essentially zero, which is what we would expect from the PCA plot in the last section.

[Figure: heat map of eigenvalue-weighted PCA loadings, subgroup 2]

We then sort the original features along the vertical axis by their contribution to the variance, in descending order of the transformed variance products. We sum the absolute values of the contributions rather than the raw contributions, since a feature can have a positive or negative projection (indicated by the red and blue colors). At the end of this process, the features that contribute the least variance are displayed at the bottom.

The last 9 features in the above plot stand out from the rest, as they have white blocks, i.e., zero contribution towards the first four principal components. Another important observation is that they are all phi or eta angle features. The phi and eta angles in subgroup 1 and subgroup 0 show the same behavior. In the next section we will see why they are the least useful in explaining the data.
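A sketch of the ranking just described, continuing from the fitted pca object above (feature_names is a hypothetical list of the original column names):

```python
import numpy as np

# Scale each eigenvector (row of pca.components_) by its eigenvalue: this is the
# transformed variance each original feature carries along each principal direction.
loadings = pca.components_.T * pca.explained_variance_   # shape (n_features, n_components)

# Rank the original features by the sum of their absolute contributions
contribution = np.abs(loadings).sum(axis=1)
order = np.argsort(contribution)
least_influential = [feature_names[i] for i in order[:9]]  # feature_names: hypothetical list
print("Least influential features:", least_influential)
```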

[Figure: heat maps of eigenvalue-weighted PCA loadings, subgroups 1 and 0]

Density Plots 

Let's look at the density plots of the last 9 whitewashed features of subgroup 2. They share a few common characteristics. As pointed out earlier, they are all angle features for the directions of particles and jets. The 5 phi angle features are uniformly and identically distributed over their range for both the signal and the background. This is true to some extent for the eta features as well, but for phi it is strikingly so. Conceptually this makes sense, as particles and jets scatter off in all directions whether they are signal or background, so these variables follow a uniform distribution.

[Figure: density plots of the 9 least influential features, subgroup 2]

The plot below contrasts this uniform distribution of the least influential features with the density plots of the 9 most influential features.

[Figure: density plots of the 9 most influential features, subgroup 2]

The same is evident for the angle features of subgroup 1 and subgroup 0.

[Figure: density plots of the angle features, subgroups 1 and 0]

7 Feature reduction

The last section gave us plenty to think about with respect to which features are least influential in explaining the variance. But success in this competition is determined by maximizing the AMS score, so discarding any features, or for that matter any amount of data, may not be a wise decision. We can still use the above insights as a guide, and I propose the following iterative approach.

First Iteration   

  • Drop  DER features
  • Drop eta features
  • Drop phi  features
  • Assign Background to Higgs Undefined
  • Separate Predictive modeling on 3 subgroups

Second Iteration

  • Drop eta features
  • Drop phi  features
  • Assign Background to Higgs Undefined
  • Separate Predictive modeling on 3 subgroups

Third Iteration

  • Drop phi  features
  • Assign Background to Higgs Undefined
  • Separate Predictive modeling on 3 subgroups

Fourth Iteration  (if needed)

  • Assign Background to Higgs Undefined
  • Separate Predictive modeling on 3 subgroups

Fifth Iteration      (if needed)

  • Predictive modeling on 3 subgroups and Higgs Undefined

Sixth Iteration      (Nope. Do something else)

  • Brute force on full data

8 Resultant Data for Predictive Modeling

Let's evaluate the approach outlined above. Using only the data that matters reduces the computing power and time needed, and the resulting models should be more accurate because noise is reduced. Moreover, having less data to deal with means one can try more computationally expensive models, like neural networks and SVMs, and attempt automatic feature engineering. To quantify this benefit, we plot the amount of data used at each modeling iteration.

[Figure: amount of data used at each modeling iteration]

9 A sample xgboost Model

We fit an xgboost decision tree model to our training data using the insights above. I chose xgboost here for its low variance, low bias, high speed, and accuracy. We follow the "Third Iteration" scheme from section 7: we prepare the training and test sets by dropping the phi variables, assigning the background label to data where the Higgs mass is not defined, and splitting the data where the Higgs mass is defined into 3 subgroups.

"AUC" is the metric of choice here as it responds well to misclassification errors.  The optimal number of trees to maximize the AUC score will be found by cross validation. We fit the whole training data to  the optimal number of trees for each dataset and make predictions for the test data to submit to Kaggle. 

The private AMS score for this model is 3.53929. That is satisfactory for now, considering we did not tune any hyper-parameters except the number of trees and used the same threshold for all three subgroups. A grid search would be needed to find three different thresholds.
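For concreteness, a rough sketch of the workflow described in this section, assuming the subgroups dictionary from the preprocessing sketch earlier; the hyper-parameter values are illustrative, not the ones used for the submitted model:

```python
import xgboost as xgb

params = {"objective": "binary:logistic", "eval_metric": "auc",
          "eta": 0.1, "max_depth": 6}   # illustrative values

boosters = {}
for g, (X, y, w, cols) in subgroups.items():
    dtrain = xgb.DMatrix(X, label=y, weight=w)

    # Cross-validate to find the number of trees that maximizes AUC
    cv = xgb.cv(params, dtrain, num_boost_round=500, nfold=5,
                early_stopping_rounds=25, seed=0)
    best_rounds = int(cv["test-auc-mean"].idxmax()) + 1

    # Refit on the full training subgroup with that number of trees
    boosters[g] = xgb.train(params, dtrain, num_boost_round=best_rounds)

# At prediction time, events with an undefined Higgs mass are labeled background
# outright; the remaining events are scored by their subgroup's booster and
# thresholded (the same cut for all three subgroups here).
```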

 

 


 

10 Conclusions

  • -999.0 values are placeholders for undefined values; any attempt at imputation is plain wrong and will only make things worse.
  • The -999.0 values split the original data into six subgroups, i.e., whether the Higgs mass is defined (or not) and how many jets were formed (0, 1, or more than 1).
  • A high AMS score requires predominantly a low false positive rate, along with a high true positive rate.
  • Setting everything to background for observations where the Higgs mass is not defined has a very tiny effect on the AMS score.
  • One can effectively do predictive modeling on the three subgroups of data where the Higgs mass is defined, split by how many jets were formed (0, 1, or more than 1).
  • The weights and the signal are dependent on each other.
  • The Higgs is being sought in three channels in the data.
  • The DER variables are correlated with each other, while the PRI variables are uncorrelated for the most part.
  • The phi and eta angle features have the least influence in explaining the variance.
  • Only 16% of the data is uncorrelated data that explains most of the variance.
  • A simple xgboost model gives a respectable AMS of 3.53929.

11 Future Work

  • Grid search for hyperparameters of xgboost model and thresholds
  • Ensemble Methods
  • Stacking
  • Feature engineering
  • Use "Iteration 1" and "Iteration 2", with the least amount of uncorrelated and significant data, together with more computationally expensive methods such as neural networks and SVMs to attempt automatic feature engineering

 

About Author

Deepak Khurana

Deepak holds a Master's Degree in Physics from the Indian Institute of Technology Kharagpur, one of the top engineering schools in India. He was then awarded the Henry M. MacCracken fellowship at New York University to pursue a...
