Higgs Boson Machine Learning Challenge

Zachary Escalante
Posted on Jun 10, 2016

Contributed by  and . They are currently in the NYC Data Science Academy 12-week full-time Data Science Bootcamp program, taking place from January 11th to April 1st, 2016. This post is based on their fourth class project, Machine Learning (due in the 8th week of the program).

The Data

The training data for this project consisted of a single data frame with 33 columns and 250,000 rows. 30 of the columns were independent variables with numerical values: 29 floating-point and 1 integer. One of the columns, EventId, was an integer observation index not to be used for prediction. Another column, Weight, was a floating-point artifact of the simulation from which the data originated. Finally, the single dependent variable, Label, was a two-level factor with levels 'b' and 's', for background and signal. For a more complete description of the data, the physics motivation for the problem, and the accuracy metric used, please visit https://www.kaggle.com/c/higgs-boson. The test data consisted of the same variables as the training data, except for Weight and Label, and had 550,000 rows. Our objective was to predict whether each row of the test data corresponded to a 'b' (background) or 's' (signal).

Missingness

[Figure: missingness combinations]

Before deciding on the type of predictive model to apply to the decay data, we examined the data's missingness. We found six missingness combinations in the training data (33 columns and 250,000 rows), and exactly the same six combinations in the test data. This indicated to us that the missingness was systematic rather than the result of random instrumentation error. To understand why the missingness might follow the pattern we observed, we studied the particle physics background. We found two sources of missingness that explained all the missing data: topology and number of jets.
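A minimal sketch of how such missingness combinations could be counted, not the code we actually ran. It assumes missing values are encoded with the sentinel -999.0 and that each row is a dict keyed by column name; the toy rows and the column PRI_jet_leading_pt are illustrative assumptions.

```python
from collections import Counter

MISSING = -999.0  # assumption: sentinel value that marks a missing entry


def missingness_patterns(rows, columns):
    """Count how many rows share each set of missing columns."""
    patterns = Counter()
    for row in rows:
        missing = frozenset(c for c in columns if row[c] == MISSING)
        patterns[missing] += 1
    return patterns


def same_patterns(train_rows, test_rows, columns):
    """True when both data sets contain exactly the same missingness combinations."""
    train = set(missingness_patterns(train_rows, columns))
    test = set(missingness_patterns(test_rows, columns))
    return train == test


# Toy data (hypothetical values) with two columns.
columns = ["DER_mass_MMC", "PRI_jet_leading_pt"]
toy_train = [
    {"DER_mass_MMC": 125.0, "PRI_jet_leading_pt": 40.0},
    {"DER_mass_MMC": MISSING, "PRI_jet_leading_pt": 55.0},
    {"DER_mass_MMC": 110.0, "PRI_jet_leading_pt": MISSING},
    {"DER_mass_MMC": 98.0, "PRI_jet_leading_pt": MISSING},
]
toy_test = [
    {"DER_mass_MMC": MISSING, "PRI_jet_leading_pt": 61.0},
    {"DER_mass_MMC": 130.0, "PRI_jet_leading_pt": 47.0},
    {"DER_mass_MMC": 101.0, "PRI_jet_leading_pt": MISSING},
]

toy_patterns = missingness_patterns(toy_train, columns)
```

Finding the same small set of patterns in both data sets is what suggested to us that the missingness was structural.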

The topology of a particular particle decay is the collection of decay channels that result in the particular end-stage particles picked up by the detector. For example, the topology of the Higgs boson decay signal that corresponds to the 's' Label in our data set was: one Higgs boson decaying into two tau particles; one tau decaying into a lepton and two neutrinos; and the other tau decaying into a hadronic tau and one neutrino.

For certain topologies, the physicists could generate estimates for the mass of the candidate Higgs boson.

For observations where a topology was recognized, a floating point value for DER_mass_MMC was present. When the topology was unexpected, the value for DER_mass_MMC was missing.

Jets are pseudo-particles that result from top quark decay, one of the three expected background decay topologies. 12 of the 30 columns of independent data were either direct measurements of jet quantities or quantities derived from those measurements. As a result, when no jets were observed, for instance (not all decay topologies involve jets), the data in the jet-related columns were missing.

The six categories of missingness therefore corresponded to combinations of 1) a familiar or unfamiliar topology and 2) 0, 1, or 2+ observed jets. 2*3 = 6: the number of missingness categories in the data. Once we understood that the missingness in the data corresponded to different physical paradigms, we decided to model each of the six missingness categories as its own data set. We analyzed each category in isolation from the others, both in our elastic net regression and in our SVC construction.
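The split into six categories can be sketched as follows. This is an illustration under assumptions rather than our project code: it assumes the -999.0 missing-value sentinel, rows stored as dicts, and a jet-count column named PRI_jet_num.

```python
MISSING = -999.0  # assumption: sentinel value that marks a missing entry


def category(row):
    """Return (mass_estimated, jet_bin) for one observation.

    mass_estimated is True when DER_mass_MMC is present (familiar topology).
    jet_bin is 0, 1, or 2, where 2 stands for "2 or more jets".
    """
    mass_estimated = row["DER_mass_MMC"] != MISSING
    jet_bin = min(int(row["PRI_jet_num"]), 2)
    return (mass_estimated, jet_bin)


def split_by_category(rows):
    """Group rows into at most 2 * 3 = 6 subsets keyed by category."""
    groups = {}
    for row in rows:
        groups.setdefault(category(row), []).append(row)
    return groups


# Toy rows (hypothetical values).
toy_rows = [
    {"DER_mass_MMC": 125.0, "PRI_jet_num": 0},
    {"DER_mass_MMC": MISSING, "PRI_jet_num": 3},
    {"DER_mass_MMC": 118.0, "PRI_jet_num": 2},
    {"DER_mass_MMC": 131.0, "PRI_jet_num": 3},
]
toy_groups = split_by_category(toy_rows)
```

Each resulting subset can then have its always-missing columns dropped before modelling.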

Elasticnet Regression

To get some insight into which of the 30 independent variables might be particularly influential for this problem, we used elastic net regression. Elastic net regression is a weighted combination of ridge and lasso regression, designed to minimize the sum of the regression's residual sum of squares (RSS) and a measure of the model complexity. The idea is that, by minimizing this quantity, we can obtain the simplest model that explains the data well enough. The method simplifies the model by shrinking the coefficients of variables that do not contribute effectively to predicting the dependent variable. We tuned a separate elastic net regression for each of the six data categories. In each regression, values were selected for alpha (which controls the relative weights of the ridge and lasso complexity penalties) and lambda (which controls the influence of the complexity penalty relative to the RSS). After tuning the regressions for each of the six categories, we examined the variables that still had large coefficients. Before the individual elastic net regressions, each of the six categories had 17-30 variables. After the regression, each of the categories had 2-6 variables.
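To make the roles of alpha and lambda concrete, here is a small sketch of the quantity being minimized. The scaling (RSS divided by 2n, ridge term halved) follows one common convention and is an assumption; other implementations scale the terms differently. The function and variable names are illustrative.

```python
def elastic_net_objective(X, y, beta, intercept, alpha, lam):
    """Penalized least-squares objective for given coefficients.

    alpha = 1 gives the pure lasso penalty, alpha = 0 the pure ridge penalty.
    lam scales the whole penalty relative to the residual sum of squares.
    """
    n = len(y)
    rss = 0.0
    for xi, yi in zip(X, y):
        fitted = intercept + sum(b * x for b, x in zip(beta, xi))
        rss += (yi - fitted) ** 2
    l1 = sum(abs(b) for b in beta)      # lasso part: sum of |coefficients|
    l2 = sum(b * b for b in beta)       # ridge part: sum of squared coefficients
    penalty = alpha * l1 + (1.0 - alpha) / 2.0 * l2
    return rss / (2.0 * n) + lam * penalty


# Toy example: a perfect fit, so only the penalty contributes.
X = [[1.0, 0.0], [0.0, 1.0]]
y = [1.0, 0.0]
beta = [1.0, 0.0]
lasso_value = elastic_net_objective(X, y, beta, 0.0, alpha=1.0, lam=0.5)
ridge_value = elastic_net_objective(X, y, beta, 0.0, alpha=0.0, lam=0.5)
```

Because the lasso part penalizes absolute values, it can push coefficients exactly to zero, which is what prunes variables.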

Interestingly, in all six regressions on different data subsets, two variables appeared important: DER_deltar_lep_tau and DER_pt_ratio_lep_tau. DER_deltar_lep_tau is a measure of the angular difference between the emitted lepton and tau particles. DER_pt_ratio_lep_tau is a measure of the relative transverse momenta of the emitted lepton and tau particles. Since these two variables survived the elastic net pruning process in all six cases, it seemed to us that they might have relatively large predictive power. These results were only suggestive, however, and would need to be confirmed by SVM modelling of the reduced variables.

Machine Learning Algorithm

Our group decided to experiment with fitting a Support Vector Machine (SVM) to each of the six subsets of our data in order to classify observations as signal or background noise. SVMs are designed to cut a hyperplane between data points in high-dimensional spaces for classification purposes. Considering the number of variables in each of our subsets, we believed this was an especially appropriate algorithm for this data set.
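As a rough illustration of what cutting a hyperplane means, the sketch below classifies a point by which side of a hyperplane it falls on. It shows only the linear case with hand-picked, hypothetical weights; a fitted SVM learns the weights and bias from data and may use a non-linear kernel, and mapping the positive side to 's' is an arbitrary choice made here.

```python
def decision_value(weights, bias, x):
    """Signed score of x relative to the hyperplane w . x + bias = 0."""
    return sum(w * xi for w, xi in zip(weights, x)) + bias


def classify(weights, bias, x):
    """Label a point 's' on the positive side of the hyperplane, 'b' otherwise."""
    return "s" if decision_value(weights, bias, x) >= 0 else "b"


# Hypothetical hyperplane in a two-variable space.
weights = [1.0, -2.0]
bias = 0.5

label_a = classify(weights, bias, [3.0, 1.0])   # 3 - 2 + 0.5 = 1.5
label_b = classify(weights, bias, [0.0, 1.0])   # 0 - 2 + 0.5 = -1.5
```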

There were three attributes of our model which we had to consider: feature selection (addressed using our Elastic Net regression), kernel selection, and the parameters for our kernel (cost and lambda). For our feature selection, we decided to tune an SVM to each of our subsets with only the variables that we found to be significant for the respective data frame (as per the Elastic Net regression). We concurrently tuned another model on the full data frame, including all variables, for each subset, with the goal of determining whether the reduced data sets had predictive power comparable to the full data sets.

Code for tuning our SVM model on our first data frame:

https://gist.github.com/adamcone/a00ea5e7f05560ac890d909bc4349fdd

Function to choose the right model to predict each instance of the test set:

https://gist.github.com/adamcone/c8c5770e3b1eecceb539f37616f856e6
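The idea behind that function can be sketched as below. This is not the code in the gist; it is an illustration that assumes the -999.0 missing-value sentinel, a jet-count column named PRI_jet_num, and a dict of already-fitted models, represented here by stand-in functions.

```python
MISSING = -999.0  # assumption: sentinel value that marks a missing entry


def category(row):
    """Key identifying which of the six subsets a row belongs to."""
    mass_estimated = row["DER_mass_MMC"] != MISSING
    jet_bin = min(int(row["PRI_jet_num"]), 2)
    return (mass_estimated, jet_bin)


def predict_all(rows, models):
    """Route each row to the model fitted on its own category."""
    return [models[category(row)](row) for row in rows]


# Stand-in "models" (hypothetical): one callable per category.
models = {
    (mass, jets): (lambda row, label=("s" if mass else "b"): label)
    for mass in (True, False)
    for jets in (0, 1, 2)
}

toy_test_rows = [
    {"DER_mass_MMC": 122.0, "PRI_jet_num": 1},
    {"DER_mass_MMC": MISSING, "PRI_jet_num": 0},
]
toy_predictions = predict_all(toy_test_rows, models)
```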

Full Model vs Reduced Model

We tuned our models using 21% of our data and 2-fold cross validation. The computation time needed to tune our models using 5- or 10-fold cross validation would have prevented us from finishing the tuning process in time to submit our predictions. As the chart below shows, using Elastic Net regression we were able to eliminate upwards of 80% of our data and maintain a high degree of accuracy on our training set.
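The tuning setup (a subsample followed by k-fold splitting) can be sketched in a few lines. The function names, the seed, and the toy row count are illustrative assumptions, and the model fitting itself is omitted.

```python
import random


def subsample(n_rows, fraction, seed=0):
    """Pick a random fraction of row indices without replacement."""
    rng = random.Random(seed)
    k = round(n_rows * fraction)
    return sorted(rng.sample(range(n_rows), k))


def k_fold_splits(indices, k, seed=0):
    """Yield (train, held_out) index lists for k-fold cross validation."""
    rng = random.Random(seed)
    shuffled = list(indices)
    rng.shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]
    for i in range(k):
        held_out = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, held_out


# Toy run: 21% of 1,000 rows, then 2-fold cross validation.
sampled = subsample(1000, 0.21)
splits = list(k_fold_splits(sampled, 2))
```

With k = 2, each candidate parameter setting is fitted only twice, which is what kept the tuning time manageable.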

[Figure: accuracy table for the full and reduced models]

Prediction Comparison

We first submitted our predictions using the models tuned on our reduced data set and received a Kaggle score of 1.218 (rank of 1633 at the time of submission). We then submitted our predictions using the model we fit to the full data set and scored just over 3.02 (a rank of approximately 1023 on the leaderboard).
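The Kaggle score for this challenge is the approximate median significance (AMS), which rewards selecting signal events while penalizing background events mistaken for signal. The sketch below computes it from true labels, predictions, and event weights. The regularization constant of 10 and the use of the Weight column as event weights reflect our understanding of the challenge's metric; the function names and toy numbers are illustrative.

```python
import math


def ams(s, b, b_reg=10.0):
    """Approximate median significance.

    s: summed weight of true signal events predicted as signal.
    b: summed weight of background events predicted as signal.
    b_reg: regularization term added to b (assumed to be 10).
    """
    return math.sqrt(2.0 * ((s + b + b_reg) * math.log(1.0 + s / (b + b_reg)) - s))


def ams_from_predictions(labels, predictions, weights):
    """Accumulate s and b over the events predicted as signal, then score."""
    s = sum(w for y, p, w in zip(labels, predictions, weights) if p == "s" and y == "s")
    b = sum(w for y, p, w in zip(labels, predictions, weights) if p == "s" and y == "b")
    return ams(s, b)


# Toy example (hypothetical weights).
labels = ["s", "b", "s", "b"]
predictions = ["s", "s", "b", "b"]
weights = [10.0, 5.0, 2.0, 1.0]
toy_score = ams_from_predictions(labels, predictions, weights)
```

A higher AMS is better, so the jump from 1.218 to 3.02 is an improvement.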

Conclusion

Variables that were eliminated by Elastic Net regression were still useful in tuning our SVM model. Even though the coefficients of many of our variables had been pushed to zero, they still provided a meaningful basis for support vectors to create additional hyperplanes, with which our model was able to significantly improve the classification (as evidenced by the improvement of our Kaggle score from 1.218 to 3.02). We had expected that the full model would perform moderately better than the reduced model, but we were surprised by how much worse the reduced model performed. It was evident that our SVM models performed best with no variable elimination, regardless of the significance of those variables in the Elastic Net regression.

The SVM was very expensive computationally (one model could take upwards of 3 hours to tune given 5-fold cross validation), but seemed to do well as a classifier (especially compared to other teams that used Random Forests).

Improvements

  1. Use the results of our Elastic Net regression to perform Logistic regression as opposed to SVM for classification.
    • We realized that using Elastic Net regression might be a poor method of determining which variables would be important for cutting hyperplanes in our SVM model. In the future we would apply a Logistic regression model using the variables selected from the Elastic Net regression.
  2. We would also like to understand the primary drivers behind the computation time for tuning our SVM. It was clear that increasing certain parameters greatly increased our run-time, but it was much harder to determine which parameters contributed to the increase.
  3. Add additional cross validation parameters:
    • Due to the length of time associated with tuning each model, we were unable to implement the necessary cross-validation. Ideally, k = 10 or 20 would be our preferred number of folds.

