Higgs Boson Machine Learning Challenge
Contributed by Adam Cone and Zachary Escalante. They are currently in the NYC Data Science Academy 12-week full-time Data Science Bootcamp program taking place between January 11th and April 1st, 2016. This post is based on their fourth class project - machine learning (due on the 8th week of the program).
The Data
The training data for this project consisted of a single data frame with 33 columns and 250,000 rows. 30 of the columns were independent variables with numerical values: 29 floating point, 1 integer. One of the columns, EventId, was an integer observation index not to be used for prediction. Another column, Weight, was a floating point artifact of the simulation the data originated from. Finally, the single dependent variable, Label, was a two-level factor with levels 'b' and 's', for background and signal. For a more complete description of the data, the physics motivation for the problem, and the accuracy metric used, please visit https://www.kaggle.com/c/higgs-boson. The test data consisted of the same variables as the training data, except for Weight and Label, and had 550,000 rows. Our objective was to predict whether each row of the test data corresponded to a 'b' (background) or 's' (signal).
Missingness
Before deciding on the type of predictive model to apply to the decay data, we examined the data's missingness. We found six missingness combinations in the 33 columns and 250,000 rows. We found exactly the same six missingness combinations in the test data. This indicated to us that the missingness was not due to random instrumentation error, but rather systematic. To understand why the missingness might follow the pattern we observed, we studied the particle physics background. We found two sources of missingness that explained all the missing data: topology and number of jets.
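Concretely, the patterns can be counted with a sketch along the following lines (assuming the Kaggle training.csv, in which missing values are coded as -999.0 rather than NA):

```r
# Minimal sketch: count the distinct row-wise missingness patterns.
# In this data set, missing values are coded as -999.0, not NA.
train <- read.csv("training.csv")

num_cols <- sapply(train, is.numeric)             # Label is a factor; skip it
miss <- train[, num_cols] == -999.0               # logical matrix of missingness
patterns <- apply(miss, 1, paste, collapse = "")  # one pattern string per row
table(patterns)                                   # six distinct patterns
```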
The topology of a particular particle decay is the collection of decay channels that result in the particular end-stage particles picked up by the detector. For example, the topology of the Higgs boson decay signal that corresponds to the 's' Label in our data set was: one Higgs boson decaying into two tau particles; one tau decaying into a lepton and two neutrinos; and the other tau decaying into a hadronic tau and one neutrino.
For certain topologies, the physicists could generate estimates for the mass of the candidate Higgs boson.
For observations where a topology was recognized, a floating point value for DER_mass_MMC was present. When the topology was unexpected, the value for DER_mass_MMC was missing.
Jets are pseudo-particles that result from top quark decay, one of the three expected background decay topologies. 12 of the 30 columns of independent data were either direct measurements of jet quantities or quantities derived from those measurements. As a result, when, for instance, no jets were observed (not all decay topologies involve jets), the data in the jet-related columns were missing.
The six categories of missingness therefore corresponded to the combinations of 1) a familiar or unfamiliar topology and 2) 0, 1, or 2+ observed jets: 2 × 3 = 6, the number of missingness categories in the data. Once we understood that the missingness in the data corresponded to different physical paradigms, we decided to model each of the six missingness categories as its own data set, as in the sketch below. We analyzed each category in isolation from the others, both in our elastic net regression and in our SVM construction.
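A minimal sketch of that split, assuming the column names from the Kaggle data (PRI_jet_num counts the jets, and DER_mass_MMC is -999.0 when the topology was not recognized):

```r
# Partition by jet count (0, 1, 2+) and DER_mass_MMC missingness,
# yielding the six categories described above
train$jet_cat  <- cut(train$PRI_jet_num, breaks = c(-1, 0, 1, 3),
                      labels = c("0", "1", "2+"))
train$mmc_miss <- train$DER_mass_MMC == -999.0

subsets <- split(train, list(train$jet_cat, train$mmc_miss), drop = TRUE)
length(subsets)  # 6
```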
Elastic Net Regression
To get some insight into which of the 30 independent variables might be particularly influential for this problem, we used elastic net regression. Elastic net regression is a weighted combination of ridge and lasso regression, designed to minimize the sum of the regression's residual sum of squares (RSS) and a measure of the model complexity. The idea is that, by minimizing this quantity, we can obtain the simplest model that still explains the data well. The method simplifies the model by shrinking the coefficients of variables that do not contribute to effectively predicting the dependent variable. We tuned a unique elastic net regression to each of the six data categories. In each regression, values are selected for alpha (which controls the relative weights of the ridge and lasso complexity penalties) and lambda (which controls the influence of the complexity penalty versus the RSS). After we tuned regressions for each of the six categories, we examined the variables that still had large coefficients. Before the individual elastic net regressions, each of the six categories had 17-30 variables; after the regression, each category retained 2-6 variables.
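A sketch of how such a fit could be set up with the glmnet package; the alpha grid and the lambda.1se rule are our illustrative choices, not necessarily the original settings:

```r
# Sketch: one elastic net fit per category; alpha (ridge/lasso mix) is chosen
# over a small grid, lambda by cross-validation inside cv.glmnet
library(glmnet)

fit_enet <- function(df) {
  drop <- c("EventId", "Weight", "Label", "jet_cat", "mmc_miss")
  x <- as.matrix(df[, setdiff(names(df), drop)])
  x <- x[, apply(x, 2, var) > 0, drop = FALSE]  # drop all-missing (-999) columns
  y <- as.numeric(df$Label == "s")
  alphas <- seq(0, 1, by = 0.1)
  cvs <- lapply(alphas, function(a) cv.glmnet(x, y, alpha = a, family = "binomial"))
  cvs[[which.min(sapply(cvs, function(cv) min(cv$cvm)))]]
}

fits <- lapply(subsets, fit_enet)

# Variables whose coefficients survive the shrinkage, per category
surviving <- lapply(fits, function(cv) {
  co <- coef(cv, s = "lambda.1se")
  nz <- rownames(co)[as.vector(co != 0)]
  setdiff(nz, "(Intercept)")
})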
Interestingly, in all six regressions on different data subsets, two variables appeared important: DER_deltar_lep_tau and DER_pt_ratio_lep_tau. DER_deltar_lep_tau is a measure of the angular difference between the emitted lepton and tau particles. DER_pt_ratio_lep_tau is a measure of the relative transverse momenta of the emitted lepton and tau particles. Since these two variables survived the elastic net pruning process in all six cases, it seemed to us that they might have relatively large predictive power. These results were only suggestive, however, and would need to be confirmed by SVM modeling of the reduced variables.
Machine Learning Algorithm
Our group decided to experiment with fitting a Support Vector Machine (SVM) to each of the six subsets of our data in order to classify signal and background noise. SVMs are designed to cut a hyperplane between data points in high-dimensional spaces for classification purposes. Considering the number of variables in each of our subsets, we believed this was an especially appropriate algorithm for this data set.
There were three attributes of our model which we had to consider: feature selection (addressed using our elastic net regression), kernel selection, and the parameters for our kernel (cost and gamma). For our feature selection, we decided to tune an SVM to each of our subsets with only the variables that we found to be significant for the respective data frame (as per the elastic net regression). We concurrently tuned another model on the full data frame, including all variables for each subset, with the goal of determining whether the reduced data sets had predictive power comparable to the full data sets.
Code for tuning our SVM model on our first data frame:
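The original snippet is not reproduced here; the following is a reconstruction sketch using the e1071 package, with illustrative cost and gamma grids (our assumptions, not necessarily the original values):

```r
# Sketch: tune a radial-kernel SVM on the first category with 2-fold CV
library(e1071)

df1 <- subsets[[1]]
df1 <- df1[, setdiff(names(df1), c("EventId", "Weight", "jet_cat", "mmc_miss"))]
df1 <- df1[, sapply(df1, function(col) !is.numeric(col) || var(col) > 0)]

tune1 <- tune.svm(Label ~ ., data = df1, kernel = "radial",
                  cost  = 10^(-1:2),   # illustrative grid
                  gamma = 10^(-3:0),   # illustrative grid
                  tunecontrol = tune.control(cross = 2))
svm1 <- tune1$best.model
```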
Function to choose the right model to predict each instance of the test set:
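Again a reconstruction sketch rather than the original function; it assumes the six tuned models are stored in a list keyed the same way split() named the training subsets:

```r
# Sketch: route each test row to the model for its missingness category
predict_higgs <- function(test, models) {
  test$jet_cat  <- cut(test$PRI_jet_num, breaks = c(-1, 0, 1, 3),
                       labels = c("0", "1", "2+"))
  test$mmc_miss <- test$DER_mass_MMC == -999.0
  key <- interaction(test$jet_cat, test$mmc_miss, drop = TRUE)

  preds <- character(nrow(test))
  for (k in levels(key)) {
    idx <- key == k
    preds[idx] <- as.character(predict(models[[k]], newdata = test[idx, ]))
  }
  preds
}
```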
Full Model vs Reduced Model
We tuned our models using 21% of our data and 2-fold cross validation (the computation time to tune our models using 5- or 10-fold cross validation would have prohibited us from finishing the tuning process in time to submit our predictions). We see in the chart below that, using elastic net regression, we were able to eliminate upwards of 80% of our variables and maintain a high degree of accuracy on our training set.
Prediction Comparison
We first submitted our predictions using the models tuned on our reduced data set and received a Kaggle score of 1.218 (rank of 1633 at the time of submission). We then submitted our predictions using the model we fit to the full data set and scored just over 3.02 (a rank of approximately 1023 on the leaderboard).
Conclusion
Variables that were eliminated by elastic net regression were still useful in tuning our SVM model. Even though the coefficients of many of our variables had been shrunk to zero, those variables still provided a meaningful basis for the SVM to construct hyperplanes that significantly improved the classification (as evidenced by the improvement of our Kaggle score from 1.218 to 3.02). We had expected that the full model would perform moderately better than the reduced model, but we were surprised by exactly how much worse the reduced model performed. It was evident that our SVM models performed best with no variable elimination, no matter the significance of those variables in the elastic net regression.
The SVM was computationally very expensive (one model could take upwards of 3 hours to tune with 5-fold cross validation), but it seemed to perform well as a classifier (especially compared to other teams that used Random Forests).
Improvements
- Use the results of our elastic net regression to perform logistic regression, as opposed to SVM, for classification. We realized that elastic net regression might be a poor method of determining which variables would be important for cutting hyperplanes in our SVM model; in the future, we would instead apply a logistic regression model using the variables selected by the elastic net regression (see the sketch after this list).
- We would also like to understand the primary drivers of the computation time for tuning our SVM. It was clear that increasing certain tuning parameters greatly increased our run time, but it was much harder to decipher which parameters contributed most to the increase.
- Add additional cross-validation folds: due to the time required to tune each model, we were unable to implement the cross-validation we wanted. Ideally, k = 10 or 20 would be our preferred number of folds.
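For the first improvement, a minimal sketch of what that follow-up could look like, reusing the surviving variables from the elastic net sketch above (glm is base R; the choice of category is arbitrary):

```r
# Hypothetical follow-up: logistic regression on the elastic-net-selected
# variables for one category
vars1 <- surviving[[1]]                 # from the elastic net sketch above
glm1  <- glm(reformulate(vars1, response = "Label"),
             data = subsets[[1]], family = binomial)
head(predict(glm1, type = "response"))  # estimated P(Label == "s")
```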