Signal to Noise: Finding the Higgs Boson
Machine learning can help us understand our universe
An experiment was designed to find the last missing particle predicted by the Standard Model of particle physics: the theorized Higgs Boson. This elusive particle was found in 2012, and the discovery led to a Nobel Prize in 2013. The experiment was performed at the European Laboratory for Particle Physics, whose abbreviation, CERN, is derived from its French name.
The analysis of the massive amount of data generated by the experiment was a challenge. One primary task of the analysis was to discern the particle signal from the background noise in the data. Machine learning can help people make this distinction, and in this article we will discuss which methods are best for this task and why we chose to use them.
The data come from the Higgs Boson Challenge on the data science website Kaggle, which hosts competitions for model creation and prediction. Click here to see the code we used for our models.
An ensemble of models generates the best prediction
Ensemble models can provide better predictions than single models because they reduce bias, reduce variance, and are less likely to overfit the training data. Three common variants of ensembling are averaging, voting, and weighting.
Averaging ensembles take the outputs of several models and average their results. Voting ensembles take the categorical outputs of multiple models and let the models vote on the outcome; if an even number of models is included, a tie-breaking rule must be defined. Weighted ensembles apply a weight to each model's output, allowing individual models to exert a larger influence. We created a three-model weighted ensemble, and in this post we discuss the results, the ensemble, and the individual models it contains.
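To make the first two variants concrete, here is a minimal sketch in R. The probability vectors are illustrative placeholders, not output from our actual models.

```r
# Illustrative signal probabilities from three already-trained models
p_xgb   <- c(0.91, 0.40, 0.75)
p_rf    <- c(0.85, 0.55, 0.60)
p_logit <- c(0.70, 0.35, 0.80)

# Averaging ensemble: mean of the predicted probabilities
p_avg <- (p_xgb + p_rf + p_logit) / 3

# Voting ensemble: each model votes at a 0.5 threshold; with an odd
# number of models a simple majority decides and no ties can occur
votes      <- (p_xgb > 0.5) + (p_rf > 0.5) + (p_logit > 0.5)
class_vote <- ifelse(votes >= 2, "s", "b")
```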
Three models were combined in an ensemble: XGBoost, Random Forest, and Logistic Regression
The ensemble was constructed from three models: an XGBoost model, a random forest model, and a logistic regression model. The output probabilities of the models were combined using one weighting coefficient per model. The coefficients were constrained to lie between 0 and 1 and to sum to 1, so that the ensemble probability would also lie between 0 and 1. The equation used was:
alpha * p_XGB + beta * p_RF + gamma * p_LOGIT = p_ensemble
A threshold, t, was then applied to the ensemble probabilities to classify the observations into signal and background. The four tuning parameters of the ensemble model, alpha, beta, gamma, and t, were trained with a grid search on bootstrapped subsets of the training data. The grid was constructed by splitting each of the four tuning parameters into 100 values between 0 and 1 and keeping only the combinations where the first three sum to 1. Before discussing the outcome of the ensembling, we will first turn to each of the individual models and their performance.
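The sketch below illustrates that search; it is not our exact tuning code. It assumes p_xgb, p_rf, and p_logit hold each model's signal probabilities on a bootstrapped training subset, with labels ("s"/"b") and event weights taken from the training file, and it uses the AMS definition from the challenge with the regularization term b_reg = 10. The grid is coarser than the 100-step grid described above so the example runs quickly.

```r
# Approximate Median Significance, as defined for the challenge
ams <- function(pred_signal, labels, weights, b_reg = 10) {
  s <- sum(weights[pred_signal & labels == "s"])  # true signal kept
  b <- sum(weights[pred_signal & labels == "b"])  # background kept
  sqrt(2 * ((s + b + b_reg) * log(1 + s / (b + b_reg)) - s))
}

# Coarse grid; gamma is derived so the three weights always sum to 1
grid <- expand.grid(alpha = seq(0, 1, by = 0.05),
                    beta  = seq(0, 1, by = 0.05),
                    t     = seq(0, 1, by = 0.05))
grid <- subset(grid, alpha + beta <= 1)
grid$gamma <- 1 - grid$alpha - grid$beta

best <- NULL
best_ams <- -Inf
for (i in seq_len(nrow(grid))) {
  g     <- grid[i, ]
  p_ens <- g$alpha * p_xgb + g$beta * p_rf + g$gamma * p_logit
  score <- ams(p_ens > g$t, labels, weights)
  if (score > best_ams) { best_ams <- score; best <- g }
}
best  # optimal alpha, beta, gamma and threshold t on this subset
```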
The low time and computational costs of XGBoost make it the most effective single model
EXtreme Gradient Boosting (XGBoost), authored by Tianqi Chen, Tong He, and Michael Benesty, is "an efficient implementation of (the) gradient boosting framework" (https://cran.r-project.org/web/packages/xgboost/xgboost.pdf). As such, it is a computationally friendly and extremely powerful learning algorithm that is used in many Kaggle competitions to provide highly accurate predictions. We chose to tune and train an XGBoost model, despite its limited interpretability, in an attempt to maximize our AMS score and our placing in the Kaggle competition. We began with code from the authors that had been written specifically for the Higgs Boson Challenge. Although optimal parameters had also been provided with the code, we wrote our own grid search to look for even better parameter sets. By varying the learning rate (eta), the maximum depth of individual trees (max_depth), and the number of trees built in each run of the model (nrounds), we arrived at an optimal parameter set: {eta = 0.05, max_depth = 9, nrounds = 120}. Candidate sets were compared by their AMS scores on the training data, and the selected parameters gave the highest score. It should be noted that numerous other parameters could be fine-tuned, and given more time and computing power we would have tuned further. Even so, the results demonstrate the predictive power of the algorithm: AMS scores on the training data were around 2.5, and when we applied the XGBoost model to the test data and submitted to Kaggle, we achieved an AMS score of 3.573, which would have placed us 661st.
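The following is a minimal sketch of the final fit with those tuned values, not our full competition pipeline (which also handled the event weights and the AMS-maximizing threshold); train_x and train_y are assumed names for the feature matrix and the 0/1 signal labels built from the Kaggle training file.

```r
library(xgboost)

# train_x: numeric feature matrix; train_y: 0/1 labels (1 = signal)
xgb_fit <- xgboost(
  data      = as.matrix(train_x),
  label     = train_y,
  objective = "binary:logistic",  # output signal probabilities
  eta       = 0.05,               # learning rate
  max_depth = 9,                  # maximum depth of each tree
  nrounds   = 120,                # number of boosting rounds
  verbose   = 0
)

p_xgb <- predict(xgb_fit, as.matrix(train_x))  # signal probabilities
```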
The primary drawback of the algorithm, however, lies in its interpretability: we were unable to speak to the contribution of individual features to the final prediction of the model. Since the goal of Kaggle competitions is primarily predictive power, the lack of interpretability matters little there. For interpretability, we built a logistic regression model, which is discussed below.
Our initial inspection of the dataset highlighted missingness in the data, much of which was tied to the number of "jets" recorded for each event. We also trained the XGBoost model on datasets split by the number of jets, to see whether predictions would improve if the missingness were handled explicitly by us. As it turned out, the algorithm managed the missingness in the data better than our manual splits did, especially an algorithm as powerful as XGBoost.
The random forest model was computationally expensive but provided more interpretability than boosting
One of the main reasons for including random forests in the ensemble is the algorithm's ability to pick a random subset of predictors at each node split of an individual tree. This behavior produces decorrelated trees, which, once aggregated, reduce the variance of the final predictions. In other words, it helps prevent the overfitting that other models may be prone to.
We approached the random forest in the same way we did XGBoost: we ran it on the dataset split by the number of jets and also on the complete training set.
Parameter tuning was done on a leaner training file consisting of 5,000 training and 5,000 testing records, using the caret package in R. The reduced size was chosen mainly for performance reasons, as the algorithm is fairly computationally heavy. We sought optimal parameter settings by maximizing the AMS score and minimizing the out-of-bag error on the subset of the training data. The tuning step yielded an optimal value of 7 for mtry, the number of predictors considered at each split in a tree (out of the 30 available predictors). A tree depth of 10 terminal nodes also performed best on this subset, although growing deeper trees lets the forest learn more from the training data without necessarily overfitting the test data.
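A minimal sketch of that tuning step follows. It assumes small_train is the reduced 5,000-row subset with the 30 predictors plus a factor Label column ("s"/"b"), and it omits the AMS comparison described above.

```r
library(caret)

set.seed(42)
rf_tune <- train(
  Label ~ .,
  data      = small_train,
  method    = "rf",
  tuneGrid  = expand.grid(mtry = c(3, 5, 7, 9, 11)),  # predictors per split
  maxnodes  = 10,                                     # terminal-node cap
  trControl = trainControl(method = "oob")            # out-of-bag error
)

rf_tune$bestTune  # mtry = 7 on our subset
```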
After applying these parameters to the datasets split by the number of jets, we achieved a fairly low AMS score of 2.53. We were surprised by this result, since the split data required minimal imputation of missing values and more closely represented a logical view of the data grouped by the number of jets.
On our second attempt, with the unsplit training set of 250,000 records, we needed to impute a larger amount of data, including some variables with 30 to 70% of their values missing. We used a bag-imputation approach, which fits a bagged model on the predictors that have data in order to impute values for the predictors that don't. We also let the random forest grow as deep as possible without pruning. While this risks overfitting the training data, it yielded a better result on the test data, with an AMS score of 2.9. We therefore used the probabilities from this model's output in the final ensemble.
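A sketch of the imputation and the fully grown forest is below. It assumes train_x holds the predictors with the missing-value placeholders already recoded to NA and train_label is a factor with levels "b" and "s"; the tree count is illustrative.

```r
library(caret)
library(randomForest)

# Bagged-tree imputation of missing predictor values
pre             <- preProcess(train_x, method = "bagImpute")
train_x_imputed <- predict(pre, train_x)

# Fully grown forest: no cap on terminal nodes, trees grow until pure
rf_full <- randomForest(x     = train_x_imputed,
                        y     = train_label,
                        mtry  = 7,
                        ntree = 500)

p_rf <- predict(rf_full, train_x_imputed, type = "prob")[, "s"]
```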
Another reason for using random forests was the interpretability that comes with variable importance scores, which quantify how much the misclassification error increases when a variable's information is left out of the model. The table below summarizes the variable importance of the final model (top 20 variables included). The caret package scales importance scores to a 0-100 range for ease of comparison.
As can be seen from the table above, three of the top four variables are mass-related, suggesting that mass is an important aspect of the data for separating the signal from the background noise efficiently. One more mass-related variable appears toward the bottom of the table, but it was one of the metrics with 70% of its data imputed, which likely reduced the variance in that metric.
Logistic regression provides the most interpretability but inferior predictive ability
Compared to the other models we built, the logistic model was simple to build, quick to run and optimize, and offered better interpretability. As with the other algorithms, we split the data by jet number and trained three individual models, which we compared to a logistic model trained on the full dataset. A first set of models was trained using all variables/features; however, when reviewing the summary output of the logistic models in R, it became clear that a significant number of coefficients could not be distinguished from zero with any confidence. We excluded those features from the final models. The table below shows the variables we could confidently say were significant. Interestingly, a large number of variables were significant in each of the three split datasets as well as in the full model. With more time, we would want to look further at the excluded variables to see whether they contribute predictive power in other models.
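Here is a minimal sketch of those fits. The column names (signal for the 0/1 label, jet_num for the jet count) are illustrative, not the exact Kaggle column names.

```r
# Full-data fit: inspect p-values and drop features whose
# coefficients cannot be distinguished from zero
full_fit <- glm(signal ~ ., family = binomial,
                data = subset(train_df, select = -jet_num))
summary(full_fit)

# One model per jet-number split
fits_by_jet <- lapply(split(train_df, train_df$jet_num), function(d) {
  glm(signal ~ ., family = binomial, data = subset(d, select = -jet_num))
})

p_logit <- predict(full_fit, type = "response")  # signal probabilities
```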
The final models were trained on the reduced set of variables/features, and separate runs were made on the full dataset and on the split datasets. AMS scores differed significantly between the two approaches, with the split datasets ultimately providing the better accuracy and thus the better AMS score on the training data. While these scores were low compared to the XGBoost model, we included logistic regression in the ensemble to see whether it could provide an incremental boost. The full model ultimately scored 1.20 on the test data, while the split models, using a threshold that maximized the AMS on the full dataset, scored 1.83 on the test data; the split-data model was therefore the one we carried into the ensemble.
The ensemble model generated the most accurate predictions and would be a prime candidate for feeding into a stacked model
The models were combined in an ensemble, first using only the logistic regression and XGBoost models, and then again with the random forest model added. The optimized weights were as follows:
| Ensemble | XGB | RF | LOGIT | Threshold | Kaggle Test AMS |
| --- | --- | --- | --- | --- | --- |
| 2-model ensemble | 0.98 | -- | 0.02 | 0.93 | 3.60534 |
| 3-model ensemble | 0 | 0.9 | 0.1 | 0.45 | 2.79415 |
The results for the 2-model ensemble are not altogether surprising: it heavily favors the individual model that produces the highest AMS score, the XGBoost model. The AMS score of the ensemble was 3.605 when uploaded to the Kaggle private leaderboard, which would have placed us 569th. This was our best result.
The figure above shows the training AMS score on the y-axis and the model weights on the x-axis for the optimized probability threshold of 0.93 for the signal/background classification. Because the two weights sum to 1, each position on the x-axis corresponds to a unique pair of weights, and the height of the curve at that position gives the AMS score of that particular ensemble. The highest AMS score on the training data was achieved with the XGBoost weight at 0.98 and the logit weight at 0.02.
The results for the 3-model ensemble were more surprising and require deeper investigation to fully understand. The most accurate model, the XGBoost model, was given a weight of 0 and therefore had no influence on the 3-model ensemble results. The random forest dominated the weights, leaving 10% of the probability weight to the logistic regression model, with the AMS maximized at a threshold of 0.45. The test AMS on the Kaggle private leaderboard was 2.794, significantly worse than the XGBoost + logit ensemble, indicating that the dominating model, the random forest, was overfit to the training data.
Since the weights were chosen on the training data, it makes sense that the random forest model would dominate. To rein in the influence of the random forest, a simple voting or averaging ensemble would be a logical next step, keeping the benefits of each model while not allowing any one of them to dominate the results. The ensemble's predictions could then be added as another predictor variable in the original dataset and fed into another XGBoost model. This approach is known as stacking and has been shown in other contests to produce even better results.
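A hedged sketch of that stacking idea is below; it reuses the assumed train_x, train_y, and p_ens objects from earlier and is not something we ran for this post. In practice the first-level probabilities should come from out-of-fold predictions so the second-level model does not simply memorize the training labels.

```r
library(xgboost)

# Append the ensemble probability as an extra feature
stacked_x <- cbind(as.matrix(train_x), p_ensemble = p_ens)

# Second-level XGBoost model on the augmented data
stack_fit <- xgboost(
  data      = stacked_x,
  label     = train_y,
  objective = "binary:logistic",
  eta       = 0.05,   # illustrative, untuned values
  max_depth = 6,
  nrounds   = 100,
  verbose   = 0
)
```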
Takeaway
- A weighted ensemble of two models produced the best predictive results
- The individual models predicted best in the following order: XGBoost, random forest, logistic regression
- The random forest and logistic regression models provided interpretability that the XGBoost and ensembles could not, showing the mass variables to be most influential
- Weighted ensembles may favor models that are overfit to the training data and therefore perform poorly on the test data
Future Work
- Stacking the results of the ensemble model
- Investigating averaging and voting ensembles