Kaggle Higgs Boson Machine Learning Challenge
This blog post presents a comprehensive exploratory data analysis of the Higgs Boson Machine Learning Challenge. In particular, I concentrate on feature engineering and selection. The response variable is binary: an event is either a Higgs signal or background. A secondary goal is to find whether certain properties of the data can help us decide which models are more appropriate for this problem and what our choice of parameters for those models should be. I apply all the insights learned to obtain an AMS score of 3.53929.
Contents
- Data
- Mystery of "-999.0"
- AMS Metric
- Data Preprocessing
- Response
- Features
- Feature Reduction
- Resultant Data for Predictive Modelling
- A sample xgboost model
- Conclusions
- Future Work
1 Data
A description of the data is available from Kaggle's website for the competition. The data consists mainly of a training file and a test file, for which predictions are made and submitted for evaluation. The snapshot below shows this information as it is displayed on the website.
2 Mystery of "-999.0"
One peculiar aspect of both the training and test data is that many values are -999.0. The data's description states that those entries are meaningless or cannot be computed, and that -999.0 is placed in those entries to indicate this.
One can analyze the nature and extent of these -999.0 entries by replacing them with NA's and performing a missing-data analysis. We start by plotting, for each feature, the fraction of the training data that is -999.0, along with the combinations of such features that occur in the dataset.
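A minimal sketch of this check in Python, assuming the standard `training.csv` file from the competition (the file name is an assumption; the -999.0 placeholder is from the data description):

```python
import pandas as pd
import numpy as np

# Load the training data; -999.0 is the competition's placeholder for undefined values
train = pd.read_csv("training.csv")
train_na = train.replace(-999.0, np.nan)

# Fraction of undefined (formerly -999.0) values per feature
na_fraction = train_na.isna().mean()
print(na_fraction[na_fraction > 0].sort_values(ascending=False))

# Distinct missingness patterns: each row becomes the tuple of columns that are undefined
patterns = train_na.isna().apply(lambda row: tuple(row[row].index), axis=1)
print(patterns.value_counts())
```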
The plot on the left shows that there are 11 columns containing -999.0 values, falling into three subgroups of 1, 3, and 7 columns. The combination plot in the figure indicates that there are 6 such combinations. Doing the same analysis on the submission data gives exactly the same plot. This indicates that the original data can be subdivided into 6 groups in terms of which features have -999.0.
Investigating the names of the features with -999.0 shows that the affected columns are: ["DER_mass_MMC"], ["PRI_jet_leading_pt", "PRI_jet_leading_eta", "PRI_jet_leading_phi"], and ["DER_deltaeta_jet_jet", "DER_mass_jet_jet", "DER_prodeta_jet_jet", "DER_lep_eta_centrality", "PRI_jet_subleading_pt", "PRI_jet_subleading_eta", "PRI_jet_subleading_phi"].
Diving further into the technical documentation, it becomes evident that the group of 7 features is associated with two-jet events and is undefined for events with one jet or no jets. Furthermore, the group of 3 features is associated with the leading jet (at least one jet) and is undefined for events with no jets. There are also certain observations where the Higgs mass is not defined, and this is not jet dependent. In conclusion, the original data can be subdivided into six groups (2 x 3) in terms of whether or not the Higgs mass is defined and, correspondingly, whether the event resulted in one jet, more than one jet, or no jets at all. We add two new features to incorporate this information.
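A sketch of the two added indicator features, continuing from the frame loaded above. It assumes the standard column names `DER_mass_MMC` and `PRI_jet_num` from the competition data; the new feature names are my own:

```python
# Whether the Higgs mass estimate is defined for the event
train["higgs_defined"] = (train["DER_mass_MMC"] != -999.0).astype(int)

# Jet multiplicity group: 0 jets, 1 jet, or 2+ jets (PRI_jet_num takes values 0-3)
train["jet_group"] = train["PRI_jet_num"].clip(upper=2)
```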
3 AMS Metric
AMS is the evaluation metric for the competition. It depends on the weighted signal, the weighted background, and the weights associated with them in a peculiar manner, as shown below.
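The metric, as given in the competition documentation, is the approximate median significance, where s is the sum of the weights of the true positives (signal events classified as signal), b is the sum of the weights of the false positives (background events classified as signal), and b_reg = 10 is a regularization constant:

$$\mathrm{AMS} = \sqrt{2\left[(s + b + b_{\mathrm{reg}})\,\ln\!\left(1 + \frac{s}{b + b_{\mathrm{reg}}}\right) - s\right]}, \qquad b_{\mathrm{reg}} = 10.$$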
A plot of the AMS over the full training-data range of s and b is shown below. The AMS color is saturated for scores above 4, since the top leaderboard score is less than 4 and we want to concentrate on what makes those models good.
The red region corresponds to high AMS scores, which are linked to low false positive and high true positive rates. That is expected, but the peculiar aspect of the figure above is that there is a range of models which can achieve the top-ranked score of 4; one such model would identify just 25% of the Higgs signal correctly while keeping the false positive rate at 3%, and still achieve an AMS score of 4.
Next, we investigate how much the AMS score on the training data is influenced by performance on the 6 data subgroups we identified in the previous section. The red line in the previous plot is produced by a perfect prediction of the training data. The blue line below indicates the AMS if every prediction is labeled as signal. For each of the data subgroups we calculate the AMS score by assigning either signal or background to all the events in that subgroup and using correct predictions for the rest of the data.
The figure above shows that performance on the data where the Higgs mass is not defined has almost no effect on the AMS score if everything in those subgroups is classified as background. This behavior occurs because there is very little signal in the data where the Higgs mass is not defined. Moreover, the weights of those signals are low while the background carries higher weights. Together, these factors make identifying everything as background nearly as good as identifying everything correctly.
Thus, classifying all events where the Higgs mass is not defined as background performs essentially as well in terms of the AMS score. This reduces the subgroups of data for predictive modeling to 3: the Higgs mass is defined and the event resulted in either one jet, more than one jet, or no jets at all.
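For reference, a minimal sketch of the subgroup experiment described above, continuing from the indicator features added earlier (the helper names are my own; the standard `Label` and `Weight` columns of the training set are assumed):

```python
import numpy as np

def ams(s, b, b_reg=10.0):
    """Approximate median significance from the competition definition."""
    return np.sqrt(2 * ((s + b + b_reg) * np.log(1 + s / (b + b_reg)) - s))

def ams_for_prediction(df, pred):
    """AMS on the training set for a vector of predicted labels ('s'/'b')."""
    selected = pred == "s"
    s = df.loc[selected & (df["Label"] == "s"), "Weight"].sum()
    b = df.loc[selected & (df["Label"] == "b"), "Weight"].sum()
    return ams(s, b)

# For each subgroup, force all its events to background while keeping perfect
# predictions elsewhere, then see how much the overall AMS degrades.
for group, idx in train.groupby(["higgs_defined", "jet_group"]).groups.items():
    pred = train["Label"].copy()   # perfect prediction everywhere...
    pred.loc[idx] = "b"            # ...except this subgroup, set to background
    print(group, ams_for_prediction(train, pred))
```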
4 Data Preprocessing
We prepare our subgroups of data, for both training and test, by removing all features that contain NA's or have become redundant, and then scaling the data. Finally, we are left with only 60% of the original data being useful for predictive modeling. Doing the same for the submission data also leaves 60% of it for predictive modeling, further cementing the idea of dropping NA's, setting aside the data where the Higgs mass is undefined, and splitting the original data into three subgroups.
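A sketch of this preprocessing, continuing from the indicator features above. The exact columns kept depend on the subgroup, and the helper name is my own:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

def make_subgroup(df, jet_group):
    """Keep Higgs-defined events of one jet group, drop undefined/redundant columns, scale."""
    sub = df[(df["higgs_defined"] == 1) & (df["jet_group"] == jet_group)].copy()
    sub = sub.replace(-999.0, np.nan).dropna(axis=1)   # drop features undefined in this subgroup
    sub = sub.loc[:, sub.nunique() > 1]                # drop columns that became constant
    feature_cols = [c for c in sub.columns
                    if c not in ("EventId", "Label", "Weight", "higgs_defined", "jet_group")]
    sub[feature_cols] = StandardScaler().fit_transform(sub[feature_cols])
    return sub

subgroups = {g: make_subgroup(train, g) for g in (0, 1, 2)}
```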
5 Response
We explore the response of the Higgs Kaggle data, which is named Label in the dataset. Plotting a histogram of the weights associated with the Labels indicates that the Label and the weights are dependent on each other. The signals exclusively have lower weights assigned to them than the background, and there is no intermingling between the two Labels. This is a pretty good reason for Kaggle not to provide weights for the test set!
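A quick sketch of this check with matplotlib, continuing from the training frame above:

```python
import matplotlib.pyplot as plt

# Overlaid histograms of the event weights for signal and background
for label, color in (("s", "tab:red"), ("b", "tab:blue")):
    plt.hist(train.loc[train["Label"] == label, "Weight"],
             bins=100, alpha=0.5, color=color, label=label)
plt.xlabel("Weight")
plt.ylabel("Count")
plt.legend()
plt.show()
```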
A density plot of the background weights does show some distinction between the three subgroups, but nothing obvious, so we leave it at that and come back to it if needed.
A histogram of the signal weights shows that there are three weights, or channels, in which the Higgs is being sought. The "no jet" events carry higher weights, while the "two or more jets" events are larger in number but have lower weights.
6 Features
Correlation
We first look at the correlation among the features for the three data subgroups. The order here is {2, 1, 0}. The upper triangle, which corresponds to the DER (derived) features, shows correlation in all three subgroups. These are not raw observations but features engineered by the CERN group using particle physics. The lower-right triangle, which contains the PRI (primitive) features, shows little correlation. It could be a good idea to drop all the correlated, engineered DER features.
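A sketch of one such correlation heat map for a single subgroup, continuing from the preprocessed frames above, with the DER features ordered before the PRI features so the two blocks are visible (seaborn is used here only for plotting):

```python
import matplotlib.pyplot as plt
import seaborn as sns

sub = subgroups[2]
der_cols = [c for c in sub.columns if c.startswith("DER")]
pri_cols = [c for c in sub.columns if c.startswith("PRI")]

corr = sub[der_cols + pri_cols].corr()
sns.heatmap(corr, cmap="coolwarm", center=0)
plt.title("Feature correlations, subgroup 2 (DER block first, then PRI)")
plt.show()
```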
Principal Component Analysis (PCA)
Examining the variance explained by the principal components indicates that there is room for feature reduction in all three subgroups. For example, for subgroup 2, fifteen components out of 30 explain 90% of the variance in the data, and 24 components explain 99% of the variance.
Subgroup 1: {90, 95, 99}% of variance = {11, 13, 15} components
Subgroup 0: {90, 95, 99}% of variance = {9, 10, 13} components
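A sketch of the cumulative explained-variance computation for one subgroup, continuing from the frames above:

```python
import numpy as np
from sklearn.decomposition import PCA

X = subgroups[2][[c for c in subgroups[2].columns if c.startswith(("DER", "PRI"))]]
pca = PCA().fit(X)

cumvar = np.cumsum(pca.explained_variance_ratio_)
for target in (0.90, 0.95, 0.99):
    # first component count whose cumulative explained variance reaches the target
    print(f"{target:.0%}: {np.searchsorted(cumvar, target) + 1} components")
```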
PCA Eigenvectors
Our next challenge is to identify which original features have little or no influence in explaining the data. We start by multiplying the PCA eigenvalues by their corresponding eigenvectors and plotting the projections on a heat map. This product represents the transformed variance along the PCA directions. We then sort it along the horizontal axis by the PCA eigenvalues, from the lowest on the left to the highest on the right. As is evident from the figure below, the transformed variance for the low-eigenvalue components (starting from F30 and going towards F1) is essentially zero, which is what we would expect from the PCA plot in the last section.
We now sort the original features along the vertical axis with respect to their contribution to the variance, ordering the transformed variance products in descending order. We sum the absolute values of the contributions rather than the raw contributions (since a feature can have a positive or negative projection, indicated by red and blue colors). At the end of this process, the features that contribute the least variance are displayed at the bottom.
The last 9 features in the above plot stand out from the rest, as they have white blocks, i.e. zero contribution, towards the first four principal components. Another important observation is that they are all phi or eta angle features. Similarly, the phi and eta angles in subgroup 1 and subgroup 0 show the same behavior. In the next section we will see why they are the least useful in explaining the data.
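A sketch of the scaling-and-sorting that produces a heat map like the one above, continuing from the PCA fit in the previous subsection:

```python
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

# Scale each eigenvector by its eigenvalue: rows = original features, columns = components
loadings = pca.components_.T * pca.explained_variance_

# Components left-to-right from lowest to highest eigenvalue,
# features top-to-bottom by total absolute contribution
col_order = np.argsort(pca.explained_variance_)             # lowest eigenvalue first
row_order = np.argsort(np.abs(loadings).sum(axis=1))[::-1]  # biggest contributors on top

sns.heatmap(loadings[np.ix_(row_order, col_order)], cmap="coolwarm", center=0,
            yticklabels=np.array(X.columns)[row_order])
plt.show()
```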
Density Plots
Let's look at the density plots of the last 9 "whitewashed" features of subgroup 2. We see they share a few common characteristics. As pointed out earlier, they are all angle features for the directions of particles and jets. The 5 phi angle features are uniformly and identically distributed over their range for both the signal and the background. This is true to some extent for the eta features as well, but for phi it is strikingly so. Conceptually this makes sense, as the particles and jets scatter off in all directions whether they are signal or background. Thus, these variables follow a uniform distribution.
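A sketch of one such density comparison, using `PRI_tau_phi` as an example column and assuming seaborn 0.11 or later:

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Signal vs. background densities for one phi angle feature
sns.kdeplot(data=subgroups[2], x="PRI_tau_phi", hue="Label", common_norm=False)
plt.show()
```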
The plot below contrasts this uniform distribution of the least influential features with the density plots of the 9 most influential features.
The same is evident for the angle features of subgroup 1 and subgroup 0.
7 Feature Reduction
The last section gave us plenty to think about regarding which features are least influential in explaining the variance. However, success in this particular Higgs Kaggle competition is determined by maximizing the AMS score. Thus, discarding any features, or for that matter any amount of data, may not be a wise decision. But we can still use the above insights as a guide, and I propose the following approach.
First Iteration
- Drop DER features
- Drop eta features
- Drop phi features
- Assign Background to Higgs Undefined
- Separate predictive modeling on 3 subgroups
Second Iteration
- Drop eta features
- Drop phi features
- Assign Background to Higgs Undefined
- Separate predictive modeling on 3 subgroups
Third Iteration
- Drop phi features
- Assign Background to Higgs Undefined
- Separate predictive modeling on 3 subgroups
Fourth Iteration (if needed)
- Assign Background to Higgs Undefined
- Separate predictive modeling on 3 subgroups
Fifth Iteration (if needed)
- Predictive modeling on 3 subgroups and Higgs Undefined
Sixth Iteration (Nope. Do something else)
- Brute force on full data
8 Resultant Data for Predictive Modeling
Let's evaluate the approach outlined above. Using only the data that matters reduces the computation power and time needed. The resulting model will also be more accurate, as it reduces noise. Moreover, having less data to deal with means one can try more computationally expensive models, like neural networks and SVMs, and try to get lucky with automatic feature engineering. To quantify this benefit, we plot the amount of data used at each modeling iteration.
9 A sample xgboost Model
We fit an xgboost gradient-boosted tree model to our training data using the insights above. I chose xgboost here for its low variance, low bias, high speed, and good accuracy. We follow the "Third Iteration" scheme from section 7: we prepare the training and testing sets by dropping the phi variables, assigning the background label to data where the Higgs mass is not defined, and splitting the data where the Higgs mass is defined into 3 subgroups.
"AUC"Β is the metric of choice hereΒ as it responds well to misclassification errors. Β The optimal number of trees to maximize the AUC score will be found by cross validation.Β We fit the whole training data toΒ the optimal number of trees for each dataset and make predictions for the test data to submit to Kaggle.Β
The private AMS score for this model is 3.53929. That's satisfactory for me at the moment, considering we didn't tune any hyperparameters except for the number of trees, and we set the same threshold for all three subgroups. One would need to do a grid search to find three different thresholds.
We use "AUC" as our choice of metric as it responds well to misclassification errors. Β We use cross validation to find numbers of trees which maximizes AUC .
We fit the whole training data to most optimal number of trees for each dataset and make predictions for testΒ data and prepareΒ file for submission on Kaggle.
Private AMS score for this model isΒ 3.53929 . That's satisfactory for me at the moment considering we didn't tune any hyper-parameters except number of trees Β here . Β Plus we set same threshold for all three subgroups. One need to do a grid search to find the three different thresholds.
10 Conclusions
- -999.0 entries are placeholders for undefined values; any attempt at imputation is plain wrong, won't work, and will only make things worse
- The -999.0 values split the original data into six subgroups, i.e. whether the Higgs mass is defined (or not) and, correspondingly, how many jets are formed (0, 1, or more than 1)
- A high AMS score requires predominantly a low false positive rate and also a high true positive rate
- Setting everything to background for observations where the Higgs mass is not defined has only a tiny effect on the AMS score
- One can effectively do predictive modeling on the three subgroups of data where the Higgs mass is defined, split by how many jets are formed (0, 1, or more than 1)
- The weights and the Label are dependent on each other
- The Higgs is being sought in three channels in the data
- The DER variables are correlated with each other, while the PRI variables are, for the most part, uncorrelated
- The phi and eta angle features have the least influence in explaining the variance
- Only 16% of the data is uncorrelated data that explains most of the variance
- A simple xgboost model gives a respectable AMS score of 3.53929
11 Future Work
- Grid search for hyperparameters of xgboost model and thresholds
- Ensemble Methods
- Stacking
- Feature engineering
- Use "iteration 1" , "2" with the least amount of uncorrelated and significant data with more computationally expensive methods such as Neural networks and SVMs to do automatic feature engineering