XGBoost: A Fast and Accurate Boosting Trees Model
About the author: Tong He is a data scientist at SupStat Inc. and a master's student at Simon Fraser University. His current research interests include machine learning, data mining and bioinformatics.
In data analysis, we usually build models to make predictions on the data. Among the choices in R, randomForest, gbm and glmnet are three exceptionally popular packages, since they appear in almost all the data mining competitions on Kaggle. In my experience, gbm uses less memory and time than randomForest, and users tend to prefer it. In Python's sklearn library, we also have the GradientBoostingClassifier module.
A boosting classifier is an ensemble model: the basic idea is to aggregate hundreds of less accurate tree-based models into one very accurate model. The model is built iteratively, adding a new tree-based model at each step. People have proposed various ways to get a reasonable base model. Friedman's Gradient Boosting Machine uses gradient descent: each new tree is built to decrease the objective along the direction of the gradient. In practice we need to generate thousands of trees to get an excellent result on a relatively large data set. However, existing implementations of the algorithm are not fast enough, so we may have to wait a long time for the result.
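To make the iterative idea concrete, here is a minimal sketch in base R, assuming squared error loss and depth-1 trees (stumps) on a single feature. For squared error the negative gradient is simply the residual, so each new stump is fitted to the current residuals. The names fit_stump, predict_stump and boost are made up for this illustration; this is not how gbm or XGBoost is implemented.

```r
# Fit a depth-1 regression tree (a "stump") to the residuals r on one feature x.
fit_stump <- function(x, r) {
  xs <- sort(unique(x))
  # Candidate thresholds: midpoints between consecutive distinct values.
  thresholds <- (head(xs, -1) + tail(xs, -1)) / 2
  best <- NULL
  best_sse <- Inf
  for (threshold in thresholds) {
    left <- x <= threshold
    pred_left <- mean(r[left])
    pred_right <- mean(r[!left])
    sse <- sum((r[left] - pred_left)^2) + sum((r[!left] - pred_right)^2)
    if (sse < best_sse) {
      best_sse <- sse
      best <- list(threshold = threshold, left = pred_left, right = pred_right)
    }
  }
  best
}

predict_stump <- function(stump, x) {
  ifelse(x <= stump$threshold, stump$left, stump$right)
}

# Gradient boosting for squared error loss: every round fits a stump to the
# negative gradient (the residual) and adds it with step size eta.
boost <- function(x, y, n_rounds = 50, eta = 0.3) {
  pred <- rep(mean(y), length(y))
  stumps <- vector("list", n_rounds)
  mse <- numeric(n_rounds)
  for (m in seq_len(n_rounds)) {
    residual <- y - pred
    stumps[[m]] <- fit_stump(x, residual)
    pred <- pred + eta * predict_stump(stumps[[m]], x)
    mse[m] <- mean((y - pred)^2)
  }
  list(base = mean(y), stumps = stumps, eta = eta, mse = mse)
}

# Toy data: a noisy sine curve.
set.seed(1)
x <- runif(200, 0, 2 * pi)
y <- sin(x) + rnorm(200, sd = 0.1)
model <- boost(x, y)
print(round(model$mse[c(1, 10, 50)], 4))
```

The printed training errors shrink as more stumps are added, which is the aggregation effect described above. A real implementation grows deeper trees over many features and supports other loss functions.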
Now we have XGBoost to solve this problem. XGBoost is short for "eXtreme Gradient Boosting". It is a gradient boosting implementation in C++, written by Tianqi Chen, a Ph.D. student at the University of Washington. He felt limited by the efficiency of the existing boosting libraries, so he started the project in early 2014, and the tool took shape in the summer of 2014. Its algorithm improves on the vanilla gradient boosting model, and it automatically runs in parallel on a multi-threaded CPU. XGBoost made its debut in the Higgs boson signal competition on Kaggle and became popular afterwards. Nowadays many competition winners use XGBoost in their models.
To make the tool available to more users, Tianqi developed its Python interface and I developed the R interface, which is now on CRAN. The following sections focus on the general R interface. I suggest that readers first get a basic idea of XGBoost's features and then learn the exact interface from the documentation.
1. Basic functions
First we can install the package from CRAN by running install.packages("xgboost") in R.
To follow the latest version, we can install from GitHub instead.
Time to code! Run the following code to load the sample data:
    require(xgboost)
    data(agaricus.train, package='xgboost')
    data(agaricus.test, package='xgboost')
    train <- agaricus.train
    test <- agaricus.test
The task is to judge from its attributes whether a mushroom is poisonous. Each attribute is encoded as 1 if present and 0 if absent, so the data is stored as a sparse matrix.
Don't worry about that, because XGBoost supports both dense and sparse matrices as input. Here is the training command:
    > bst <- xgboost(data = train$data, label = train$label, max.depth = 2, eta = 1,
    +                nround = 2, objective = "binary:logistic")
    train-error:0.046522
    train-error:0.022263
We have iterated twice, and the training error is printed after each round. If the data is too large to load in R, users can set data = 'path_to_file' to read it directly from the disk. Currently XGBoost supports local data files in the libsvm format.
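For readers who have not seen the libsvm format, each line holds a label followed by index:value pairs for the non-zero features only. The base R sketch below writes a tiny 0/1 matrix in that layout. The helper name to_libsvm and the toy data are made up for illustration, and the 1-based feature indices follow the common LIBSVM convention, which is an assumption here; please check which indexing your XGBoost version expects.

```r
# Convert a dense numeric matrix and a label vector into libsvm-style lines:
# "<label> <index>:<value> ..." with only the non-zero entries written out.
# Assumption: feature indices are 1-based, as in the usual LIBSVM convention.
to_libsvm <- function(x, label) {
  stopifnot(nrow(x) == length(label))
  vapply(seq_len(nrow(x)), function(i) {
    nz <- which(x[i, ] != 0)
    feats <- if (length(nz) > 0) paste0(nz, ":", x[i, nz]) else character(0)
    paste(c(label[i], feats), collapse = " ")
  }, character(1))
}

# Toy data: three samples with four binary attributes.
x <- matrix(c(1, 0, 0, 1,
              0, 0, 1, 0,
              0, 1, 1, 0), nrow = 3, byrow = TRUE)
label <- c(1, 0, 1)

lines <- to_libsvm(x, label)
path <- tempfile(fileext = ".libsvm")
writeLines(lines, path)
print(readLines(path))
```

A file written this way is the kind of path that the data argument mentioned above is meant to receive.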
Making predictions takes one line:
pred <- predict(bst, test$data)
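With the binary:logistic objective, the predictions are probabilities rather than class labels. As a small sketch of the usual next step, the code below thresholds them at 0.5 and computes the error rate. The vectors pred and label are made-up stand-ins for predict(bst, test$data) and test$label, so the number printed here says nothing about the mushroom data.

```r
# Hypothetical predicted probabilities and true labels (illustration only).
pred <- c(0.91, 0.12, 0.47, 0.78, 0.05)
label <- c(1, 0, 1, 1, 0)

# Turn probabilities into 0/1 classes with a 0.5 threshold.
pred_class <- as.numeric(pred > 0.5)

# Classification error: the share of samples whose class is predicted wrongly.
err <- mean(pred_class != label)
print(err)
```

The threshold of 0.5 is a common default, not a requirement; a different cutoff may suit problems with unbalanced classes better.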
Cross validation is just as convenient: compared with the xgboost function, the xgb.cv function needs only one additional parameter, nfold.
    > cv.res <- xgb.cv(data = train$data, label = train$label, max.depth = 2,
    +                  eta = 1, nround = 2, objective = "binary:logistic",
    +                  nfold = 5)
    train-error:0.046522+0.001102 test-error:0.046523+0.004410
    train-error:0.022264+0.000864 test-error:0.022266+0.003450
    > cv.res
       train.error.mean train.error.std test.error.mean test.error.std
    1:         0.046522        0.001102        0.046523       0.004410
    2:         0.022264        0.000864        0.022266       0.003450
The return value is a data.table containing the error measurements on the training and test folds, which makes it easy to find the best number of rounds.
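As a sketch of how the best number of rounds could be read off such a table, the code below rebuilds the two rows printed above as a plain data.frame and picks the round with the lowest mean test error. The name cv_res is a stand-in for the real return value, and using a data.frame instead of a data.table is a simplification for this illustration.

```r
# The cross validation results shown above, typed in by hand.
cv_res <- data.frame(
  train.error.mean = c(0.046522, 0.022264),
  train.error.std  = c(0.001102, 0.000864),
  test.error.mean  = c(0.046523, 0.022266),
  test.error.std   = c(0.004410, 0.003450)
)

# The best round is the one with the smallest mean error on the test folds.
best_round <- which.min(cv_res$test.error.mean)
print(best_round)
print(cv_res[best_round, ])
```

With only two rounds the answer is trivial, but the same lookup applies when the table has hundreds of rows.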
2. Fast and accurate
The above code is a very brief introduction and the data is too small to show the power of XGBoost. XGBoost is fast for the following reasons:
- XGBoost uses OpenMP, which automatically parallelizes the code on a multi-threaded CPU.
- XGBoost defines its own data structure, DMatrix, to store the data matrix. This data structure does some preprocessing on the data so that later iterations run faster.
We tried our best to keep all the parameters the same across the models and ran the following experiment:
| Model and Parameter | gbm | XGBoost, 1 thread | XGBoost, 2 threads | XGBoost, 4 threads | XGBoost, 8 threads |
| --- | --- | --- | --- | --- | --- |
| Time (in secs) | 761.48 | 450.22 | 102.41 | 44.18 | 34.04 |
The CPU used for this experiment is an i7-4700MQ. The sklearn implementation in Python has similar efficiency to gbm. You can try to reproduce the result by downloading the data and running the code here.
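To put the timings in perspective, the short sketch below turns them into speedup factors, both relative to gbm and relative to single-threaded XGBoost. The numbers are copied from the experiment above; the variable names are made up for this illustration.

```r
# Running times in seconds from the experiment above.
secs <- c(gbm = 761.48,
          xgb_1_thread = 450.22,
          xgb_2_threads = 102.41,
          xgb_4_threads = 44.18,
          xgb_8_threads = 34.04)

# Speedup of every configuration relative to gbm.
speedup_vs_gbm <- secs[["gbm"]] / secs

# Speedup of multi-threaded XGBoost relative to a single thread.
xgb <- secs[-1]
speedup_vs_1_thread <- xgb[["xgb_1_thread"]] / xgb

print(round(speedup_vs_gbm, 1))
print(round(speedup_vs_1_thread, 1))
```

For example, the run with 8 threads comes out roughly 22 times faster than gbm on this machine.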
Besides the significantly improved speed, XGBoost also achieves high accuracy in competitions. At the beginning of the Higgs boson competition, people were surprised to find that the gbm implementations in R and Python could not beat the official benchmark, while XGBoost came out and made it into the top 10 at that time. The main reasons for the improved accuracy are the newly defined regularization term and the pruning approach, which make the learned model more stable. For more details, please check the official documentation.
3. Advanced features
Besides speed and accuracy, XGBoost has many other useful features. The following list contains some of them; each item ends with a link to a demo with sample code.
- As long as you can calculate the first and second derivatives of the loss function, you can customize the training objective in XGBoost. demo
- Users can define their own evaluation metric for cross validation, for example RMSE or RMSLE for regression, and error rate, AUC or F1-score for classification, or even an unusual metric such as the AMS used in the Higgs boson competition. demo
- The cross validation function can return the predictions on each test fold, which helps users build ensemble models more easily. demo
- Users can run 1000 iterations first, check the model's strength, and then continue with another 1000 iterations on top of the previous result. demo
- The model can output the id of the leaf that each data sample falls into. This is one part of the model described in a Facebook paper. demo
- The model can calculate feature importance and plot the trees. demo
- Users can boost regularized linear models instead of trees. demo
These features enable users to apply the tool in a variety of scenarios. In fact, many of them came from user requests.
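The first item in the list says that a custom objective only needs the first and second derivatives of the loss. As a minimal sketch of what that means for the logistic loss, the base R code below computes both derivatives with respect to the raw score and checks them against finite differences. The names logistic_obj and logloss and the sample values are made up for illustration. In the package itself the objective function is handed the training data object rather than a bare label vector, so please treat this as a simplification and check the demo for the exact interface.

```r
# Logistic loss for a raw score (margin) and a 0/1 label.
logloss <- function(margin, label) {
  p <- 1 / (1 + exp(-margin))
  -(label * log(p) + (1 - label) * log(1 - p))
}

# First and second derivatives of the logistic loss with respect to the margin.
# This pair (gradient, hessian) is what a custom objective has to supply.
logistic_obj <- function(margin, label) {
  p <- 1 / (1 + exp(-margin))
  list(grad = p - label, hess = p * (1 - p))
}

# Made-up raw scores and labels for a numerical check.
margin <- c(-2, -0.5, 0, 1.5)
label <- c(0, 1, 1, 0)
d <- logistic_obj(margin, label)

# Central finite differences as an independent check of both derivatives.
h <- 1e-4
num_grad <- (logloss(margin + h, label) - logloss(margin - h, label)) / (2 * h)
num_hess <- (logloss(margin + h, label) - 2 * logloss(margin, label) +
             logloss(margin - h, label)) / h^2

print(max(abs(d$grad - num_grad)))
print(max(abs(d$hess - num_hess)))
```

Both printed differences should be tiny, which confirms that the analytic derivatives match the loss. The same check is a cheap way to debug any other loss before plugging it into training.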
4. Learning Sources
This article covers only the basics. We have provided several scripts to help you understand the tool better:
- The folder for all the sample scripts
- The script for the higgs boson competition
- The script for the otto competition
If you are interested in a deeper understanding of the algorithm or the tool, you may find the following links useful: