XGBoost: A Fast and Accurate Boosting Trees Model

Posted on Oct 15, 2015

The Author: Tong He is a data scientist at SupStat Inc. and a master's student at Simon Fraser University. His current research interests include machine learning, data mining and bioinformatics.

In data analysis, we usually build models to make predictions from data. Among the choices in R, randomForest, gbm and glmnet are three exceptionally popular packages, since they appear in almost all the data mining competitions on Kaggle. In my personal experience, gbm uses less memory and time than randomForest, and users indeed prefer it. Python's sklearn library offers a counterpart, the GradientBoostingClassifier class.

A boosting classifier is an ensemble model. The basic idea is to aggregate hundreds of less accurate tree-based models into one very accurate model. The model is usually built iteratively, adding a new tree-based model at each step. People have proposed various ways to get a reasonable base model. Friedman's Gradient Boosting Machine uses gradient descent: each new tree is built to decrease the objective along the direction of the negative gradient. In practice we need to generate thousands of trees to get an excellent result on a relatively large data set. However, the existing implementations of the algorithm are not fast enough, so we may have to wait a long time for the result.
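To make the iterative idea concrete, here is a minimal sketch in base R. It is a toy illustration under simplifying assumptions (squared-error loss, one feature, depth-1 trees), not the algorithm used by gbm or XGBoost. The function names fit_stump, predict_stump and boost are hypothetical helpers written for this example.

```r
# Toy gradient boosting for squared-error loss with depth-1 trees (stumps).
# All names here are illustrative; this is not the gbm or XGBoost code.

# Fit one stump to the residuals r: try every midpoint between sorted unique
# x values and keep the split with the lowest squared error.
fit_stump <- function(x, r) {
  xs <- sort(unique(x))
  thresholds <- (head(xs, -1) + tail(xs, -1)) / 2
  best <- list(sse = Inf)
  for (thr in thresholds) {
    left <- x <= thr
    pl <- mean(r[left])
    pr <- mean(r[!left])
    sse <- sum((r[left] - pl)^2) + sum((r[!left] - pr)^2)
    if (sse < best$sse) {
      best <- list(sse = sse, thr = thr, left = pl, right = pr)
    }
  }
  best
}

predict_stump <- function(stump, x) {
  ifelse(x <= stump$thr, stump$left, stump$right)
}

boost <- function(x, y, n_rounds = 50, eta = 0.3) {
  f0 <- mean(y)                      # start from a constant prediction
  pred <- rep(f0, length(y))
  stumps <- vector("list", n_rounds)
  for (m in seq_len(n_rounds)) {
    residual <- y - pred             # negative gradient of 0.5 * (y - pred)^2
    stumps[[m]] <- fit_stump(x, residual)
    pred <- pred + eta * predict_stump(stumps[[m]], x)  # shrunken update
  }
  list(f0 = f0, eta = eta, stumps = stumps, train_pred = pred)
}

# Simulated data, only for demonstration.
set.seed(1)
x <- runif(200, 0, 2 * pi)
y <- sin(x) + rnorm(200, sd = 0.1)

model <- boost(x, y)
mse_start <- mean((y - model$f0)^2)          # error of the constant model
mse_end <- mean((y - model$train_pred)^2)    # error after 50 boosting rounds
```

Each round fits a weak model to what the current ensemble still gets wrong, so the training error of the sum keeps shrinking even though every single stump is inaccurate.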

Now we have XGBoost to solve this problem. XGBoost is short for "eXtreme Gradient Boosting". It is a gradient boosting implementation in C++, written by Tianqi Chen, a Ph.D. student at the University of Washington. He felt limited by the efficiency of the existing boosting libraries, so he started the project in early 2014. The tool took shape in the summer of 2014. Its algorithm improves on the vanilla gradient boosting model, and it automatically runs in parallel on a multi-threaded CPU. XGBoost made its debut in the Higgs boson signal competition on Kaggle and became popular afterwards. Nowadays many competition winners use XGBoost in their models.

To make the tool available to more users, Tianqi developed its Python interface and I developed the R interface, which is now on CRAN. The following sections focus on the general R interface. I suggest that readers first get a basic idea of XGBoost's features and then learn the exact interface from the documentation.

1. Basic functions

First we can install the package from CRAN:

install.packages('xgboost')

To follow the latest version, we can install from GitHub:

devtools::install_github('dmlc/xgboost',subdir='R-package')

Time to code! Run the following code to load the sample data:

require(xgboost)
data(agaricus.train, package='xgboost')
data(agaricus.test, package='xgboost')
train <- agaricus.train
test <- agaricus.test

This data set asks us to judge whether a mushroom is poisonous or not from its attributes. Each attribute is encoded as 1 if present and 0 if absent, so the data is stored as a sparse matrix.

Don't worry about that, because XGBoost supports both dense and sparse matrices as input. Here comes the training command:

> bst <- xgboost(data = train$data, label = train$label, max.depth = 2, eta = 1,
+                nround = 2, objective = "binary:logistic")
[0] train-error:0.046522
[1] train-error:0.022263

We have iterated twice, and the training error is printed after each round. If the data is too large to load in R, users can set data = 'path_to_file' to read it directly from disk. Currently XGBoost supports local data files in the libsvm format.

It takes just one line to make predictions:

pred <- predict(bst, test$data)
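With the "binary:logistic" objective, the predictions are probabilities rather than class labels. A short sketch of turning them into classes and measuring the error follows. The vectors below are hypothetical stand-ins so that the snippet runs on its own; with the objects from this article you would use pred and test$label instead.

```r
# Hypothetical stand-ins for the article's `pred` and `test$label`.
pred  <- c(0.91, 0.12, 0.67, 0.05, 0.48)
label <- c(1, 0, 0, 0, 1)

# Threshold the predicted probabilities at 0.5 to get 0/1 classes.
pred_class <- as.numeric(pred > 0.5)

# Share of samples whose predicted class differs from the true label.
test_error <- mean(pred_class != label)
```

The 0.5 threshold is a common default, not a requirement; a different cut-off may suit problems where the two kinds of mistakes have different costs.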

Cross validation is very convenient, since the xgb.cv function only needs one additional parameter, 'nfold', compared with xgboost:

> cv.res <- xgb.cv(data = train$data, label = train$label, max.depth = 2, 
+                  eta = 1, nround = 2, objective = "binary:logistic", 
+                  nfold = 5)
[0] train-error:0.046522+0.001102   test-error:0.046523+0.004410
[1] train-error:0.022264+0.000864   test-error:0.022266+0.003450
> cv.res
   train.error.mean train.error.std test.error.mean test.error.std
1:         0.046522        0.001102        0.046523       0.004410
2:         0.022264        0.000864        0.022266       0.003450

Its return value is a data.table containing the measurements on training and testing folds. One can easily track the best number of rounds.
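As a sketch of how to track the best number of rounds, the snippet below rebuilds the table printed above as a plain data.frame (so it runs without the package) and picks the round with the lowest mean test error. With a real run you would index cv.res in the same way; the names cv_res, best_round and best_error are only for this example.

```r
# The cross validation results printed above, copied into a data.frame.
cv_res <- data.frame(
  train.error.mean = c(0.046522, 0.022264),
  train.error.std  = c(0.001102, 0.000864),
  test.error.mean  = c(0.046523, 0.022266),
  test.error.std   = c(0.004410, 0.003450)
)

# Row index of the smallest mean test error, i.e. the best round.
best_round <- which.min(cv_res$test.error.mean)
best_error <- cv_res$test.error.mean[best_round]
```

With only two rounds the answer is simply the last one; the same two lines become useful when nround is in the hundreds and the test error starts to rise again.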

2. Fast and accurate

The above code is a very brief introduction and the data is too small to show the power of XGBoost. XGBoost is fast for the following reasons:

  1. XGBoost uses OpenMP, which automatically parallelizes the code on a multi-threaded CPU.
  2. XGBoost defines its own data structure, DMatrix, to store the data matrix. This data structure does some preprocessing work on the data so that the later iterations are faster.
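The two points above can be sketched as follows: build the DMatrix once, then train on it with an explicit thread count. This is a minimal sketch that assumes the xgboost package is installed (the code is skipped otherwise), and it uses the underscore parameter names max_depth and nthread of the underlying library. Argument names can differ between package versions, so please check the documentation of the version you have.

```r
# Runs only if the xgboost package is available.
if (requireNamespace("xgboost", quietly = TRUE)) {
  library(xgboost)
  data(agaricus.train, package = "xgboost")

  # Build the DMatrix once; the label is stored together with the features.
  dtrain <- xgb.DMatrix(data = agaricus.train$data,
                        label = agaricus.train$label)

  # nthread sets the number of OpenMP threads used for training.
  params <- list(max_depth = 2, eta = 1, nthread = 2,
                 objective = "binary:logistic")
  bst <- xgb.train(params = params, data = dtrain, nrounds = 2)
}
```

Reusing dtrain across several training or cross validation calls avoids repeating the preprocessing for every run.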

We tried our best to keep all the parameters the same and ran the following experiment:

Model     Threads   Time (in secs)
gbm       -         761.48
XGBoost   1         450.22
XGBoost   2         102.41
XGBoost   4         44.18
XGBoost   8         34.04

The CPU for this experiment is an i7-4700MQ. sklearn in Python has similar efficiency to gbm. You can try to reproduce the result by downloading the data and running the code here.
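To read the timings as speedups, a few lines of arithmetic are enough. The numbers are the ones reported in the experiment above; the variable names are only for this example.

```r
# Timings in seconds from the experiment above.
times <- c(gbm = 761.48, xgb_1 = 450.22, xgb_2 = 102.41,
           xgb_4 = 44.18, xgb_8 = 34.04)

# How many times faster each run is than gbm.
speedup_vs_gbm <- times[["gbm"]] / times

# How many times faster each multi-threaded run is than XGBoost on 1 thread.
speedup_vs_one_thread <- times[["xgb_1"]] / times[-1]

print(round(speedup_vs_gbm, 2))
print(round(speedup_vs_one_thread, 2))
```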

Besides the significantly improved speed, XGBoost also achieves high accuracy in competitions. At the beginning of the Higgs boson competition, people were surprised to find that gbm in R and Python could not beat the official benchmark, while XGBoost came out and made it into the top 10 at that time. The main reasons for the improved accuracy are the newly defined regularization term and the pruning approach, which make the learned model more stable. For more details please check the official documentation.
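As a rough sketch of how regularization and pruning interact, the snippet below uses the leaf weight and split gain formulas from the XGBoost documentation, where G and H are the sums of first and second derivatives of the loss over the samples in a node, lambda is the L2 penalty on leaf weights and gamma is the penalty per additional leaf. The function names and the numbers are illustrative assumptions, not part of the package API.

```r
# Optimal weight of a leaf given gradient sum G and Hessian sum H.
# A larger lambda shrinks the weight towards zero.
leaf_weight <- function(G, H, lambda = 1) -G / (H + lambda)

# Quality score of a leaf (higher is better).
leaf_score <- function(G, H, lambda = 1) G^2 / (H + lambda)

# Gain of splitting a node into a left and a right child.
# gamma is the cost of adding one more leaf; a split whose gain is
# negative does not pay for itself and is a candidate for pruning.
split_gain <- function(GL, HL, GR, HR, lambda = 1, gamma = 0) {
  0.5 * (leaf_score(GL, HL, lambda) + leaf_score(GR, HR, lambda) -
           leaf_score(GL + GR, HL + HR, lambda)) - gamma
}

# Illustrative numbers only.
gain_free   <- split_gain(GL = -4, HL = 3, GR = 5, HR = 4, gamma = 0)
gain_costly <- split_gain(GL = -4, HL = 3, GR = 5, HR = 4, gamma = 5)
keep_split  <- gain_costly > 0
```

The same split looks worthwhile without a leaf penalty and is rejected once gamma is large, which is the sense in which the regularization term keeps the learned trees small and stable.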

3. Advanced features

Besides speed and accuracy, XGBoost has a lot of other useful features. The following list contains some of them. Readers can click the demo link after each item for sample code.

  1. As long as you can calculate the first and second derivatives of the loss function, you can customize the training objective in XGBoost. demo
  2. Users can define their own metric in cross validation, for example RMSE or RMSLE for regression and error rate, AUC or F1-score for classification, or even an unusual metric such as the AMS from the Higgs boson competition. demo
  3. The cross validation function can return the predictions on each test fold, which makes it easier to build ensemble models. demo
  4. Users can iterate 1000 times first, check the model's strength, and then run another 1000 iterations on top of the previous result. demo
  5. The model can output the id of the leaf that each data sample falls into. This is one part of the model from a Facebook paper. demo
  6. The model can calculate the feature importance and plot the trees. demo
  7. Users can boost regularized linear models instead of trees. demo
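To illustrate the first two items, here is a sketch of a custom objective and a custom metric for binary classification, modelled on the logistic loss. The helpers sigmoid, logistic_grad_hess and error_rate are written for this example. The wrappers logregobj and evalerror follow the (preds, dtrain) shape used by the package's custom objective demo and need xgboost only when they are actually called. The finite-difference check at the end is there only to confirm that the derivative is right.

```r
sigmoid <- function(z) 1 / (1 + exp(-z))

# First and second derivatives of the logistic loss with respect to the
# raw (untransformed) prediction, often called the margin.
logistic_grad_hess <- function(margin, label) {
  p <- sigmoid(margin)
  list(grad = p - label, hess = p * (1 - p))
}

# Error rate on raw margins: a margin above 0 means predicted class 1.
error_rate <- function(margin, label) {
  mean(as.numeric(margin > 0) != label)
}

# Wrappers in the (preds, dtrain) form; dtrain is an xgb.DMatrix.
logregobj <- function(preds, dtrain) {
  logistic_grad_hess(preds, xgboost::getinfo(dtrain, "label"))
}
evalerror <- function(preds, dtrain) {
  labels <- xgboost::getinfo(dtrain, "label")
  list(metric = "error", value = error_rate(preds, labels))
}

# Sanity check of the gradient with central finite differences.
logloss <- function(margin, label) {
  -(label * log(sigmoid(margin)) + (1 - label) * log(1 - sigmoid(margin)))
}
margin <- c(-1.5, 0.2, 2.0)
label  <- c(0, 1, 1)
h <- 1e-5
num_grad <- (logloss(margin + h, label) - logloss(margin - h, label)) / (2 * h)
gh <- logistic_grad_hess(margin, label)
max_grad_diff <- max(abs(num_grad - gh$grad))
```

In the package version this article describes, xgb.train accepts such functions through its obj and feval arguments. Argument names may differ in other versions, so check the documentation of the one you have installed.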

These features enable users to apply the tool in a variety of scenarios. In fact, many of them come from user requests.

4. Learning Sources

The information in this article is limited. We have provided several scripts to help you understand the tool better:

  • The folder for all the sample scripts
  • The script for the higgs boson competition
  • The script for the otto competition

If you are interested in a deeper understanding of the algorithm or the tool, you may find the following links useful:

