
Gradient Boosters and the RossMann (Project)

Paul Grech and David Comfort
Posted on Dec 7, 2015

Contributed by David Comfort and Paul Grech. They took the NYC Data Science Academy 12-week full-time Data Science Bootcamp program from Sept 23 to Dec 18, 2015. This post is based on their fourth class project (due in the 8th week of the program).

====================

Overview

As part of a Kaggle competition, we were challenged by Rossmann, the second largest chain of German drug stores, to predict daily sales six weeks into the future for more than 1,000 stores. Exploratory data analysis revealed several novel features, including spikes in sales prior to, and following, store refurbishments. We also engineered several novel features from external data, including Google Trends, macroeconomic data, and weather data. We then used H2O, a fast, scalable, parallel-processing engine for machine learning, to build predictive models using random forests, gradient boosting machines, and deep learning. Lastly, we combined these models using different ensemble methods to obtain better predictive performance.

Training data was provided for 1,115 Rossmann stores from January 1st, 2013 through July 31st, 2015. The task was to forecast 6 weeks (August 1st, 2015 through September 17th, 2015) of sales for 856 of the Rossmann stores identified within the testing data.

Data Sets

  • TRAIN.CSV - historical data including sales
  • TEST.CSV - historical data excluding sales
  • SAMPLE_SUBMISSION.CSV - sample submission file in the correct format
  • STORE.CSV - supplemental information describing each of the stores

Data Fields

  • Id - represents a (Store, Date) tuple within the test set
  • Store - Unique Id for each store
  • Sales - The turnover for any given day (variable to be predicted)
  • Customers - The number of customers on a given day
  • Open - An indicator for whether the store was open: 0 = closed, 1 = open
  • StateHoliday - Indicates a state holiday
    • Normally all stores, with few exceptions, are closed on state holidays.
    • All schools are closed on public holidays and weekends.
    • a = public holiday
    • b = Easter holiday
    • c = Christmas
    • 0 = None
  • SchoolHoliday - Indicates if the (Store, Date) was affected by the closure of public schools
  • StoreType - Differentiates between 4 different store models:
    • a, b, c, d
  • Assortment - Describes an assortment level:
    • a = basic, b = extra, c = extended
  • CompetitionDistance - Distance in meters to the nearest competitor store
  • CompetitionOpenSince [Month/Year] - Approximate year and month of the time the nearest competitor was opened
  • Promo - Indicates whether a store is running a promo on that day
  • Promo2 - Continuing and consecutive promotion for some stores:
    • 0 = store is not participating
    • 1 = store is participating
  • Promo2Since [Year/Week] - Describes the year and calendar week when the store started participating in Promo2
  • PromoInterval - Describes the consecutive intervals in which Promo2 is started, naming the months the promotion is started anew.
    • "Feb,May,Aug,Nov" means each round starts in February, May, August, and November of any given year for that store

Exploratory Data Analysis

Exploratory data analysis was performed in an IPython notebook and in R. Findings discovered throughout the EDA process were all addressed during data cleaning and feature engineering. Thanks to fellow Kaggler Paul Shearer for creating a fantastic dygraph to display all of the data.

[Dygraph of daily sales across all Rossmann stores]

Data Cleaning

  1. Impute Open = 1 for missing Open values in the test dataset
    • Special case found during EDA. Store 622 has several missing dates which are all weekdays with sales recorded.
  2. Set Open = 0 when Sales = 0 OR Customers = 0
  3. Standardize StateHoliday due to the use of character 0 and integer 0
  4. Separate the Date column into year, month, and day. Also convert the Date column to type 'date' and extract:
    • day_of_year
    • week_of_year
    • quarter
    • month_start
    • month_end
    • quarter_start
    • quarter_end
  5. Remove observations where stores are closed. These values can be hard coded after prediction since no sales can occur with a closed store.
  6. Set store as factor.
  7. Merge store dataset with train and test
  8. Stores.csv contained an abundance of missing values, and the machine learning methods were chosen with this in mind. The methods we implemented handle missing values natively, as described below.
    • Distributed Random Forest and Gradient Boosting Machine treat missing (NA) factor levels as the smallest value present (left-most in the bins), which can go left or right for any split, and send unseen factor levels (the case here) to the left in any split.
    • Deep Learning by default makes an extra input neuron for missing and unseen categorical levels which can remain untrained if there were no such instances in the training data, leading to some random contribution to the next layer during testing.
  9. Experimenting with variables as factors:
    • We experimented with setting features to factors in order to see their effect on the MSE and residual errors. We should note that H2O can deal with large numbers of factors, and the categorical data does not need to be one-hot encoded. H2O does not expand categorical data into dummy variables; instead, it uses a bitset to determine which categorical levels go left or right on each split.
    • H2O is also more accurate than R/Python in this respect, likely because it deals with categorical variables properly, i.e. internally in the algorithm, rather than working from a previously one-hot encoded dataset where the link between the dummies belonging to the same original variable is lost. This is, incidentally, how the R packages work when the number of categories is small (but not in our case).
      • Train a model on data that has a categorical predictor (column) with levels B, C, D (and no other levels). Let's call these levels the "training set domain": {B,C,D}
      • During scoring, a test set has only rows with levels A, C, E for that column, the "test set domain": {A,C,E}
      • For scoring, we construct the joint "scoring domain": {B,C,D,A,E}, which is the training domain with the extra test-set domain entries appended.
      • Each model can handle these extra levels {A,E} during scoring separately. The way H2O deals with categories is not only more proper and gets better AUC, but it also makes the algorithm faster and more memory efficient. See Categorical variables with random forest for more information. In addition, most machine learning tools break when you try to predict with a new level for a categorical input that was not present in the training set, but H2O is able to handle such a situation.
    • See prediction with categorical variable with a new level for more information.
  10. We use a log transformation of Sales so that the model is not overly sensitive to very high sales values. A decent rule of thumb: if the data spans an order of magnitude, consider using a log transform.
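
A minimal sketch of this transformation, applicable once the train table is read in (as in the "Read Test and Train Data" section below). The column names match the features listed later; the quarter and month-start/end flags are omitted for brevity, and this is only an illustration of steps 4 and 10, not the project's exact code:

library(data.table)

# Date components (step 4 above)
train[, Date := as.Date(as.character(Date))]
train[, `:=`(year = year(Date), month = month(Date), day = mday(Date),
             day_of_year = yday(Date), week_of_year = week(Date))]

# Log-transform the response (step 10); log1p handles Sales = 0 gracefully,
# and expm1 inverts it when predictions are mapped back to the sales scale
train[, logSales := log1p(Sales)]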

Feature Engineering

  1. Promotion First Day and Promotion Second Day
  2. DayBeforeClosed
  3. DayAfterClosed
  4. Store Open on Sunday
  5. Closed
  6. Refurbishment period
    • EDA led to the discovery of a spike in sales before and after a period of extended closure, indicating a clearance sale and a grand re-opening sale
    • Some stores in the dataset were temporarily closed for refurbishment
  7. DaysBeforeRefurb
  8. DaysAfterRefurb
  9. Competition Feature
    • We create this new feature by taking the square root of the difference between the maximum distance of a competitor store among all the stores and the distance of a competitor store for an individual store, times the time since a competitor opened:
# Competition Feature
train$Competition <- 
  (sqrt(max(train$CompetitionDistance, na.rm = TRUE) - 
          train$CompetitionDistance)) *
  (((train$year - train$CompetitionOpenSinceYear) * 12) - 
     (train$CompetitionOpenSinceMonth-train$month))

test$Competition <- 
  (sqrt(max(test$CompetitionDistance, na.rm = TRUE) - 
          test$CompetitionDistance))*
  (((test$year - test$CompetitionOpenSinceYear) * 12) - 
     (test$CompetitionOpenSinceMonth-test$month))

Open Data Sources

  • German States derived from StateHoliday
  • German State Weather
  • Google Trends
  • Export into CSV

Introduction to H2O

H2O is an open-source math and machine learning engine for big data that brings distribution and parallelism to powerful algorithms while keeping widely used interfaces such as R, Spark, Python, Java, and JSON as an API. Using in-memory compression, H2O handles billions of data rows in memory, even with a small cluster. H2O includes many common machine learning algorithms, such as generalized linear modeling (linear regression, logistic regression, etc.), Naïve Bayes, principal components analysis, k-means clustering, and others. H2O also implements best-in-class algorithms at scale, such as distributed random forest, gradient boosting, and deep learning.

H2O Process

  • Have the correct version of Java installed
  • Start an H2O instance
  • Set features to factors in both train and test (see the sketch after the data is read in below)
  • Create validation set
  • Tune Parameters for each model (manual or grid search)
  • Use model to make predictions on test set
  • Iterate

Load Libraries into R

library(caret)       # createFolds for stratified cross-validation
library(data.table)  # fast fread and data manipulation
library(h2o)         # distributed machine learning engine
library(plyr)        # ddply for quick fold summaries

Initialize H2O Cluster With All Available Threads

One should use h2o.shutdown() before changing the parameters below. Also, setting assertion = FALSE seems to help with the stability of H2O.

h2o.init(nthreads=-1,max_mem_size='8G', assertion = FALSE)
## Successfully connected to http://127.0.0.1:54321/ 
## 
## R is connected to the H2O cluster: 
##     H2O cluster uptime:         16 hours 46 minutes 
##     H2O cluster version:        3.6.0.8 
##     H2O cluster name:           H2O_started_from_R_2015_ukz280 
##     H2O cluster total nodes:    1 
##     H2O cluster total memory:   7.98 GB 
##     H2O cluster total cores:    8 
##     H2O cluster allowed cores:  8 
##     H2O cluster healthy:        TRUE
## IP Address: 127.0.0.1 
## Port      : 54321 
## Session ID: _sid_ac1406fd65438164da7936d76cfe44b2 
## Key Count : 0

Read Test and Train Data

data.table was used to read in the data due to its greater efficiency for data frame manipulation compared to dplyr.

train <- fread("KaggleProject/data/train_states_R_v8.csv",
                stringsAsFactors = T)
test  <- fread("KaggleProject/data/test_states_R_v8.csv",
                stringsAsFactors = T, showProgress=FALSE)
store <- fread("input/store.csv",
                stringsAsFactors = T)
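
With the tables loaded, the "set features to factors" step from the H2O process list above might look like the following sketch. The column list is illustrative and should be adjusted to whichever features are treated as categorical; fread with stringsAsFactors = T already covers the character columns:

# Convert integer-coded categorical columns to factors in both train and test
factor_cols <- c("Store", "DayOfWeek", "Promo", "SchoolHoliday")
train[, (factor_cols) := lapply(.SD, as.factor), .SDcols = factor_cols]
test[, (factor_cols) := lapply(.SD, as.factor), .SDcols = factor_cols]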

Create Stratified Folds For Cross-Validation

The Rossmann dataset is a "pooled repeated measures" dataset, whereby multiple observations from different stores are grouped together. Hence, the internal cross-validation has to be done in an "honest" manner, i.e., all of the observations from one store must belong to a single fold; otherwise, it can lead to overfitting. Creating stratified folds for cross-validation can be easily achieved with the createFolds method from the caret package in R. Since the stores dataset has one store per row, we can create the folds in the stores dataset prior to merging it with the train and test datasets (a minimal merge sketch follows the output below).

folds <- createFolds(factor(store$Store), k = 10, list = FALSE)
store$fold <- folds
ddply(store, 'fold', summarise, prop=mean(store$fold)/10)
##    fold      prop
## 1     1 0.5598206
## 2     2 0.5598206
## 3     3 0.5598206
## 4     4 0.5598206
## 5     5 0.5598206
## 6     6 0.5598206
## 7     7 0.5598206
## 8     8 0.5598206
## 9     9 0.5598206
## 10   10 0.5598206
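
A minimal sketch of carrying the fold assignments over to the training table; the project merged the full store table during data cleaning, so only the fold column is shown here for illustration:

# Attach each store's fold assignment to the training rows by Store id
train <- merge(train, store[, .(Store, fold)], by = "Store", all.x = TRUE)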

H2O Training and Validation Test Sets

We split the training set by date: the training set is all rows prior to June 2015, and the validation set is all rows from June 2015 onward. We then check the dimensions of the training and validation sets.

trainHex<-as.h2o(train[year <2015 | month <6,],
          destination_frame = "trainHex")
validHex<-as.h2o(train[year == 2015 & month >= 6,],
          destination_frame = "validHex")

dim(trainHex)
dim(validHex)

## [1] 785727     37
## [1] 58611    37

Feature Set For Training

We exclude the Id, Date, Sales, logSales, Customers, Closed, and fold columns. The Id is only a row identifier; the Date has already been split into separate components (day, month, and year); Sales and the log of Sales are what we are predicting; and Customers is not given in the test set and, hence, cannot be used as a feature in training.

features<-names(train)[!(names(train) %in% 
                           c("Id","Date","Sales","logSales", 
                             "Customers", "Closed", "fold"))]
features
##  [1] "Store"                     "DayOfWeek"                
##  [3] "Open"                      "Promo"                    
##  [5] "StateHoliday"              "SchoolHoliday"            
##  [7] "year"                      "month"                    
##  [9] "day"                       "day_of_year"              
## [11] "week_of_year"              "PromoFirstDate"           
## [13] "PromoSecondDate"           "DayBeforeClosed"          
## [15] "DayAfterClosed"            "SundayStore"              
## [17] "DayBeforeRefurb"           "DayAfterRefurb"           
## [19] "DaysBeforeRefurb"          "DaysAfterRefurb"          
## [21] "State"                     "StoreType"                
## [23] "Assortment"                "CompetitionDistance"      
## [25] "CompetitionOpenSinceMonth" "CompetitionOpenSinceYear" 
## [27] "Promo2"                    "Promo2SinceWeek"          
## [29] "Promo2SinceYear"           "PromoInterval"            
## [31] "Competition"

Random Forest

Intuition

  • Average an ensemble of weakly predicting (larger) trees where each tree is de-correlated from all other trees
  • Bootstrap aggregation (bagging)
  • Fits many trees against different samples of the data and average together

Conceptual

  • Combine multiple decision trees, each fit to a random sample of the original data
  • Random samples
  • Rows / Columns
  • Reduce variance with minimal increase in bias

Strengths

  • Ease of use, with well-established default parameters
  • Robust
  • Competitive accuracy for most data sets
  • Random forests combine trees and hence incorporate most of the advantages of trees, such as handling missing values in variables, suitability for both classification and regression, and handling highly non-linear interactions and classification boundaries.
  • In addition, random forests give built-in estimates of accuracy and automatic variable selection via variable importance, handle wide data (more predictors than observations), and work well off the shelf with little tuning, so results can be obtained very quickly. The runtimes are quite fast, and they are able to deal with unbalanced and missing data.

Weaknesses

  • Slow to score
  • Lack of transparency
  • When used for regression, they cannot predict beyond the range of values in the training data, and they may overfit data sets that are particularly noisy. However, the best test of any algorithm is how well it works on a particular data set.

Train a Random Forest Using H2O

We should note that we used the stratified folds created in the step above, but H2O also has internal cross-validation (by setting nfolds to the desired number of cross-validation folds).

A random forest is an ensemble of decision trees that outputs a prediction value; an ensemble model combines the results from different models, and its result is usually better than the result from any one of the individual models. Random forests can be used for both classification and regression. In a random forest, each decision tree is constructed using a random subset of the training data with known responses. After the forest is trained, each test row is passed through it to output a prediction; the goal is to predict the response when it is unknown. The response can be categorical (classification) or continuous (regression). In a decision tree, an input is entered at the top, and as it traverses down the tree the data gets bucketed into smaller and smaller sets. The random forest takes the notion of decision trees to the next level by combining trees: in ensemble terms, the trees are weak learners and the random forest is a strong learner.

rfHex <- h2o.randomForest(x=features,
                          y="logSales", 
                          model_id="introRF",
                          training_frame=trainHex,
                          validation_frame=validHex,
                          mtries = -1, # default
                          sample_rate = 0.632, # default
                          ntrees = 100,
                          max_depth = 30,
                          nbins_cats = 1115, ## allow it to fit store ID
                          nfolds = 0,
                          fold_column="fold",
                          seed = 12345678 #Seed for random numbers (affects sampling)
)

Key Parameters for Random Forests

The key parameters for the Random Forest model on an H2O frame include:

  • x: A vector containing the names or indices of the predictor variables to use in building the model. We have defined x to be the features to consider in building our model.
  • y: The name or index of the response variable. In our case, the log of the Sales.
  • training_frame: An H2O Frame object containing the variables in the model. In our case, this was the subset of the train dataset defined above.
  • model_id: The unique id assigned to the resulting model.
  • validation_frame: An H2O Frame object used for validation. In our case, this was the held-out subset of the train dataset defined above.
  • mtries: At each iteration, a randomly chosen subset of the features in the training data is selected and evaluated to define the optimal split of that subset. Mtries specifies the number of features to be selected from the whole set. If set to -1, defaults to p/3 for regression, where p is the number of predictors.
  • sample_rate: The sampling rate at each split.
  • ntrees: A nonnegative integer that determines the number of trees to grow.
  • max_depth: Maximum depth to grow the tree. A user-defined tuning parameter for controlling model complexity (by number of edges). Depth is the longest path from root to the furthest leaf. Maximum depth also specifies the maximum number of interactions that can be accounted for by the model.
  • nbins_cats: For categorical columns (factors), build a histogram of this many bins, then split at the best point. Higher values can lead to more overfitting. In our case, we set it equal to the number of stores we are trying to model (1,115).
  • nfolds: Number of folds for cross-validation. Since we supply our own stratified folds via fold_column, we set this to 0.
  • fold_column: Column with cross-validation fold index assignment per observation, which we have set to fold, which was created in the โ€œCreate stratified folds for cross-validationโ€ step above.
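
Before retraining on the full training set, different parameter settings can be compared on the held-out validation frame. A minimal sketch, assuming the rfHex model and validHex frame defined above; the percentage-error check loosely mirrors the competition's percentage-based scoring and is only a rough sanity check:

# RMSE on the log(Sales) scale, computed on the validation frame
sqrt(h2o.mse(rfHex, valid = TRUE))

# Score the validation frame explicitly and check percentage error on the sales scale
validPred <- as.data.frame(h2o.predict(rfHex, validHex))
actual    <- as.data.frame(validHex$Sales)[, 1]

ok <- actual > 0  # ignore any zero-sales rows to avoid dividing by zero
sqrt(mean(((actual[ok] - expm1(validPred[ok, 1])) / actual[ok])^2))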

Benchmarking H2O and Random Forests

As noted before, H2O can use all of the cores on a machine and hence should run substantially faster than other random forest packages in R. Szilard Pafka recently benchmarked several machine learning tools for scalability, speed, and accuracy, including H2O. Pafka concluded that the H2O implementation of random forests is "fast, memory efficient and uses all cores. It deals with categorical variables automatically. It is also more accurate than R/Python, which may be because of dealing properly with the categorical variables, i.e. internally in the algo rather than working from a previously 1-hot encoded dataset (where the link between the dummies belonging to the same original variable is lost)."

- Benchmarking Random Forest Implementations

For more information about this benchmarking study, see Simple/limited/incomplete benchmark for scalability, speed and accuracy of machine learning libraries for classification and a video presentation, Szilard Pafka: Benchmarking Machine Learning Tools for Scalability, Speed and Accuracy.

Retrain Random Forest On Training Set

We retrain on the whole training set (without holding out the validation set or using the cross-validation folds) once we have tuned the parameters. Although not shown here, we extensively tested different parameters for the random forests as well as performed feature selection.

# Use the whole train dataset.
trainHex<-as.h2o(train)

# Run Random Forest Model
rfHex <- h2o.randomForest(x=features,
                           y="logSales", 
                           model_id="introRF",
                           training_frame=trainHex,
                           mtries = -1, # default
                           sample_rate = 0.632, # default
                           ntrees = 100,
                           max_depth = 30,
                           nbins_cats = 1115, ## allow it to fit store ID
                           nfolds = 0,
                           seed = 12345678 #Seed for random numbers (affects sampling)
 )

# Model Summary and Variable Importance
summary(rfHex)
varimps = data.frame(h2o.varimp(rfHex))

#Load test dataset into H2O from R
testHex<-as.h2o(test)

#Get predictions out; predicts in H2O, as.data.frame gets them into R
predictions<-as.data.frame(h2o.predict(rfHex,testHex))

# Return the predictions to the original scale of the Sales data
pred <- expm1(predictions[,1])
summary(pred)
submission <- data.frame(Id=test$Id, Sales=pred)

# Save the submission file
write.csv(submission, "../../data/H2O_Random_Forest_v47.csv",row.names=F)

 

Gradient Boosted Models

We also built gradient boosted models (GBM), again using H2O.

Intuition

  • Average an ensemble of weakly predicting (small) trees where each tree "adjusts" to the "mistakes" of the preceding trees.
  • Boosting
  • Fits consecutive trees where each solves for the net error of the prior trees.

Conceptual

  • Boosting: ensemble of weak learners (the notion of "weak" is being challenged in practice)
  • Fits consecutive trees where each solves for the net loss of the prior trees
  • Results of new trees are applied partially to the entire solution.

Strengths

  • Often best possible model
  • Robust
  • Directly optimizes cost function

Weaknesses

  • Overfits
  • Need to find proper stopping point
  • Sensitive to noise and extreme values
  • Several hyper-parameters
  • Lack of transparency

Important components:

  • Number of trees
  • Maximum depth of tree
  • Learning rate (shrinkage parameter); smaller learning rates tend to require a larger number of trees, and vice versa.

Gradient Boosting Models in Detail

A GBM is an ensemble of either regression or classification tree models. Both are forward-learning ensemble methods that obtain predictive results using gradually improved estimations. Boosting is a flexible nonlinear regression procedure that helps improve the accuracy of trees. Weak classification algorithms are sequentially applied to the incrementally changed data to create a series of decision trees, producing an ensemble of weak prediction models. While boosting trees increases their accuracy, it also decreases speed and user interpretability. The gradient boosting method generalizes tree boosting to minimize these drawbacks. For more information, see Gradient Boosted Models with H2O.

H2Oโ€™s GBM Functionalities

  • Supervised learning for regression and classification tasks
  • Distributed and parallelized computation on either a single node or a multi-node cluster
  • Fast and memory-efficient Java implementations of the algorithms
  • The ability to run H2O from R, Python, Scala, or the intuitive web UI (Flow)
  • Automatic early stopping based on convergence of user-specified metrics to user-specified relative tolerance
  • Stochastic gradient boosting with column and row sampling for better generalization
  • Support for exponential families (Poisson, Gamma, Tweedie) and loss functions in addition to binomial (Bernoulli), Gaussian, and multinomial distributions
  • Grid search for hyperparameter optimization and model selection
  • Model export in plain Java code for deployment in production environments
  • Additional parameters for model tuning (for a complete listing of parameters, refer to the Model Parameters section)

Key Parameters for GBM

There are three primary parameters, or knobs, to adjust in order to optimize GBMs.

  1. Adding trees will help. The default is 50.
  2. Increasing the learning rate will also help. The contribution of each tree will be stronger, so the model will move further away from the overall mean.
  3. Increasing the depth will help. This is the least straightforward parameter to tune. Tuning the number of trees and the learning rate both have a direct impact that is easy to understand; changing the depth means adjusting the "weakness" of each learner, and adding depth makes each tree fit the data more closely.
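
The grid-search functionality mentioned earlier can be used to explore these three knobs jointly. A minimal sketch, assuming the split trainHex/validHex frames and the features vector defined above; the grid id and candidate values are illustrative, not the settings used in the project:

# Hypothetical small grid over the three main GBM knobs
gbmGrid <- h2o.grid("gbm",
                    grid_id = "gbmGrid",
                    x = features, y = "logSales",
                    training_frame = trainHex,
                    validation_frame = validHex,
                    nbins_cats = 1115,
                    hyper_params = list(ntrees = c(100, 250),
                                        max_depth = c(10, 20),
                                        learn_rate = c(0.05, 0.1)))

# Rank the grid models by validation error ("rmse"; use "mse" on older H2O versions)
h2o.getGrid("gbmGrid", sort_by = "rmse", decreasing = FALSE)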

Retrain GBM On Training Set

# Train a GBM Model
gbmHex <- h2o.gbm(x=features,
                   y="logSales",
                   training_frame=trainHex,
                   model_id="introGBM",
                   nbins_cats=1115,
                   sample_rate = 0.5,
                   col_sample_rate = 0.5,
                   max_depth = 20,
                   learn_rate=0.05,
                   seed = 12345678, #Seed for random numbers (affects sampling)
                   ntrees = 250,
                   fold_column="fold",
                   validation_frame=validHex # validation set
) 

#Get a summary of the model and variable importance
summary(gbmHex)
varimps = data.frame(h2o.varimp(gbmHex))

# Get predictions out; predicts in H2O, as.data.frame gets them into R
predictions<-as.data.frame(h2o.predict(gbmHex,testHex))

# Return the predictions to the original scale of the Sales data
pred <- expm1(predictions[,1])
summary(pred)
submission <- data.frame(Id=test$Id, Sales=pred)

# Save the submission file
write.csv(submission, "../../data/H2O_GBM_v30.csv",row.names=F)
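
As mentioned in the overview, the individual models were also combined with ensemble methods. A minimal sketch of one simple approach, a weighted average of the two submission files produced above; the 40/60 weights are illustrative, not the ones used in the project:

# Blend the random forest and GBM submissions
rf_sub  <- read.csv("../../data/H2O_Random_Forest_v47.csv")
gbm_sub <- read.csv("../../data/H2O_GBM_v30.csv")

stopifnot(all(rf_sub$Id == gbm_sub$Id))  # both files must list the same Ids in order
ensemble <- data.frame(Id    = rf_sub$Id,
                       Sales = 0.4 * rf_sub$Sales + 0.6 * gbm_sub$Sales)

write.csv(ensemble, "../../data/H2O_RF_GBM_Ensemble.csv", row.names = FALSE)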

Results

Although we scored in the top 5% of a competition involving over 3,300 teams, the most valuable lesson learned was an understanding of the accuracy versus interpretability trade-off. By achieving the accuracy needed to score well in this Kaggle competition, we traded away the interpretability that would have been needed to explain the results to management had this been an actual business project. As a way of exploring this trade-off, one of the first methods we tried was a multiple linear regression, in order to gain a greater understanding of the characteristics of each feature. We achieved approximately 25% error using this simpler method. That might be an adequate prediction if the goal were broad-scale decision making based on market trends; however, in a Kaggle competition where accuracy was the priority, this method was not explored further.
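
For reference, a minimal sketch of how such a linear baseline might be fit and its percentage error checked on the validation split; the formula and error measure are illustrative and not the regression actually used in the project:

# Illustrative linear baseline on a few engineered features
trainLM <- train[year < 2015 | month < 6, ]
validLM <- train[year == 2015 & month >= 6, ]

lmFit <- lm(logSales ~ Promo + DayOfWeek + SchoolHoliday + StoreType + Assortment,
            data = trainLM)

# Mean absolute percentage error on validation rows with positive sales
predSales <- expm1(predict(lmFit, newdata = validLM))
ok <- validLM$Sales > 0
mean(abs(validLM$Sales[ok] - predSales[ok]) / validLM$Sales[ok])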

About Authors

Paul Grech

Paul Grech is a Data Scientist with a passion for exploring insight in big data. He is eager to advance his skills and build value in a professional environment. Previous experience includes several years of professional consulting experience in...
View all posts by Paul Grech >

David Comfort

David Comfort, D.Phil. is a data scientist, scientist, activist and writer. His doctoral research at Oxford University was in protein nuclear magnetic resonance (NMR) and computational biology. His post-doctoral research at UCLA involved genomics, bioinformatics as well as...
View all posts by David Comfort >
