Trains an XGBoost model in R
Trains an Extreme Gradient Boosting model. XGBoost belongs to a family of boosting algorithms that build an ensemble of weak learners to learn from the data. This class is a wrapper around the original xgboost R package; the full parameter documentation is available here: http://xgboost.readthedocs.io/en/latest/parameter.html
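A minimal end-to-end sketch of the workflow this class supports (initialise, fit, predict), using the built-in iris data; the parameter values are purely illustrative, and XGBTrainer is assumed to be loaded from the superml package:

library(data.table)
library(superml)
df <- copy(iris)
# xgboost expects a numeric target starting at 0
df$Species <- as.numeric(as.factor(df$Species)) - 1
xgb <- XGBTrainer$new(objective = 'multi:softmax',
                      eval_metric = 'merror',
                      maximize = FALSE,
                      num_class = 3,
                      n_estimators = 10)
xgb$fit(df, 'Species')
preds <- xgb$predict(as.matrix(df[, 1:4]))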
booster: the trainer type; possible values are gbtree (default), gblinear and dart
objective: specifies the learning task. Check the link above for all possible values.
nthread: number of parallel threads used for training; by default all available threads are used
silent: 0 means print running messages, 1 means silent mode
n_estimators: number of trees to grow. Default = 100
learning_rate: step size shrinkage used in each update to prevent overfitting. The lower the learning rate, the longer training takes. Value lies between 0 and 1. Default = 0.3
gamma: minimum loss reduction required to make a further partition on a leaf node of the tree. The larger gamma is, the more conservative the algorithm will be. Value lies between 0 and infinity. Default = 0
max_depth: the maximum depth of each tree. Default = 6
min_child_weight: minimum sum of instance weight (hessian) needed in a child. If the tree partition step results in a leaf node with a sum of instance weight less than min_child_weight, the building process gives up further partitioning. In a linear regression task, this simply corresponds to the minimum number of instances needed in each node. The larger min_child_weight is, the more conservative the algorithm will be. Value lies between 0 and infinity. Default = 1
subsample: subsample ratio of the training instances. Setting it to 0.5 means that XGBoost randomly samples half of the training data prior to growing trees, which helps prevent overfitting. Subsampling occurs once in every boosting iteration. Value lies between 0 and 1. Default = 1
colsample_bytree: subsample ratio of columns when constructing each tree. Subsampling occurs once in every boosting iteration. Value lies between 0 and 1. Default = 1
lambda: L2 regularization term on weights. Increasing this value makes the model more conservative. Default = 1
alpha: L1 regularization term on weights. Increasing this value makes the model more conservative. Default = 0
eval_metric: evaluation metric for validation data; a default metric is assigned according to the objective
print_every: print the training log after every n iterations. Default = 50
feval: custom evaluation function
early_stopping: used to prevent overfitting; stops model training if no improvement is seen for this number of iterations
maximize: if feval and early_stopping are set, then this parameter must be set as well. When TRUE, the larger the evaluation score the better.
custom_objective: custom objective function
save_period: when non-NULL, the model is saved to disk after every save_period rounds; 0 means save at the end
save_name: the name or path for the periodically saved model file
xgb_model: a previously built model to continue training from. Can be an object of class xgb.Booster, its raw data, or the name of a file with a previously saved model.
callbacks: a list of callback functions to perform various tasks during boosting. See callbacks. Some callbacks are created automatically depending on the parameters' values. Users can provide existing or custom callback methods to customize the training process.
verbose: if 0, xgboost stays silent; if 1, it prints performance information; if 2, it prints some additional information. Setting verbose > 0 automatically engages the cb.evaluation.log and cb.print.evaluation callback functions.
watchlist: what information should be printed when verbose = 1 or verbose = 2. The watchlist is used to specify validation set monitoring during training. For example, watchlist = list(validation1 = mat1, validation2 = mat2) watches the performance of each round's model on mat1 and mat2.
num_class: the number of classes in a multiclass classification problem
weight: a vector indicating the weight of each row of the input
na_missing: by default NA, which means NA values are treated as 'missing' by the algorithm. Sometimes 0 or another extreme value is used to represent missing values. This parameter is only used when the input is a dense matrix.
feature_names: internal use; stores the feature names for model importance
cv_model: internal use
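The tree-complexity, sampling and regularization parameters above all map directly onto arguments of XGBTrainer$new(). A hedged sketch of a constructor call that sets them explicitly (the values chosen here are illustrative starting points, not recommendations):

xgb <- XGBTrainer$new(objective = 'multi:softmax',
                      eval_metric = 'merror',
                      num_class = 3,
                      n_estimators = 100,
                      learning_rate = 0.1,
                      max_depth = 4,
                      min_child_weight = 2,
                      gamma = 1,
                      subsample = 0.8,
                      colsample_bytree = 0.8,
                      lambda = 1,
                      alpha = 0,
                      early_stopping = 10,
                      maximize = FALSE)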
new()
XGBTrainer$new(
booster,
objective,
nthread,
silent,
n_estimators,
learning_rate,
gamma,
max_depth,
min_child_weight,
subsample,
colsample_bytree,
lambda,
alpha,
eval_metric,
print_every,
feval,
early_stopping,
maximize,
custom_objective,
save_period,
save_name,
xgb_model,
callbacks,
verbose,
num_class,
weight,
na_missing
)

Arguments: as described in the parameter list above.
library(data.table)
df <- copy(iris)
# convert characters/factors to numeric
df$Species <- as.numeric(as.factor(df$Species))-1
# initialise model
xgb <- XGBTrainer$new(objective = 'multi:softmax',
maximize = FALSE,
eval_metric = 'merror',
num_class=3,
n_estimators = 2)

cross_val()
X: data.frame
y: character, name of target variable
nfolds: integer, number of folds
stratified: logical, whether to use stratified sampling
folds: the list of CV folds' indices - either those passed through the folds parameter or randomly generated
\dontrun{
library(data.table)
df <- copy(iris)
# convert characters/factors to numeric
df$Species <- as.numeric(as.factor(df$Species))-1
# initialise model
xgb <- XGBTrainer$new(objective = 'multi:softmax',
maximize = FALSE,
eval_metric = 'merror',
num_class=3,
n_estimators = 2)
# do cross validation to find optimal value for n_estimators
xgb$cross_val(X = df, y = 'Species',nfolds = 3, stratified = TRUE)
}
fit()
X: data.frame, training data
y: character, name of target variable
valid: data.frame, validation data
library(data.table)
df <- copy(iris)
# convert characters/factors to numeric
df$Species <- as.numeric(as.factor(df$Species))-1
# initialise model
xgb <- XGBTrainer$new(objective = 'multi:softmax',
maximize = FALSE,
eval_metric = 'merror',
num_class=3,
n_estimators = 2)
xgb$fit(df, 'Species')

predict()
library(data.table)
df <- copy(iris)
# convert characters/factors to numeric
df$Species <- as.numeric(as.factor(df$Species))-1
# initialise model
xgb <- XGBTrainer$new(objective = 'multi:softmax',
maximize = FALSE,
eval_metric = 'merror',
num_class=3,
n_estimators = 2)
xgb$fit(df, 'Species')
# make predictions
preds <- xgb$predict(as.matrix(iris[,1:4]))

show_importance()
\dontrun{
library(data.table)
df <- copy(iris)
# convert characters/factors to numeric
df$Species <- as.numeric(as.factor(df$Species))-1
# initialise model
xgb <- XGBTrainer$new(objective = 'multi:softmax',
maximize = FALSE,
eval_metric = 'merror',
num_class=3,
n_estimators = 2)
xgb$fit(df, 'Species')
xgb$show_importance()
}
## ------------------------------------------------
## Method `XGBTrainer$new`
## ------------------------------------------------
library(data.table)
df <- copy(iris)
# convert characters/factors to numeric
df$Species <- as.numeric(as.factor(df$Species))-1
# initialise model
xgb <- XGBTrainer$new(objective = 'multi:softmax',
maximize = FALSE,
eval_metric = 'merror',
num_class=3,
n_estimators = 2)
## ------------------------------------------------
## Method `XGBTrainer$cross_val`
## ------------------------------------------------
if (FALSE) {
library(data.table)
df <- copy(iris)
# convert characters/factors to numeric
df$Species <- as.numeric(as.factor(df$Species))-1
# initialise model
xgb <- XGBTrainer$new(objective = 'multi:softmax',
maximize = FALSE,
eval_metric = 'merror',
num_class=3,
n_estimators = 2)
# do cross validation to find optimal value for n_estimators
xgb$cross_val(X = df, y = 'Species',nfolds = 3, stratified = TRUE)
}
## ------------------------------------------------
## Method `XGBTrainer$fit`
## ------------------------------------------------
library(data.table)
df <- copy(iris)
# convert characters/factors to numeric
df$Species <- as.numeric(as.factor(df$Species))-1
# initialise model
xgb <- XGBTrainer$new(objective = 'multi:softmax',
maximize = FALSE,
eval_metric = 'merror',
num_class=3,
n_estimators = 2)
xgb$fit(df, 'Species')
#> converting the data into xgboost format..
#> starting with training...
#> [10:48:34] WARNING: amalgamation/../src/learner.cc:627:
#> Parameters: { "nrounds" } might not be used.
#>
#> This could be a false alarm, with some parameters getting used by language bindings but
#> then being mistakenly passed down to XGBoost core, or some parameter actually being used
#> but getting flagged wrongly here. Please open an issue if you find any such cases.
#>
#>
#> [1] train-merror:0.020000
#> Will train until train_merror hasn't improved in 50 rounds.
#>
#> [2] train-merror:0.026667
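# Additional sketch (not from the original example): fit() also accepts a
# validation data.frame via the `valid` argument, which pairs with the
# `early_stopping` constructor parameter. Shown as not-run, with
# illustrative values:
if (FALSE) {
idx <- sample(nrow(df), 0.8 * nrow(df))
xgb_es <- XGBTrainer$new(objective = 'multi:softmax',
                         eval_metric = 'merror',
                         num_class = 3,
                         n_estimators = 100,
                         early_stopping = 10,
                         maximize = FALSE)
# stops early if validation merror does not improve for 10 rounds
xgb_es$fit(df[idx, ], 'Species', valid = df[-idx, ])
}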
## ------------------------------------------------
## Method `XGBTrainer$predict`
## ------------------------------------------------
library(data.table)
df <- copy(iris)
# convert characters/factors to numeric
df$Species <- as.numeric(as.factor(df$Species))-1
# initialise model
xgb <- XGBTrainer$new(objective = 'multi:softmax',
maximize = FALSE,
eval_metric = 'merror',
num_class=3,
n_estimators = 2)
xgb$fit(df, 'Species')
#> converting the data into xgboost format..
#> starting with training...
#> [10:48:34] WARNING: amalgamation/../src/learner.cc:627:
#> Parameters: { "nrounds" } might not be used.
#>
#> This could be a false alarm, with some parameters getting used by language bindings but
#> then being mistakenly passed down to XGBoost core, or some parameter actually being used
#> but getting flagged wrongly here. Please open an issue if you find any such cases.
#>
#>
#> [1] train-merror:0.020000
#> Will train until train_merror hasn't improved in 50 rounds.
#>
#> [2] train-merror:0.026667
# make predictions
preds <- xgb$predict(as.matrix(iris[,1:4]))
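# Quick sanity check (illustrative addition, not part of the original example):
# with objective 'multi:softmax' the predictions should be class labels 0..2,
# so they can be compared directly against the recoded target
mean(preds == df$Species)
table(predicted = preds, actual = df$Species)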
## ------------------------------------------------
## Method `XGBTrainer$show_importance`
## ------------------------------------------------
if (FALSE) {
library(data.table)
df <- copy(iris)
# convert characters/factors to numeric
df$Species <- as.numeric(as.factor(df$Species))-1
# initialise model
xgb <- XGBTrainer$new(objective = 'multi:softmax',
maximize = FALSE,
eval_metric = 'merror',
num_class=3,
n_estimators = 2)
xgb$fit(df, 'Species')
xgb$show_importance()
}