This blog introduces the R package gbm. The gbm package implements extensions of Freund and Schapire's AdaBoost algorithm and Friedman's gradient boosting machine. It includes regression methods for least squares, absolute loss, t-distribution loss, quantile regression, logistic, multinomial logistic, Poisson, Cox proportional hazards partial likelihood, AdaBoost exponential loss, Huberized hinge loss, and learning-to-rank measures.

**Adaboost**

For more details about AdaBoost, see Explaining Adaboost.

AdaBoost, short for "**Adaptive Boosting**", is a machine learning meta-algorithm formulated by Yoav Freund and Robert Schapire. It can be used in conjunction with many other types of learning algorithms to improve their performance. The output of the other learning algorithms ('weak learners') is combined into a weighted sum that represents the final output of the boosted classifier. AdaBoost is adaptive in the sense that subsequent weak learners are tweaked in favor of those instances misclassified by previous classifiers. AdaBoost is sensitive to noisy data and outliers. In some problems, however, it can be **less susceptible to the overfitting problem** than other learning algorithms. The individual learners can be weak, but as long as the performance of each one is slightly better than random guessing (i.e., their error rate is smaller than 0.5 for binary classification), the final model can be proven to converge to a strong learner.

While every learning algorithm will tend to suit some problem types better than others, and will typically have many different parameters and configurations to be adjusted before achieving optimal performance on a dataset, **AdaBoost (with decision trees as the weak learners) is often referred to as the best out-of-the-box classifier**. When used with decision tree learning, information gathered at each stage of the AdaBoost algorithm about the relative 'hardness' of each training sample is fed into the tree-growing algorithm, so that later trees tend to focus on harder-to-classify examples.

Problems in machine learning often suffer from the curse of dimensionality: each sample may consist of a huge number of potential features (for instance, there can be 162,336 Haar features, as used by the Viola–Jones object detection framework, in a 24×24 pixel image window), and evaluating every feature can reduce not only the speed of classifier training and execution, but in fact reduce predictive power, per the **Hughes Effect**. Unlike neural networks and SVMs, **the AdaBoost training process selects only those features known to improve the predictive power of the model**, reducing dimensionality and potentially improving execution time, as irrelevant features do not need to be computed.

Pseudocode for AdaBoost is shown in the following figure. Here we are given m labeled training examples (x1, y1), ..., (xm, ym), where the xi's lie in some domain X and the labels yi ∈ {−1, +1}. On each round t = 1, ..., T, a distribution **Dt** is computed as in the figure over the **m training examples**, and a given weak learner or weak learning algorithm is applied to find a **weak hypothesis ht : X → {−1, +1}**, where the aim of the weak learner is to find a **weak hypothesis with low weighted error εt relative to Dt**. The final or **combined hypothesis H computes the sign of a weighted combination of weak hypotheses**.

We continue by considering how the general theory of Vapnik and Chervonenkis can be applied directly to AdaBoost. Intuitively, for a learned classifier to be effective and accurate in its predictions, it should meet three conditions: (1) it should have been trained on "enough" training examples; (2) it should provide a good fit to those training examples (low training error); and (3) it should be "simple". This last condition, our expectation that simpler rules are better, is often referred to as Occam's razor.
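The pseudocode described above can be sketched directly in code. Below is a minimal illustration in Python with NumPy; the function names and the threshold-stump weak learner are choices of this sketch, not a fixed part of AdaBoost:

```python
import numpy as np

def stump_learner(X, y, D):
    """Return (weighted error, hypothesis) for the best threshold stump under D."""
    best = None
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            for sign in (1, -1):
                pred = np.where(X[:, j] < thr, sign, -sign)
                err = D[pred != y].sum()        # weighted error eps relative to D
                if best is None or err < best[0]:
                    best = (err, j, thr, sign)
    err, j, thr, sign = best
    return err, lambda Z: np.where(Z[:, j] < thr, sign, -sign)

def adaboost(X, y, T):
    m = len(y)
    D = np.full(m, 1.0 / m)                     # D1: uniform over the m examples
    ensemble = []
    for _ in range(T):
        eps, h = stump_learner(X, y, D)         # weak hypothesis ht with error eps_t
        eps = min(max(eps, 1e-10), 1 - 1e-10)   # guard against a perfect stump
        alpha = 0.5 * np.log((1 - eps) / eps)   # weight of this weak hypothesis
        D = D * np.exp(-alpha * y * h(X))       # shift weight onto mistakes
        D = D / D.sum()                         # renormalize: Dt+1 is a distribution
        ensemble.append((alpha, h))
    # Final hypothesis H: sign of the weighted combination of weak hypotheses.
    return lambda Z: np.sign(sum(a * h(Z) for a, h in ensemble))

# Toy usage: 200 points in the plane, labels in {-1, +1} given by a diagonal
# boundary that no single axis-aligned stump can represent well on its own.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)
H = adaboost(X, y, T=50)
train_err = float(np.mean(H(X) != y))
```

Each stump on its own misclassifies roughly a quarter of these points, but the weighted combination carves out the diagonal boundary, driving the training error far below that of any single weak hypothesis.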

In formalizing these conditions, Vapnik and Chervonenkis established a foundation for understanding the fundamental nature of learning, laying the groundwork for the design of effective and principled learning algorithms. Specifically, they derived upper bounds on the generalization error of a classifier that could be stated in terms of the three conditions above, and along the way provided workable definitions (such as the VC-dimension) of the slippery and mysterious notion of simplicity.

Much of the work studying boosting has been based on the assumption that each of the weak hypotheses has accuracy just a little bit better than random guessing; for two-class problems, this means they should each have error below 1/2, that is, each εt should be at most 1/2 − γ for some γ > 0. This assumption, called the weak learning condition, is intrinsic to the mathematical definition of a boosting algorithm which, given this assumption and sufficient data, can provably produce a final hypothesis with arbitrarily small generalization error.

Given the weak learning condition, it is possible to prove that the training error of AdaBoost's final hypothesis decreases to zero very rapidly; in fact, in just O(log m) rounds (ignoring all other parameters of the problem), the final hypothesis will perfectly fit the training set. Furthermore, we can measure the complexity (that is, lack of simplicity) of the final hypothesis using the VC-dimension which can be computed using combinatorial arguments.
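The O(log m) claim can be made concrete with AdaBoost's standard training-error bound. Writing γt = 1/2 − εt for the edge of the t-th weak hypothesis, the training error of the final hypothesis H satisfies:

```latex
\widehat{\mathrm{err}}(H)
  \;\le\; \prod_{t=1}^{T} \left[\, 2\sqrt{\epsilon_t (1-\epsilon_t)} \,\right]
  \;=\; \prod_{t=1}^{T} \sqrt{1 - 4\gamma_t^2}
  \;\le\; \exp\!\Big( -2 \sum_{t=1}^{T} \gamma_t^2 \Big)
```

Under the weak learning condition each γt ≥ γ, so the bound is at most e^(−2γ²T); once T > ln(m)/(2γ²) this falls below 1/m, and since the training error is a multiple of 1/m it must then be exactly zero, giving the O(log m) rounds claimed above.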

**Having analyzed both the complexity and training fit of the final hypothesis, one can immediately apply the VC theory to obtain a bound on its generalization error.** The following figure shows the error (both training and test) of the final hypothesis as a function of the number of rounds of boosting. As noted above, we expect the training error to drop very quickly, but at the same time the VC-dimension of the final hypothesis is increasing roughly linearly with the number of rounds T.
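The experiment behind such a figure can be approximated in a few lines of Python with scikit-learn; the dataset, noise level, and number of rounds are arbitrary choices of this sketch:

```python
# Sketch: train/test error of AdaBoost as a function of the number of
# boosting rounds, using scikit-learn's staged predictions.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=600, n_features=20, n_informative=5,
                           flip_y=0.1, random_state=0)   # ~10% label noise
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

clf = AdaBoostClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
# staged_predict yields the ensemble's predictions after each round t = 1..T.
train_err = [float(np.mean(p != y_tr)) for p in clf.staged_predict(X_tr)]
test_err = [float(np.mean(p != y_te)) for p in clf.staged_predict(X_te)]
```

Plotting `train_err` and `test_err` against the round number reproduces the qualitative picture: training error falls steadily with T, while on noisy data the test error levels off and can eventually rise again.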

**Thus, with improved fit to the training set, the test error drops at first, but then rises again as a result of the final hypothesis becoming overly complex. This is classic overfitting behavior.**

**Gradient Boosting Machine**

Gradient boosting is a machine learning technique for regression problems, which produces a prediction model in the form of an ensemble of weak prediction models, typically decision trees. It builds the model in a stage-wise fashion like other boosting methods do, and it generalizes them by allowing optimization of an arbitrary differentiable loss function. The gradient boosting method can also be used for classification problems by reducing them to regression with a suitable loss function.
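For the squared-error loss, the negative gradient at each stage is simply the vector of residuals, so the stage-wise procedure reduces to "repeatedly fit a small tree to the current residuals". A minimal sketch in Python (the function names are illustrative, and scikit-learn's regression trees serve as the weak models):

```python
# Sketch: gradient boosting for squared-error regression. Each stage fits a
# shallow tree to the negative gradient (here, the residuals) and adds it to
# the ensemble scaled by a learning rate.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost_fit(X, y, n_stages=100, learning_rate=0.1, max_depth=2):
    f0 = y.mean()                                  # stage-0 model: a constant
    trees, resid = [], y - f0
    for _ in range(n_stages):
        tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, resid)
        resid = resid - learning_rate * tree.predict(X)   # gradient step
        trees.append(tree)
    return f0, trees

def gradient_boost_predict(X, f0, trees, learning_rate=0.1):
    return f0 + learning_rate * sum(t.predict(X) for t in trees)

# Toy usage: recover a sine curve from noisy samples.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=300)
f0, trees = gradient_boost_fit(X, y)
mse = float(np.mean((y - gradient_boost_predict(X, f0, trees)) ** 2))
```

The shrinkage factor (`learning_rate`) is what makes the fit stage-wise rather than greedy: each tree corrects only a fraction of the remaining error, which is also how gbm's `shrinkage` parameter behaves.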