

**Introduction**

Recently there has been a lot of interest in “ensemble learning”, methods that generate many classifiers and aggregate their results. Two well-known methods are **boosting** (see, e.g., Shapire et al., 1998) and **bagging** (Breiman, 1996) of classification trees. In boosting, successive trees give extra weight to points incorrectly predicted by earlier predictors, and in the end a weighted vote is taken for prediction. In bagging, successive trees do not depend on earlier trees; each is constructed independently using a bootstrap sample of the data set, and in the end a simple majority vote is taken for prediction.

Breiman (2001) proposed **random forests**, which add an additional layer of randomness to bagging. In addition to constructing each tree using a different bootstrap sample of the data, random forests change how the classification or regression trees are constructed. In standard trees, each node is split using the best split among all variables. In a random forest, each node is split using the best among a subset of predictors randomly chosen at that node. This somewhat counterintuitive strategy turns out to perform very well compared to many other classifiers, including discriminant analysis, support vector machines and neural networks, and is robust against overfitting (Breiman, 2001). In addition, it is very user-friendly in the sense that it has only two parameters (the number of variables in the random subset at each node and the number of trees in the forest), and is usually not very sensitive to their values.

The randomForest package provides an R interface to the Fortran programs by Breiman and Cutler (available at http://www.stat.berkeley.edu/users/breiman/). This article provides a brief introduction to the usage and features of the R functions.
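The extra layer of randomness can be pictured with a short base-R sketch (an illustration only, not the package's internal code; the value floor(sqrt(p)) is just a commonly used choice for mtry in classification, not something prescribed here):

```r
# At each node, a random forest considers only a random subset of
# mtry predictors when searching for the best split.
p    <- 10               # total number of predictors
mtry <- floor(sqrt(p))   # a commonly used mtry for classification
set.seed(1)
candidates <- sample(p, mtry)  # predictors eligible at this node
candidates
```

With mtry = p this reduces to bagging, since every predictor is then a candidate at every node.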

**The algorithm**

The random forests algorithm (for both classification and regression) is as follows:

1. Draw ntree bootstrap samples from the original data.

2. For each of the bootstrap samples, grow an unpruned classification or regression tree, with the following modification: at each node, rather than choosing the best split among all predictors, randomly sample mtry of the predictors and choose the best split from among those variables. (Bagging can be thought of as the special case of random forests obtained when mtry = p, the number of predictors.)

3. Predict new data by aggregating the predictions of the ntree trees (i.e., majority votes for classification, average for regression).

An estimate of the error rate can be obtained, based on the training data, by the following:

1. At each bootstrap iteration, predict the data not in the bootstrap sample (what Breiman calls “out-of-bag”, or OOB, data) using the tree grown with the bootstrap sample.

2. Aggregate the OOB predictions. (On average, each data point would be out-of-bag around 36% of the times, so aggregate these predictions.) Calculate the error rate, and call it the OOB estimate of error rate.

Our experience has been that the OOB estimate of error rate is quite accurate, given that enough trees have been grown (otherwise the OOB estimate can bias upward; see Bylander (2002)).
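The “around 36%” figure follows from the bootstrap itself, and can be checked in a couple of lines of base R:

```r
# The chance that a given observation is NOT drawn in a bootstrap
# sample of size n is (1 - 1/n)^n, which approaches exp(-1) as n grows.
n <- 1000
p_oob <- (1 - 1/n)^n
round(p_oob, 3)    # close to 0.368
round(exp(-1), 3)  # the limiting value
```

So for any reasonably large data set, a bit over a third of the observations are out-of-bag for each tree.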

**Extra information from Random Forests**

The randomForest package optionally produces two additional pieces of information:

a measure of the importance of the predictor variables, and a measure of the internal structure of the data (the proximity of different data points to one another).

**Variable importance**

This is a difficult concept to define in general, because the importance of a variable may be due to its (possibly complex) interaction with other variables.

The random forest algorithm estimates the importance of a variable by looking at how much prediction error increases when (OOB) data for that variable are permuted while all others are left unchanged. The necessary calculations are carried out tree by tree as the random forest is constructed. (There are actually four different measures of variable importance implemented in the classification code. The reader is referred to Breiman (2002) for their definitions.)

**Proximity measure**

The (i, j) element of the proximity matrix produced by randomForest is

the fraction of trees in which elements i and j fall in the same terminal node. The intuition is that “similar” observations should be in the same terminal nodes more often than dissimilar ones. The proximity matrix can be used to identify structure in the data (see Breiman, 2002) or for unsupervised learning with random forests (see below).

**Usage in R**

The user interface to random forest is consistent with that of other classification functions such as nnet() (in the nnet package) and svm() (in the e1071 package). (We actually borrowed some of the interface code from those two functions.) There is a formula interface, and

predictors can be specified as a matrix or data frame via the x argument, with responses as a vector via the y argument. If the response is a factor, randomForest performs classification; if the response is continuous (that is, not a factor), randomForest performs regression. If the response is unspecified, randomForest performs unsupervised learning (see below). Currently randomForest does not handle ordinal categorical responses. Note that categorical predictor variables must also be specified as factors (or else they will be wrongly treated as continuous). The randomForest function returns an object of class "randomForest". Details on the components of such an object are provided in the online documentation. Methods provided for the class include predict and print.
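A minimal usage sketch of both interfaces, assuming the randomForest package is installed (iris is a built-in data set whose response, Species, is a factor, so classification is performed; ntree = 100 is an arbitrary choice for illustration):

```r
library(randomForest)
set.seed(17)

## Formula interface: everything on the right of ~ is a predictor.
iris.rf <- randomForest(Species ~ ., data = iris, ntree = 100)
print(iris.rf)  # the printout includes the OOB estimate of error rate

## Equivalent x/y interface: predictors as a data frame, response as a vector.
iris.rf2 <- randomForest(x = iris[, 1:4], y = iris$Species, ntree = 100)
```

Because Species is a factor, both calls fit a classification forest; replacing the response with a numeric column would switch randomForest to regression.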