TreeBagger

Class: TreeBagger

Create ensemble of bagged decision trees

Syntax

B = TreeBagger(NumTrees,Tbl,ResponseVarName)
B = TreeBagger(NumTrees,Tbl,Formula)
B = TreeBagger(NumTrees,Tbl,Y)
B = TreeBagger(NumTrees,X,Y)
B = TreeBagger(NumTrees,X,Y,Name,Value)

Description

B = TreeBagger(NumTrees,Tbl,ResponseVarName) creates an ensemble B of NumTrees decision trees for predicting the responses stored in ResponseVarName as a function of the predictors in the table Tbl, where ResponseVarName is the name of a variable in Tbl. By default, TreeBagger builds an ensemble of classification trees. To build an ensemble of regression trees instead, set the optional name-value argument 'Method' to 'regression'.
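
For example, the following minimal sketch trains a regression ensemble from a table. The table and its variable names (SL, SW, PL, PW) are constructed here purely for illustration:

load fisheriris
Tbl = array2table(meas,'VariableNames',{'SL','SW','PL','PW'});
B = TreeBagger(30,Tbl,'PW','Method','regression'); % predict petal width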

B = TreeBagger(NumTrees,Tbl,Formula) creates an ensemble B of NumTrees decision trees, using the formula string Formula to specify the response and predictor variables in Tbl. Specify Formula using Wilkinson notation. For more information, see Wilkinson Notation.
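
For example, here is a sketch reusing the illustrative table above, with the species added as a table variable. The formula selects Species as the response and SL and PW as predictors:

load fisheriris
Tbl = array2table(meas,'VariableNames',{'SL','SW','PL','PW'});
Tbl.Species = species;
B = TreeBagger(30,Tbl,'Species ~ SL + PW');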

B = TreeBagger(NumTrees,Tbl,Y) creates an ensemble B of NumTrees decision trees for predicting responses in vector Y as a function of the predictors stored in the table Tbl.

Y is an array of response data. Elements of Y correspond to the rows of Tbl or X. For classification, Y is the set of true class labels. Labels can be any grouping variable, that is, a numeric or logical vector, character matrix, cell vector of strings, or categorical vector. TreeBagger converts labels to a cell array of strings for classification. For regression, Y is a numeric vector.

B = TreeBagger(NumTrees,X,Y) creates an ensemble B of NumTrees decision trees for predicting response Y as a function of predictors in the numeric matrix of training data, X. Each row in X represents an observation and each column represents a predictor or feature.

B = TreeBagger(NumTrees,X,Y,Name,Value) specifies optional parameter name-value pairs:

'InBagFraction'

Fraction of the input data to sample, with replacement, for growing each new tree. Default is 1.

'Cost'

Square matrix C, where C(i,j) is the cost of classifying a point into class j if its true class is i (i.e., the rows correspond to the true class and the columns correspond to the predicted class). The order of the rows and columns of Cost corresponds to the order of the classes in the ClassNames property of the trained TreeBagger model B.

Alternatively, Cost can be a structure S having two fields:

  • S.ClassNames containing the group names as a categorical variable, character array, or cell array of strings

  • S.ClassificationCosts containing the cost matrix C

The default value is C(i,j) = 1 if i ~= j, and C(i,j) = 0 if i = j.

If Cost is highly skewed, then, for in-bag samples, the software oversamples unique observations from the class that has a large penalty. For smaller sample sizes, this might cause a very low relative frequency of out-of-bag observations from the class that has a large penalty. Therefore, the estimated out-of-bag error is highly variable, and might be difficult to interpret.
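
For instance, here is a sketch of a cost matrix for a two-class problem; X and Y are hypothetical training data, and the row and column order follows B.ClassNames:

% Misclassifying a true class 2 observation costs five times as much
% as misclassifying a true class 1 observation.
C = [0 1; 5 0];
B = TreeBagger(50,X,Y,'Cost',C);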

'SampleWithReplacement'

'on' to sample with replacement or 'off' to sample without replacement. If you sample without replacement, you must set 'InBagFraction' to a value less than one. Default is 'on'.

'OOBPrediction'

'on' to store information on which observations are out of bag for each tree. oobPredict can use this information to compute the predicted class probabilities for each tree in the ensemble. Default is 'off'.

'OOBPredictorImportance'

'on' to store out-of-bag estimates of feature importance in the ensemble. Specifying 'on' also sets the 'OOBPrediction' value to 'on'. Default is 'off'.

'Method'

Either 'classification' or 'regression'. Regression requires a numeric Y.

'NumPredictorsToSample'

Number of variables to select at random for each decision split. Valid values are 'all' or a positive integer. Default is the square root of the number of variables for classification, and one-third of the number of variables for regression. Setting this argument to any valid value except 'all' invokes Breiman's random forest algorithm.

'NumPrint'

Number of training cycles (grown trees) after which TreeBagger displays a diagnostic message showing training progress. Default is no diagnostic messages.

'MinLeafSize'

Minimum number of observations per tree leaf. Default is 1 for classification and 5 for regression.

'Options'

A structure that specifies options governing the computation when growing the ensemble of decision trees. One option requests that the decision trees on multiple bootstrap replicates be computed on multiple processors, if the Parallel Computing Toolbox™ is available. Two options specify the random number streams to use in selecting bootstrap replicates. Create this argument with a call to statset; retrieve values of the individual fields with a call to statget. The applicable statset parameters follow; see the usage sketch after this list.
  • 'UseParallel' — If true and if a parpool of the Parallel Computing Toolbox is open, compute decision trees drawn on separate bootstrap replicates in parallel. If the Parallel Computing Toolbox is not installed, or a parpool is not open, computation occurs in serial mode. Default is false, or serial computation.

  • 'UseSubstreams' — If true, select each bootstrap replicate using a separate substream of the random number generator (stream). This option is available only with RandStream types that support substreams: 'mlfg6331_64' or 'mrg32k3a'. Default is false: do not use a different substream to compute each bootstrap replicate.

  • 'Streams' — A RandStream object or cell array of such objects. If you do not specify Streams, TreeBagger uses the default stream or streams. If you choose to specify Streams, use a single object except when all of the following are true:

    • You have an open Parallel pool

    • UseParallel is true

    • UseSubstreams is false

    In that case, use a cell array the same size as the Parallel pool.
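
For example, here is a minimal sketch that computes bootstrap replicates in parallel and reproducibly via substreams. It assumes the Parallel Computing Toolbox is installed, a parpool is open, and X and Y are hypothetical training data:

s = RandStream('mlfg6331_64'); % a generator type that supports substreams
paroptions = statset('UseParallel',true,'UseSubstreams',true,'Streams',s);
B = TreeBagger(100,X,Y,'Options',paroptions);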

'Prior'

Prior probabilities for each class. Specify as one of:

  • A string:

    • 'Empirical' determines class probabilities from class frequencies in Y. If you pass observation weights, they are used to compute the class probabilities. This is the default.

    • 'Uniform' sets all class probabilities equal.

  • A vector (one scalar value for each class). The order of the elements of Prior corresponds to the order of the classes in the ClassNames property of the trained TreeBagger model B.

  • A structure S with two fields:

    • S.ClassNames containing the class names as a categorical variable, character array, or cell array of strings

    • S.ClassProbs containing a vector of corresponding probabilities

If you set values for both Weights and Prior, the weights are renormalized to add up to the value of the prior probability in the respective class.

If Prior is highly skewed, then, for in-bag samples, the software oversamples unique observations from the class that has a large prior probability. For smaller sample sizes, this might cause a very low relative frequency of out-of-bag observations from the class that has a large prior probability. Therefore, the estimated out-of-bag error is highly variable, and might be difficult to interpret.
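
For example, here is a sketch that specifies priors as a structure for Fisher's iris data; the probabilities are illustrative, not recommendations:

load fisheriris
S.ClassNames = {'setosa','versicolor','virginica'};
S.ClassProbs = [0.5 0.25 0.25];
B = TreeBagger(50,meas,species,'Prior',S);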

'CategoricalPredictors'

Categorical predictors list, specified as the comma-separated pair consisting of 'CategoricalPredictors' and one of the following.

  • A numeric vector with indices from 1 to p, where p is the number of columns of X.

  • A logical vector of length p, where a true entry means that the corresponding column of X is a categorical variable.

  • A cell array of strings, where each element in the array is the name of a predictor variable. The names must match entries in the PredictorNames value.

  • A character matrix, where each row of the matrix is the name of a predictor variable. The names must match entries in the PredictorNames value. Pad the names with extra blanks so each row of the character matrix has the same length.

  • 'all', meaning all predictors are categorical.
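
For example, this sketch marks the second and fourth columns of a hypothetical predictor matrix X as categorical:

B = TreeBagger(50,X,Y,'CategoricalPredictors',[2 4]);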

In addition to the optional arguments above, this method accepts all optional fitctree and fitrtree arguments with the exception of 'MinParent'. Refer to the documentation for fitctree and fitrtree for more detail.
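
For example, 'Surrogate' is a fitctree argument; this sketch passes it through TreeBagger so that the trees use surrogate splits (X and Y are hypothetical training data):

B = TreeBagger(50,X,Y,'Surrogate','on');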

Examples


Train Bagged Ensemble of Classification Trees

Load Fisher's iris data set.

load fisheriris

Train a bagged ensemble of classification trees using the data and specifying 50 weak learners. Store which observations are out of bag for each tree.

rng(1); % For reproducibility
BaggedEnsemble = TreeBagger(50,meas,species,'OOBPrediction','On',...
    'Method','classification')
BaggedEnsemble = 

  TreeBagger
Ensemble with 50 bagged decision trees:
                    Training X:              [150x4]
                    Training Y:              [150x1]
                        Method:       classification
                 NumPredictors:                    4
         NumPredictorsToSample:                    2
                   MinLeafSize:                    1
                 InBagFraction:                    1
         SampleWithReplacement:                    1
          ComputeOOBPrediction:                    1
 ComputeOOBPredictorImportance:                    0
                     Proximity:                   []
                    ClassNames:        'setosa'    'versicolor'     'virginica'

BaggedEnsemble is a TreeBagger ensemble.

BaggedEnsemble.Trees is the property that stores a 50-by-1 cell vector of the trained classification trees (CompactClassificationTree model objects) that compose the ensemble.

Plot a graph of the first trained classification tree.

view(BaggedEnsemble.Trees{1},'Mode','graph')

By default, TreeBagger grows deep trees.

BaggedEnsemble.OOBIndices is the property that stores the out-of-bag indices as a matrix of logical values.

Plot the out-of-bag error over the number of grown classification trees.

oobErrorBaggedEnsemble = oobError(BaggedEnsemble);
plot(oobErrorBaggedEnsemble)
xlabel 'Number of grown trees';
ylabel 'Out-of-bag classification error';

The out-of-bag error decreases with the number of grown trees.

To label out-of-bag observations, pass BaggedEnsemble to oobPredict.
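
For example, this returns a predicted class for each training observation, computed using only the trees for which that observation is out of bag:

oobLabels = oobPredict(BaggedEnsemble);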

Algorithms

TreeBagger generates in-bag samples by oversampling classes with large misclassification costs and undersampling classes with small misclassification costs. Consequently, out-of-bag samples have fewer observations from classes with large misclassification costs and more observations from classes with small misclassification costs. If you train a classification ensemble using a small data set and a highly skewed cost matrix, then the number of out-of-bag observations per class might be very low. Therefore, the estimated out-of-bag error might have a large variance and might be difficult to interpret. The same phenomenon can occur for classes with large prior probabilities.

Tips

  • Avoid large estimated out-of-bag error variances by setting a more balanced misclassification cost matrix or a less skewed prior probability vector.

  • The Trees property of B stores a cell vector of B.NumTrees CompactClassificationTree or CompactRegressionTree model objects. For a textual or graphical display of tree t in the cell vector, enter

    view(B.Trees{t})

Introduced in R2009a
