Individual decision trees tend to overfit. Bootstrap-aggregated (bagged) decision trees combine the results of many decision trees, which reduces the effects of overfitting and improves generalization. TreeBagger grows the decision trees in the ensemble using bootstrap samples of the data. Also, TreeBagger selects a random subset of predictors to use at each decision split, as in the random forest algorithm [1].

By default, TreeBagger bags classification trees. To bag regression trees instead, specify 'Method','regression'.

For regression problems, TreeBagger supports mean and quantile regression (that is, quantile regression forests [5]).
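For example, this sketch grows a bagged classification ensemble on Fisher's iris data and estimates the out-of-bag classification error; the choice of 50 trees and the random seed are illustrative only:

load fisheriris                     % predictors meas, class labels species
rng(1)                              % for reproducibility
Mdl = TreeBagger(50,meas,species,'OOBPrediction','on');
oobErr = oobError(Mdl);             % out-of-bag misclassification rate vs. number of trees
plot(oobErr)
xlabel('Number of grown trees')
ylabel('Out-of-bag classification error')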
Mdl = TreeBagger(NumTrees,Tbl,ResponseVarName)
Mdl = TreeBagger(NumTrees,Tbl,formula)
Mdl = TreeBagger(NumTrees,Tbl,Y)
B = TreeBagger(NumTrees,X,Y)
B = TreeBagger(NumTrees,X,Y,Name,Value)
Mdl = TreeBagger(NumTrees,Tbl,ResponseVarName) returns an ensemble of NumTrees bagged classification trees trained using the sample data in the table Tbl. ResponseVarName is the name of the response variable in Tbl.
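As a minimal sketch of this syntax (the table variable names here are invented for the example):

load fisheriris
Tbl = array2table(meas, ...
    'VariableNames',{'SepalLength','SepalWidth','PetalLength','PetalWidth'});
Tbl.Species = species;              % add the response variable to the table
Mdl = TreeBagger(50,Tbl,'Species'); % name the response variable in Tbl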
Mdl = TreeBagger(NumTrees,Tbl,formula) returns an ensemble of bagged classification trees trained using the sample data in the table Tbl. formula is an explanatory model of the response and a subset of predictor variables in Tbl used to fit Mdl. Specify formula using Wilkinson notation. For more information, see Wilkinson Notation.
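Continuing the table sketch above, a formula in Wilkinson notation restricts the fit to a subset of the predictors in Tbl:

% Fit using only the two sepal measurements as predictors
Mdl = TreeBagger(50,Tbl,'Species ~ SepalLength + SepalWidth');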
Mdl = TreeBagger(NumTrees,Tbl,Y) returns an ensemble of classification trees using the predictor variables in table Tbl and class labels in vector Y.
Y is an array of response data. Elements of Y correspond to the rows of Tbl.

For classification, Y is the set of true class labels. Labels can be any grouping variable, that is, a numeric or logical vector, character matrix, cell array of character vectors, or categorical vector. TreeBagger converts labels to a cell array of character vectors. For regression, Y is a numeric vector. To grow regression trees, you must specify the name-value pair 'Method','regression'.
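For instance, this sketch grows a regression forest on the carsmall data set and forms both a mean prediction and, assuming quantile regression forest support via quantilePredict, quartile predictions for a new observation; the observation values are invented:

load carsmall                       % Weight, Horsepower, MPG, among others
X = [Weight Horsepower];
ok = ~isnan(MPG);                   % drop observations with a missing response
Mdl = TreeBagger(100,X(ok,:),MPG(ok),'Method','regression');
mpgMean = predict(Mdl,[3000 130]);  % mean prediction
mpgQ = quantilePredict(Mdl,[3000 130],'Quantile',[0.25 0.5 0.75]);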
B = TreeBagger(NumTrees,X,Y) creates an ensemble B of NumTrees decision trees for predicting response Y as a function of predictors in the numeric matrix of training data, X. Each row in X represents an observation, and each column represents a predictor or feature.
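A minimal sketch of the matrix syntax, using the ionosphere data set:

load ionosphere                     % 351-by-34 numeric matrix X, labels Y
B = TreeBagger(50,X,Y);
labels = predict(B,X(1:5,:));       % predicted class labels (cell array of char vectors)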
B = TreeBagger(NumTrees,X,Y,Name,Value) specifies optional parameter name-value pairs:
'InBagFraction' – Fraction of the input data to sample, with replacement, for growing each new tree. Default is 1.
'Cost' – Square matrix C, where C(i,j) is the cost of classifying a point into class j if its true class is i. The default value is C(i,j) = 1 if i ~= j, and C(i,j) = 0 if i = j. Alternatively, specify a structure S with two fields: S.ClassNames, containing the group names, and S.ClassificationCosts, containing the cost matrix.
'SampleWithReplacement' – 'on' to sample with replacement, or 'off' to sample without replacement. If you sample without replacement, you need to set 'InBagFraction' to a value less than one. Default is 'on'.
'OOBPrediction' – 'on' to store information on which observations are out of bag for each tree. Functions such as oobPredict can use this information to compute the predicted class probabilities for each tree in the ensemble. Default is 'off'.
'OOBPredictorImportance' – 'on' to store out-of-bag estimates of feature importance in the ensemble. Default is 'off'. Specifying 'on' also sets the 'OOBPrediction' value to 'on'. If an analysis of predictor importance is your goal, then also specify 'PredictorSelection','curvature' or 'PredictorSelection','interaction-curvature'. For more details, see fitctree or fitrtree.
'Method' – Either 'classification' or 'regression'. Regression requires a numeric Y.
'NumPredictorsToSample' – Number of variables to select at random for each decision split. Default is the square root of the number of variables for classification and one third of the number of variables for regression. Valid values are 'all' or a positive integer. Setting this argument to any valid value but 'all' invokes Breiman's random forest algorithm [1].
'NumPrint' – Number of training cycles (grown trees) after which TreeBagger displays a diagnostic message showing training progress. Default is no diagnostic messages.
'MinLeafSize' – Minimum number of observations per tree leaf. Default is 1 for classification and 5 for regression.
'Options' – A structure that specifies options governing the computation when growing the ensemble of decision trees. One option requests that the computation of decision trees on multiple bootstrap replicates uses multiple processors, if Parallel Computing Toolbox™ is available. Two options specify the random number streams to use in selecting bootstrap replicates. Create this argument with a call to statset, and retrieve values of the individual fields with a call to statget. The applicable statset parameters are 'UseParallel', 'UseSubstreams', and 'Streams'. For a combined usage sketch, see the example after this list.
'Prior' – Prior probabilities for each class. Specify as one of: 'Empirical' (the default), which determines class probabilities from the class frequencies in Y; 'Uniform', which sets all class probabilities equal; a numeric vector with one element for each class; or a structure S with two fields, S.ClassNames and S.ClassProbs.
'CategoricalPredictors' – Categorical predictors list, specified as the comma-separated pair consisting of 'CategoricalPredictors' and a numeric vector of positive indices, a logical vector, a character matrix, a cell array of character vectors, or 'all'.
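The following sketch combines several of the name-value pairs above; all parameter values are illustrative, and the 'UseParallel' option takes effect only if Parallel Computing Toolbox is available:

load fisheriris
paropts = statset('UseParallel',true);
Mdl = TreeBagger(200,meas,species, ...
    'Method','classification', ...
    'MinLeafSize',3, ...
    'NumPredictorsToSample',2, ...      % invokes the random forest algorithm [1]
    'OOBPredictorImportance','on', ...  % also sets 'OOBPrediction' to 'on'
    'Options',paropts);
imp = Mdl.OOBPermutedPredictorDeltaError;  % permutation importance per predictor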
In addition to the optional arguments above, TreeBagger accepts all optional fitctree and fitrtree arguments, with the exception of 'MinParentSize'. Refer to the documentation for fitctree and fitrtree for more detail.
TreeBagger generates in-bag samples by oversampling classes with large misclassification costs and undersampling classes with small misclassification costs. Consequently, out-of-bag samples have fewer observations from classes with large misclassification costs and more observations from classes with small misclassification costs. If you train a classification ensemble using a small data set and a highly skewed cost matrix, then the number of out-of-bag observations per class might be very low. Therefore, the estimated out-of-bag error might have a large variance and might be difficult to interpret. The same phenomenon can occur for classes with large prior probabilities.
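For instance, assuming a two-class problem in which misclassifying the second class is ten times as costly as misclassifying the first (an invented cost matrix for illustration):

load ionosphere
costs = [0 1; 10 0];                % rows are true classes, in the order of Mdl.ClassNames
Mdl = TreeBagger(100,X,Y,'Cost',costs,'OOBPrediction','on');
oobErr = oobError(Mdl);             % interpret with care if the data set is small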
Avoid large estimated out-of-bag error variances by setting a more balanced misclassification cost matrix or a less skewed prior probability vector.

For details on selecting split predictors and node-splitting algorithms when growing decision trees, see Algorithms for classification trees and Algorithms for regression trees.
The Trees property of B stores a cell vector of B.NumTrees CompactClassificationTree or CompactRegressionTree model objects. For a textual or graphical display of tree t in the cell vector, enter

view(B.Trees{t})
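For example, assuming the ensemble B from the matrix sketch above, this displays its first tree as a graph ('Mode' is a standard argument of the compact tree view method):

view(B.Trees{1},'Mode','graph')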
Standard CART tends to select split predictors containing many distinct values, e.g., continuous variables, over those containing few distinct values, e.g., categorical variables [4]. Consider specifying the curvature or interaction test if any of the following are true:
If there are predictors that have relatively fewer distinct values than other predictors, for example, if the predictor data set is heterogeneous.

If an analysis of predictor importance is your goal. TreeBagger stores predictor importance estimates in the OOBPermutedPredictorDeltaError property of Mdl.

For more information on predictor selection, see PredictorSelection for classification trees or PredictorSelection for regression trees.
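A sketch putting these recommendations together on the heterogeneous carsmall data (parameter values are illustrative):

load carsmall
Tbl = table(Cylinders,Horsepower,Model_Year,Weight,MPG);
Tbl = Tbl(~isnan(Tbl.MPG),:);       % drop observations with a missing response
Mdl = TreeBagger(200,Tbl,'MPG', ...
    'Method','regression', ...
    'PredictorSelection','curvature', ...  % unbiased split-predictor selection
    'OOBPredictorImportance','on');
imp = Mdl.OOBPermutedPredictorDeltaError;
[~,order] = sort(imp,'descend');
disp(Mdl.PredictorNames(order))     % predictors ranked by estimated importance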
[1] Breiman, L. "Random Forests." Machine Learning, Vol. 45, 2001, pp. 5–32.
[2] Breiman, L., J. Friedman, R. Olshen, and C. Stone. Classification and Regression Trees. Boca Raton, FL: CRC Press, 1984.
[3] Loh, W.Y. "Regression Trees with Unbiased Variable Selection and Interaction Detection." Statistica Sinica, Vol. 12, 2002, pp. 361–386.
[4] Loh, W.Y. and Y.S. Shih. "Split Selection Methods for Classification Trees." Statistica Sinica, Vol. 7, 1997, pp. 815–840.
[5] Meinshausen, N. "Quantile Regression Forests." Journal of Machine Learning Research, Vol. 7, 2006, pp. 983–999.