TreeBagger class

Bootstrap aggregation for ensemble of decision trees

Description

TreeBagger bags an ensemble of decision trees for either classification or regression. Bagging stands for bootstrap aggregation. Every tree in the ensemble is grown on an independently drawn bootstrap replica of the input data. Observations not included in this replica are "out of bag" for this tree. To compute the prediction of an ensemble of trees for unseen data, TreeBagger takes an average of the predictions from individual trees. To estimate the prediction error of the bagged ensemble, you can compute predictions for each tree on its out-of-bag observations, average these predictions over the entire ensemble for each observation, and then compare the predicted out-of-bag response with the true value at this observation.
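
The bagging and out-of-bag steps described above can be sketched in a few lines. This is a plain-Python illustration of the idea only, not MATLAB code and not how TreeBagger is implemented. The function names are hypothetical, and each "tree" is replaced by a trivial stand-in learner that predicts the mean response of its bootstrap replica.

```python
import random

def bootstrap_indices(n_obs, rng):
    """Draw n_obs indices with replacement (one bootstrap replica)."""
    return [rng.randrange(n_obs) for _ in range(n_obs)]

def fit_mean_learner(y, indices):
    """Hypothetical stand-in for a tree: predicts the mean response of its replica."""
    return sum(y[i] for i in indices) / len(indices)

def bag(y, n_trees, seed=0):
    """Train n_trees learners and record which observations each one saw."""
    rng = random.Random(seed)
    learners, in_bag = [], []
    for _ in range(n_trees):
        idx = bootstrap_indices(len(y), rng)
        learners.append(fit_mean_learner(y, idx))
        in_bag.append(set(idx))
    return learners, in_bag

def oob_predictions(learners, in_bag, n_obs):
    """Average, per observation, only the learners for which it is out of bag."""
    preds = []
    for i in range(n_obs):
        votes = [m for m, seen in zip(learners, in_bag) if i not in seen]
        preds.append(sum(votes) / len(votes) if votes else None)
    return preds

y = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
learners, in_bag = bag(y, n_trees=50)
ensemble_prediction = sum(learners) / len(learners)   # average over all learners
oob = oob_predictions(learners, in_bag, len(y))

# Compare out-of-bag predictions with the true responses.
pairs = [(p, t) for p, t in zip(oob, y) if p is not None]
oob_mse = sum((p - t) ** 2 for p, t in pairs) / len(pairs)
```

An observation that happens to be in-bag for every learner has no out-of-bag prediction. The sketch marks it with None and leaves it out of the error estimate.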

TreeBagger relies on the ClassificationTree and RegressionTree functionality for growing individual trees. In particular, ClassificationTree and RegressionTree accept the number of features selected at random for each decision split as an optional input argument.

The compact method returns an object of another class, CompactTreeBagger, with sufficient information to make predictions using new data. This information includes the tree ensemble, variable names, and class names (for classification). CompactTreeBagger requires less memory than TreeBagger, but only TreeBagger has methods for growing more trees for the ensemble. Once you grow an ensemble of trees using TreeBagger and no longer need access to the training data, you can opt to work with the compact version of the trained ensemble from then on.

Construction

TreeBagger        Create ensemble of bagged decision trees

Methods

append            Append new trees to ensemble
compact           Compact ensemble of decision trees
error             Error (misclassification probability or MSE)
fillProximities   Proximity matrix for training data
growTrees         Train additional trees and add to ensemble
margin            Classification margin
mdsProx           Multidimensional scaling of proximity matrix
meanMargin        Mean classification margin
oobError          Out-of-bag error
oobMargin         Out-of-bag margins
oobMeanMargin     Out-of-bag mean margins
oobPredict        Ensemble predictions for out-of-bag observations
predict           Predict response

Properties

ClassNames

A cell array containing the class names for the response variable Y. This property is empty for regression trees.

ComputeOOBPrediction

A logical flag specifying whether out-of-bag predictions for training observations should be computed. The default is false.

If this flag is true, the following properties are available:

  • OOBIndices

  • OOBInstanceWeight

If this flag is true, the following methods can be called:

  • oobError

  • oobMargin

  • oobMeanMargin

See also oobError, OOBIndices, OOBInstanceWeight, oobMargin, oobMeanMargin.

ComputeOOBVarImp

A logical flag specifying whether out-of-bag estimates of variable importance should be computed. The default is false. If this flag is true, then ComputeOOBPrediction is true as well.

If this flag is true, the following properties are available:

  • OOBPermutedVarDeltaError

  • OOBPermutedVarDeltaMeanMargin

  • OOBPermutedVarCountRaiseMargin

Cost

A matrix with misclassification costs. This property is empty for ensembles of regression trees.

DefaultYfit

Default value returned by predict and oobPredict. The DefaultYfit property controls what predicted value is returned when no prediction is possible, for example, when oobPredict needs to predict for an observation that is in-bag for all trees in the ensemble.

  • For classification, you can set this property to either '' or 'MostPopular'. If you choose 'MostPopular' (the default), the property value becomes the name of the most probable class in the training data. If you choose '', the in-bag observations are excluded from computation of the out-of-bag error and margin.

  • For regression, you can set this property to any numeric scalar. The default value is the mean of the response for the training data. If you set this property to NaN, the in-bag observations are excluded from computation of the out-of-bag error and margin.
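
The effect of the regression default on the out-of-bag error can be illustrated with a small sketch. This is plain Python for illustration, not MATLAB code. The helper names are hypothetical, and None stands for an observation that was in-bag for every tree and so has no out-of-bag prediction.

```python
import math

def fill_default(oob_pred, default_yfit):
    """Replace missing out-of-bag predictions (None) with the default value."""
    return [default_yfit if p is None else p for p in oob_pred]

def oob_mse(pred, y):
    """Mean squared error over observations whose prediction is not NaN."""
    kept = [(p, t) for p, t in zip(pred, y) if not math.isnan(p)]
    return sum((p - t) ** 2 for p, t in kept) / len(kept)

y = [1.0, 2.0, 3.0, 6.0]
raw = [1.5, None, 2.5, 5.0]          # None: observation was in-bag for every tree

mean_y = sum(y) / len(y)             # mean of the training response
with_mean = fill_default(raw, mean_y)
with_nan = fill_default(raw, float("nan"))

mse_mean = oob_mse(with_mean, y)     # all four observations contribute
mse_nan = oob_mse(with_nan, y)       # the second observation is skipped
```

With the mean as the default, the observation without an out-of-bag prediction still enters the error. With NaN it is dropped, so the two estimates generally differ.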

DeltaCritDecisionSplit

A numeric array of size 1-by-Nvars of changes in the split criterion summed over splits on each variable, averaged across the entire ensemble of grown trees.

See also ClassificationTree.predictorImportance and RegressionTree.predictorImportance.

FBoot

Fraction of observations that are randomly selected with replacement for each bootstrap replica. The size of each replica is Nobs×FBoot, where Nobs is the number of observations in the training set. The default value is 1.
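
As a rough guide to how FBoot affects the amount of out-of-bag data, the following plain-Python sketch computes the chance that a given observation is never drawn in one replica sampled with replacement. It is an illustration, not MATLAB code. The function names are hypothetical, and rounding Nobs×FBoot to an integer is an assumption of the sketch.

```python
def replica_size(n_obs, fboot):
    """Replica size Nobs*FBoot; rounding to an integer is an assumption here."""
    return round(n_obs * fboot)

def prob_out_of_bag(n_obs, fboot):
    """Chance that one observation is never drawn in a replica sampled with replacement."""
    return (1.0 - 1.0 / n_obs) ** replica_size(n_obs, fboot)

p_full = prob_out_of_bag(1000, 1.0)   # close to exp(-1), about 0.37
p_half = prob_out_of_bag(1000, 0.5)   # close to exp(-0.5), about 0.61
```

With the default value of 1, roughly 37% of the observations are out of bag for any given tree. Smaller fractions leave more observations out of bag.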

MergeLeaves

A logical flag specifying whether decision tree leaves with the same parent are merged for splits that do not decrease the total risk. The default value is false.

Method

Method used by trees. The possible values are 'classification' for classification ensembles, and 'regression' for regression ensembles.

MinLeaf

Minimum number of observations per tree leaf. By default, MinLeaf is 1 for classification and 5 for regression. For decision tree training, the MinParent value is set equal to 2*MinLeaf.

NTrees

Scalar value equal to the number of decision trees in the ensemble.

NVarSplit

A numeric array of size 1-by-Nvars, where every element gives the number of splits on the corresponding predictor, summed over all trees.
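
The difference between the two aggregations (NVarSplit is summed over all trees, while DeltaCritDecisionSplit is summed over splits and then averaged across trees) can be sketched as follows. This is plain Python for illustration only. The data layout, a list of (variable index, criterion change) pairs per tree, is a hypothetical representation and not how TreeBagger stores its trees.

```python
def split_statistics(trees, n_vars):
    """trees: list of trees, each a list of (variable_index, criterion_change) splits."""
    n_var_split = [0] * n_vars
    delta_crit = [0.0] * n_vars
    for splits in trees:
        for var, change in splits:
            n_var_split[var] += 1        # counts are summed over all trees
            delta_crit[var] += change    # changes are summed over splits ...
    delta_crit = [d / len(trees) for d in delta_crit]   # ... then averaged over trees
    return n_var_split, delta_crit

# Two hypothetical trees over three predictors (indices 0, 1, 2).
trees = [
    [(0, 0.30), (1, 0.10)],
    [(0, 0.20), (0, 0.05), (2, 0.15)],
]
n_var_split, delta_crit = split_statistics(trees, n_vars=3)
```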

NVarToSample

Number of predictor or feature variables to select at random for each decision split. By default, NVarToSample is equal to the square root of the total number of variables for classification, and one third of the total number of variables for regression.
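
The default rule can be written out as a short calculation. This is a plain-Python sketch, not MATLAB code. The function name is hypothetical, and rounding up to a whole number of variables is an assumption of the sketch; the text above does not state how fractional values are rounded.

```python
import math

def default_nvar_to_sample(n_vars, method):
    """Default described above; rounding up is an assumption of this sketch."""
    if method == "classification":
        return math.ceil(math.sqrt(n_vars))
    if method == "regression":
        return max(1, math.ceil(n_vars / 3))
    raise ValueError("method must be 'classification' or 'regression'")

n_cls = default_nvar_to_sample(16, "classification")   # square root of 16
n_reg = default_nvar_to_sample(12, "regression")       # one third of 12
```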

OOBIndices

Logical array of size Nobs-by-NTrees, where Nobs is the number of observations in the training data and NTrees is the number of trees in the ensemble. A true value for the (i,j) element indicates that observation i is out-of-bag for tree j. In other words, observation i was not selected for the training data used to grow tree j.

OOBInstanceWeight

Numeric array of size Nobs-by-1 containing the number of trees used for computing the out-of-bag response for each observation. Nobs is the number of observations in the training data used to create the ensemble.
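
The relationship between OOBIndices and the per-observation tree counts can be sketched as follows. This is a plain-Python illustration with hypothetical names, not MATLAB code. It assumes each tree is grown on a replica of Nobs draws with replacement, and it reads the count for an observation as the number of trees for which that observation is out of bag.

```python
import random

def oob_indices(n_obs, n_trees, seed=0):
    """Return an n_obs-by-n_trees matrix (list of rows) of out-of-bag flags."""
    rng = random.Random(seed)
    oob = [[True] * n_trees for _ in range(n_obs)]
    for j in range(n_trees):
        for _ in range(n_obs):               # draw one bootstrap replica
            oob[rng.randrange(n_obs)][j] = False
    return oob

oob = oob_indices(n_obs=8, n_trees=20)
# Number of trees for which each observation is out of bag (one count per row).
trees_per_observation = [sum(row) for row in oob]
```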

OOBPermutedVarCountRaiseMargin

A numeric array of size 1-by-Nvars containing a measure of variable importance for each predictor variable (feature). For any variable, the measure is the difference between the number of raised margins and the number of lowered margins if the values of that variable are permuted across the out-of-bag observations. This measure is computed for every tree, then averaged over the entire ensemble and divided by the standard deviation over the entire ensemble. This property is empty for regression trees.
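
The per-tree count described above can be sketched as follows. This is plain Python for illustration, not MATLAB code. The function name and the margin values are hypothetical, and the sign convention (raised minus lowered) is this sketch's reading of the description.

```python
def count_raise_margin(margin_before, margin_after):
    """Raised minus lowered margins after permuting one variable, for one tree."""
    raised = sum(a > b for b, a in zip(margin_before, margin_after))
    lowered = sum(a < b for b, a in zip(margin_before, margin_after))
    return raised - lowered

before = [0.8, 0.4, -0.2, 0.6]   # hypothetical out-of-bag margins
after = [0.5, 0.4, 0.1, 0.2]     # hypothetical margins after permutation
delta = count_raise_margin(before, after)
```

In this made-up example one margin rises, two fall, and one is unchanged.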

OOBPermutedVarDeltaError

A numeric array of size 1-by-Nvars containing a measure of importance for each predictor variable (feature). For any variable, the measure is the increase in prediction error if the values of that variable are permuted across the out-of-bag observations. This measure is computed for every tree, then averaged over the entire ensemble and divided by the standard deviation over the entire ensemble.
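
The final aggregation step (average over the ensemble, then divide by the standard deviation over the ensemble) can be sketched as follows. This is a plain-Python illustration, not MATLAB code. The function name and the per-tree values are hypothetical, and the use of the sample standard deviation, as well as returning 0 when the spread is zero, are assumptions of the sketch.

```python
import statistics

def standardized_importance(delta_per_tree):
    """Mean per-tree change divided by its standard deviation across trees."""
    mean = statistics.mean(delta_per_tree)
    spread = statistics.stdev(delta_per_tree)   # sample std: an assumption
    return mean / spread if spread > 0 else 0.0

# Hypothetical increases in out-of-bag error for one variable, one value per tree.
delta_error = [0.10, 0.14, 0.06, 0.10]
importance = standardized_importance(delta_error)
```

Dividing by the spread means that a variable whose permutation raises the error consistently across trees scores higher than one with the same average increase but large tree-to-tree variation.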

OOBPermutedVarDeltaMeanMargin

A numeric array of size 1-by-Nvars containing a measure of importance for each predictor variable (feature). For any variable, the measure is the decrease in the classification margin if the values of that variable are permuted across the out-of-bag observations. This measure is computed for every tree, then averaged over the entire ensemble and divided by the standard deviation over the entire ensemble. This property is empty for regression trees.

OutlierMeasure

A numeric array of size Nobs-by-1, where Nobs is the number of observations in the training data, containing outlier measures for each observation.

See also CompactTreeBagger.OutlierMeasure.

Prior

A vector with prior probabilities for classes. This property is empty for ensembles of regression trees.

Proximity

A numeric matrix of size Nobs-by-Nobs, where Nobs is the number of observations in the training data, containing measures of the proximity between observations. For any two observations, their proximity is defined as the fraction of trees for which these observations land on the same leaf. This is a symmetric matrix with 1s on the diagonal and off-diagonal elements ranging from 0 to 1.
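
The proximity definition above can be written out directly. This is a plain-Python sketch, not MATLAB code. The function name and the input layout, a table with the leaf reached by each observation in each tree, are hypothetical.

```python
def proximity_matrix(leaf_ids):
    """leaf_ids[i][t] is the leaf that observation i reaches in tree t."""
    n_obs, n_trees = len(leaf_ids), len(leaf_ids[0])
    prox = [[0.0] * n_obs for _ in range(n_obs)]
    for i in range(n_obs):
        for j in range(n_obs):
            same = sum(leaf_ids[i][t] == leaf_ids[j][t] for t in range(n_trees))
            prox[i][j] = same / n_trees   # fraction of trees with a shared leaf
    return prox

# Three observations, four trees; the entries are hypothetical leaf numbers.
leaf_ids = [
    [1, 4, 2, 7],
    [1, 4, 3, 7],
    [2, 5, 3, 6],
]
prox = proximity_matrix(leaf_ids)
```

The first two observations share a leaf in three of the four trees, so their proximity is 0.75. The first and third never share a leaf, so their proximity is 0.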

See also CompactTreeBagger.proximity.

Prune

The Prune property is true if decision trees are pruned and false if they are not. Pruning decision trees is not recommended for ensembles. The default value is false.

See also ClassificationTree.prune and RegressionTree.prune.

SampleWithReplacement

A logical flag specifying whether data are sampled with replacement for each decision tree. The value is true if TreeBagger samples data with replacement and false otherwise. The default value is true.

TreeArgs

Cell array of arguments for fitctree or fitrtree. These arguments are used by TreeBagger when growing new trees for the ensemble.

Trees

A cell array of size NTrees-by-1 containing the trees in the ensemble.

See also NTrees.

VarAssoc

A matrix of size Nvars-by-Nvars with predictive measures of variable association, averaged across the entire ensemble of grown trees. If you grew the ensemble setting 'surrogate' to 'on', this matrix for each tree is filled with predictive measures of association averaged over the surrogate splits. If you grew the ensemble setting 'surrogate' to 'off' (default), VarAssoc is diagonal.

VarNames

A cell array containing the names of the predictor variables (features). TreeBagger takes these names from the optional 'names' parameter. The default names are 'x1', 'x2', etc.

W

Numeric vector of weights of length Nobs, where Nobs is the number of observations (rows) in the training data. TreeBagger uses these weights for growing every decision tree in the ensemble. The default W is ones(Nobs,1).

X

A numeric matrix of size Nobs-by-Nvars, where Nobs is the number of observations (rows) and Nvars is the number of variables (columns) in the training data. This matrix contains the predictor (or feature) values.

Y

An array of true class labels for classification, or response values for regression. Y can be a numeric column vector, a character matrix, or a cell array of strings.

Copy Semantics

Value. To learn how this affects your use of the class, see Comparing Handle and Value Classes in the MATLAB® Object-Oriented Programming documentation.
