Create bag of decision trees
Individual decision trees tend to overfit. Bootstrap-aggregated
(bagged) decision trees combine the results
of many decision trees, which reduces the effects of overfitting.
TreeBagger grows the decision
trees in the ensemble using bootstrap samples of the data. Also,
TreeBagger selects a random subset of predictors to use at each decision split, as in
the random forest algorithm [1].
By default, TreeBagger bags classification
trees. To bag regression trees instead, specify 'Method','regression'.
For regression problems, TreeBagger supports
mean and quantile regression (that is, quantile regression forest [5]).
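To make the bootstrap step concrete, the following sketch draws one bootstrap replicate by hand and marks the observations that end up out of bag. It uses only base MATLAB functions. The variable names (n, inBagIdx, isOutOfBag) are illustrative and are not part of TreeBagger, which performs this sampling internally.

```matlab
rng(1);                              % For reproducibility
n = 150;                             % Number of observations (illustrative value)
inBagIdx = randi(n,n,1);             % Sample n row indices with replacement
isOutOfBag = true(n,1);
isOutOfBag(inBagIdx) = false;        % Rows that were never drawn are out of bag
fracOutOfBag = mean(isOutOfBag)      % Typically close to exp(-1), about 0.37
```

Because each tree sees a different replicate, each tree also has its own out-of-bag set, which is what makes out-of-bag error estimation possible without a separate validation set.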
Mdl = TreeBagger(NumTrees,Tbl,ResponseVarName)
Mdl = TreeBagger(NumTrees,Tbl,formula)
Mdl = TreeBagger(NumTrees,Tbl,Y)
B = TreeBagger(NumTrees,X,Y)
B = TreeBagger(NumTrees,X,Y,Name,Value)
Mdl = TreeBagger(NumTrees,Tbl,ResponseVarName) returns an ensemble of
NumTrees bagged classification trees
trained using the sample data in the table Tbl. ResponseVarName is
the name of the response variable in Tbl.
Mdl = TreeBagger(NumTrees,Tbl,formula) returns
an ensemble of bagged classification trees trained using the sample
data in the table Tbl. formula is
an explanatory model of the response and a subset of predictor variables
in Tbl used to fit Mdl. Specify formula using
Wilkinson notation. For more information, see Wilkinson Notation.
Mdl = TreeBagger(NumTrees,Tbl,Y) returns
an ensemble of classification trees using the predictor variables
in the table Tbl and class labels in vector Y.
Y is an array of response data. Elements
of Y correspond to the rows of Tbl.
For classification, Y is the set of true class
labels. Labels can be any grouping variable,
that is, a numeric or logical vector, character matrix, cell vector
of character vectors, or categorical vector. TreeBagger converts
labels to a cell array of character vectors. For regression, Y is
a numeric vector. To grow regression trees, you must specify the name-value
pair 'Method','regression'.
B = TreeBagger(NumTrees,X,Y) creates an ensemble B of NumTrees decision
trees for predicting response
Y as a function of
predictors in the numeric matrix of training data, X.
Each row in
X represents an observation and each
column represents a predictor or feature.
B = TreeBagger(NumTrees,X,Y,Name,Value) specifies
optional parameter name-value pairs:
|'InBagFraction'|Fraction of the input data to sample with replacement for growing each new tree. Default value is 1.|
The default value is
|'NumPredictorsToSample'|Number of variables to select at random for each decision split.
Default is the square root of the number of variables for classification
and one third of the number of variables for regression. Valid values
|Number of training cycles (grown trees) after which |
|'MinLeafSize'|Minimum number of observations per tree leaf. Default is 1 for classification and 5 for regression.|
|A structure that specifies options that govern the computation
when growing the ensemble of decision trees. One option requests that
the computation of decision trees on multiple bootstrap replicates
uses multiple processors, if the Parallel
Computing Toolbox™ is
available. Two options specify the random number streams to use in
selecting bootstrap replicates. You can create this argument with
a call to |
Prior probabilities for each class. Specify as one of:
If you set values for both
Predictor variable names, specified as the comma-separated
pair consisting of
Categorical predictors list, specified as the comma-separated
pair consisting of
Chunk size, specified as the comma-separated pair consisting
This option applies only when using TreeBagger on tall arrays.
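As an illustration of how name-value pairs combine in one call, the sketch below sets several of them together. It assumes that Statistics and Machine Learning Toolbox and the fisheriris sample data are available. The specific values are arbitrary choices for demonstration, not recommendations.

```matlab
load fisheriris                      % Provides meas (150-by-4) and species
rng(1);                              % For reproducibility
Mdl = TreeBagger(30,meas,species, ...
    'Method','classification', ...
    'InBagFraction',0.8, ...         % Each tree samples 80% of the rows
    'NumPredictorsToSample',3, ...   % Candidate predictors per split
    'MinLeafSize',5, ...             % Larger leaves than the classification default
    'OOBPrediction','on');           % Keep out-of-bag information
Mdl.NumPredictorsToSample            % Display the stored setting
```

The trained object stores these settings as properties, so you can inspect them after training to confirm what was used.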
Load Fisher's iris data set.
load fisheriris
Train an ensemble of bagged classification trees using the entire data set. Specify
50 weak learners. Store which observations are out of bag for each tree.
rng(1); % For reproducibility
Mdl = TreeBagger(50,meas,species,'OOBPrediction','On',...
    'Method','classification')
Mdl =
  TreeBagger
Ensemble with 50 bagged decision trees:
    Training X: [150x4]
    Training Y: [150x1]
    Method: classification
    NumPredictors: 4
    NumPredictorsToSample: 2
    MinLeafSize: 1
    InBagFraction: 1
    SampleWithReplacement: 1
    ComputeOOBPrediction: 1
    ComputeOOBPredictorImportance: 0
    Proximity:
    ClassNames: 'setosa' 'versicolor' 'virginica'
Mdl is a TreeBagger ensemble.
Mdl.Trees stores a 50-by-1 cell vector of the trained classification trees (CompactClassificationTree model objects) that compose the ensemble.
Plot a graph of the first trained classification tree.
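The plotting call does not appear in this text. One way to produce the graph, sketched here so that it runs on its own by retraining the same ensemble, is to pass the first element of Mdl.Trees to the view function of the tree object.

```matlab
load fisheriris
rng(1);                              % For reproducibility
Mdl = TreeBagger(50,meas,species,'OOBPrediction','On',...
    'Method','classification');
firstTree = Mdl.Trees{1};            % First tree in the 50-by-1 cell vector
view(firstTree,'Mode','graph')       % Open the tree in a graphical viewer
```

Omitting the 'Mode','graph' pair prints a text description of the tree to the Command Window instead.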
By default, TreeBagger grows deep trees.
Mdl.OOBIndices stores the out-of-bag indices as a matrix of logical values.
Plot the out-of-bag error over the number of grown classification trees.
figure;
oobErrorBaggedEnsemble = oobError(Mdl);
plot(oobErrorBaggedEnsemble)
xlabel 'Number of grown trees';
ylabel 'Out-of-bag classification error';
The out-of-bag error decreases with the number of grown trees.
To label out-of-bag observations, pass
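One way to label out-of-bag observations, sketched here under the assumption that the ensemble was trained with 'OOBPrediction','on', is the oobPredict function. The variable names oobLabels and oobAccuracy are illustrative.

```matlab
load fisheriris
rng(1);                              % For reproducibility
Mdl = TreeBagger(50,meas,species,'OOBPrediction','on',...
    'Method','classification');
oobLabels = oobPredict(Mdl);         % Predicted class name for each observation
oobAccuracy = mean(strcmp(oobLabels,species))  % Fraction labeled correctly
```

Each observation is labeled using only the trees for which it was out of bag, so the resulting accuracy is an estimate of performance on unseen data.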
Load the carsmall data set. Consider a model that predicts the fuel economy of a car given its engine displacement.
load carsmall
Train an ensemble of bagged regression trees using the entire data set. Specify 100 weak learners.
rng(1); % For reproducibility
Mdl = TreeBagger(100,Displacement,MPG,'Method','regression');
Mdl is a TreeBagger ensemble.
Using a trained bag of regression trees, you can estimate conditional mean responses or perform quantile regression to predict conditional quantiles.
For ten equally spaced engine displacements between the minimum and maximum in-sample displacement, predict conditional mean responses and conditional quartiles.
predX = linspace(min(Displacement),max(Displacement),10)';
mpgMean = predict(Mdl,predX);
mpgQuartiles = quantilePredict(Mdl,predX,'Quantile',[0.25,0.5,0.75]);
Plot the observations, estimated mean responses, and quartiles in the same figure.
figure;
plot(Displacement,MPG,'o');
hold on
plot(predX,mpgMean);
plot(predX,mpgQuartiles);
ylabel('Fuel economy');
xlabel('Engine displacement');
legend('Data','Mean Response','First quartile','Median','Third quartile');
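The same quantilePredict call can also give rough prediction intervals. The sketch below retrains the model so that it runs on its own, then estimates an interval from the 0.025 and 0.975 conditional quantiles. The 95% level and the names mpgInterval and intervalWidth are arbitrary choices for illustration.

```matlab
load carsmall
rng(1);                              % For reproducibility
Mdl = TreeBagger(100,Displacement,MPG,'Method','regression');
predX = linspace(min(Displacement),max(Displacement),10)';
% Column 1 holds the 0.025 quantile, column 2 the 0.975 quantile
mpgInterval = quantilePredict(Mdl,predX,'Quantile',[0.025,0.975]);
intervalWidth = mpgInterval(:,2) - mpgInterval(:,1)  % One width per row of predX
```

Wider intervals indicate displacement ranges where the observed fuel economy varies more, which the conditional mean alone does not show.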
Load the carsmall data set. Consider a model that predicts the mean fuel economy of a car given its acceleration, number of cylinders, engine displacement, horsepower, manufacturer, model year, and weight. Consider Cylinders, Mfg, and Model_Year as categorical variables.
load carsmall
Cylinders = categorical(Cylinders);
Mfg = categorical(cellstr(Mfg));
Model_Year = categorical(Model_Year);
X = table(Acceleration,Cylinders,Displacement,Horsepower,Mfg,...
    Model_Year,Weight,MPG);
rng('default'); % For reproducibility
Display the number of categories represented in the categorical variables.
numCylinders = numel(categories(Cylinders))
numMfg = numel(categories(Mfg))
numModelYear = numel(categories(Model_Year))
numCylinders = 3
numMfg = 28
numModelYear = 3
Because there are only 3 categories in Cylinders and Model_Year, the standard CART predictor-splitting algorithm prefers splitting a continuous predictor over these two variables.
Train a random forest of 200 regression trees using the entire data set. To grow unbiased trees, specify usage of the curvature test for splitting predictors. Because there are missing values in the data, specify usage of surrogate splits. Store the out-of-bag information for predictor importance estimation.
Mdl = TreeBagger(200,X,'MPG','Method','regression','Surrogate','on',...
    'PredictorSelection','curvature','OOBPredictorImportance','on');
TreeBagger stores predictor importance estimates in the property OOBPermutedPredictorDeltaError. Compare the estimates using a bar graph.
imp = Mdl.OOBPermutedPredictorDeltaError;
figure;
bar(imp);
title('Curvature Test');
ylabel('Predictor importance estimates');
xlabel('Predictors');
h = gca;
h.XTickLabel = Mdl.PredictorNames;
h.XTickLabelRotation = 45;
h.TickLabelInterpreter = 'none';
In this case,
Model_Year is the most important predictor, followed by
Compare imp to predictor importance estimates computed from a random forest that grows trees using standard CART.
MdlCART = TreeBagger(200,X,'MPG','Method','regression','Surrogate','on',...
    'OOBPredictorImportance','on');
impCART = MdlCART.OOBPermutedPredictorDeltaError;
figure;
bar(impCART);
title('Standard CART');
ylabel('Predictor importance estimates');
xlabel('Predictors');
h = gca;
h.XTickLabel = Mdl.PredictorNames;
h.XTickLabelRotation = 45;
h.TickLabelInterpreter = 'none';
In this case, Weight, a continuous predictor, is the most important. The next two most important predictors are Model_Year followed closely by Horsepower, which is a continuous predictor.
Avoid large estimated out-of-bag error variances by setting a more balanced misclassification cost matrix or a less skewed prior probability vector.
Standard CART tends to select split predictors containing many distinct values, e.g., continuous variables, over those containing few distinct values, e.g., categorical variables. Consider specifying the curvature or interaction test if any of the following are true:
If there are predictors that have relatively fewer distinct values than other predictors, for example, if the predictor data set is heterogeneous.
If an analysis of predictor importance is your goal. TreeBagger stores
predictor importance estimates in the OOBPermutedPredictorDeltaError property.
TreeBagger generates in-bag samples
by oversampling classes with large misclassification costs and undersampling
classes with small misclassification costs. Consequently, out-of-bag
samples have fewer observations from classes with large misclassification
costs and more observations from classes with small misclassification
costs. If you train a classification ensemble using a small data set
and a highly skewed cost matrix, then the number of out-of-bag observations
per class might be very low. Therefore, the estimated out-of-bag error
might have a large variance and might be difficult to interpret. The
same phenomenon can occur for classes with large prior probabilities.
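To check whether out-of-bag counts per class are low for a given ensemble, one option is to tally the OOBIndices matrix by class. The sketch below does this on the fisheriris data, which is balanced and uses default costs, so the counts come out similar across classes. The check is more informative on a small data set with a skewed cost matrix or prior. The loop variables are illustrative names.

```matlab
load fisheriris
rng(1);                              % For reproducibility
Mdl = TreeBagger(50,meas,species,'OOBPrediction','on',...
    'Method','classification');
oob = Mdl.OOBIndices;                % 150-by-50 logical, true means out of bag
classNames = Mdl.ClassNames;
for k = 1:numel(classNames)
    inClass = strcmp(species,classNames{k});
    meanOOBCount = mean(sum(oob(inClass,:),1));  % Average over the 50 trees
    fprintf('%s: %.1f out-of-bag observations per tree\n', ...
        classNames{k},meanOOBCount);
end
```

If a class averages only a handful of out-of-bag observations per tree, treat the out-of-bag error for that class with caution.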
[1] Breiman, L. “Random Forests.” Machine Learning, Vol. 45, 2001, pp. 5–32.
[2] Breiman, L., J. Friedman, R. Olshen, and C. Stone. Classification and Regression Trees. Boca Raton, FL: CRC Press, 1984.
[3] Loh, W.Y. “Regression Trees with Unbiased Variable Selection and Interaction Detection.” Statistica Sinica, Vol. 12, 2002, pp. 361–386.
[4] Loh, W.Y. and Y.S. Shih. “Split Selection Methods for Classification Trees.” Statistica Sinica, Vol. 7, 1997, pp. 815–840.
[5] Meinshausen, N. “Quantile Regression Forests.” Journal of Machine Learning Research, Vol. 7, 2006, pp. 983–999.
This function supports tall arrays for out-of-memory data with these limitations:
Supported syntaxes for tall arrays are:
B = TreeBagger(NumTrees,Tbl,Y)
B = TreeBagger(NumTrees,X,Y)
B = TreeBagger(___,Name,Value)
For tall arrays, TreeBagger supports
classification. Regression is not supported.
Supported name-value pairs are:
For 'NumPredictorsToSample', the default value is the square root of the number of variables for classification.
'MinLeafSize' — Default
1 if the number of observations is less
than 50,000. If the number of observations is larger than 50,000,
then the default value is
'ChunkSize' (only for tall arrays)
— Default value is
TreeBagger supports these
optional arguments of
For tall data, TreeBagger returns a compact object
that contains most of the same properties as a full TreeBagger object.
The main difference is that the compact object is more memory efficient.
The compact object does not include properties that include the data,
or that include an array of the same size as the data.
predict methods do not support the name-value
additionally do not support
TreeBagger creates a random forest
by generating trees on disjoint chunks of the data. When more data
is available than is required to create the random forest, the data
is subsampled. For a similar example, see Random Forests for Big Data (Genuer,
Poggi, Tuleau-Malot, Villa-Vialaneix 2015).
Depending on how the data is stored, it is possible that some
chunks of data contain observations from only a few classes out of
all the classes. In this case, TreeBagger might
produce inferior results compared to the case where each chunk of
data contains observations from most of the classes.
For more information, see Tall Arrays (MATLAB).