Products & Services Solutions Academia Support User Community Company

Learn more about Statistics Toolbox   

Regression and Classification by Bagging Decision Trees

Introduction

Bagging, which stands for "bootstrap aggregation", is a type of ensemble learning. To bag a weak learner such as a decision tree on a dataset, generate many bootstrap replicas of this dataset and grow decision trees on these replicas. Obtain each bootstrap replica by randomly selecting N observations out of N with replacement, where N is the dataset size. To find the predicted response of a trained ensemble, take an average over predictions from individual trees.

Bagging works by reducing variance of an unbiased base learner such as a decision tree. The base learner must be unstable; its configuration must vary significantly from one bootstrap replica to another. A decision tree with fine leaves satisfies this requirement.

In the process of training decision trees on bootstrap replicas of input data, you can also randomly select input variables, or features, for each decision split. Instead of attempting splits on all variables at each node of a decision tree, the algorithm only looks at feature subsets of fixed size randomly selected from input variables. For example, if data has 10 input features, you can select 3 features at random for each decision split. If the best split uses one of the omitted seven variables, it is suboptimal. Surprisingly, this technique tends to improve the predictive power of the ensemble, as random selection of features reduces correlation between trees in the ensemble and increases the overall predictive power.

Several features of bagged decision trees make them a unique algorithm. Drawing N out of N observations with replacement omits on average 37% of observations for each decision tree. These are "out-of-bag" observations. You can use them to estimate the predictive power and feature importance. For each observation, you can estimate the out-of-bag prediction by averaging over predictions from all trees in the ensemble for which this observation is out of bag. You can then compare the computed prediction against the true response for this observation. By comparing the out-of-bag predicted responses against the true responses for all observations used for training, you can estimate the average out-of-bag error. This out-of-bag average is an unbiased estimator of the true ensemble error. You can also obtain out-of-bag estimates of feature importance by randomly permuting out-of-bag data across one variable or column at a time and estimating the increase in the out-of-bag error due to this permutation. The larger the increase, the more important the feature. Thus, you do not need to supply test data for bagged ensembles because you obtain reliable estimates of the predictive power and feature importance in the process of training, which is an attractive feature of bagging.

Another attractive feature of bagged decision trees is the proximity matrix. Every time two observations land on the same leaf of a tree, their proximity increases by one. For normalization, sum these proximities over all trees in the ensemble and divided by the number of trees. The resulting matrix is symmetric with diagonal elements equal to 1 and off-diagonal elements ranging from 0 to 1. You can use this matrix for finding outlier observations and discovering clusters in the data through multidimensional scaling.

The following examples showcase the Statistics Toolbox TreeBagger class and CompactTreeBagger class functionalities.

Examples

The following examples show how to use ensembles of decision trees for regression and classification.

Regression of Insurance Risk Rating for Car Imports

In this example, use a database of 1985 car imports with 205 observations, 25 input variables, and one response variable, insurance risk rating, or "symboling". The first 15 variables are numeric and the last 10 are categorical. The symboling index takes integer values from -3 to 3.

First, load the dataset and split it into predictor and response arrays:

load imports-85;
Y = X(:,1);
X = X(:,2:end);

Because bagging uses randomized data drawings, its exact outcome depends on the initial random seed. To reproduce the exact results in this example, use the random stream settings

s = RandStream('mt19937ar','seed',1945);
RandStream.setDefaultStream(s);

Finding the Optimal Leaf Size

For regression, the general rule is to set leaf size to 5 and select one third of input features for decision splits at random. In the following step, verify the optimal leaf size by comparing mean squared errors obtained by regression for various leaf sizes. oobError computes MSE versus the number of grown trees. It is necessary to set oobpred to 'on' to obtain out-of-bag predictions later.

leaf = [1 5 10 20 50 100];
col = 'rgbcmy';
figure(1);
for i=1:length(leaf)
    b = TreeBagger(50,X,Y,'method','r','oobpred','on',...
			'cat',16:25,'minleaf',leaf(i));
    plot(oobError(b),col(i));
    hold on;
end
xlabel('Number of Grown Trees');
ylabel('Mean Squared Error');
legend({'1' '5' '10' '20' '50' '100'},'Location','NorthEast');
hold off;

The red (leaf size 1) curve gives the lowest MSE values.

Estimating Feature Importance

In practical applications, you typically grow ensembles with hundreds of trees. Only 50 trees were used in the previous exercise for faster processing. Now that you have estimated the optimal leaf size, grow a larger ensemble with 100 trees and use it for estimation of feature importance:

b = TreeBagger(100,X,Y,'method','r','oobvarimp','on',...
'cat',16:25,'minleaf',1);

Inspect the error curve again to make sure nothing went wrong during training:

figure(2);
plot(oobError(b));
xlabel('Number of Grown Trees');
ylabel('Out-of-Bag Mean Squared Error');

Prediction ability should depend more on important features and less on unimportant features. You can use this idea to measure feature importance.

For each feature, you can permute the values of this feature across all of the observations in the data set and measure how much worse the mean squared error (MSE) becomes after the permutation. You can repeat this for each feature.

Using the following code, plot the increase in MSE due to permuting out-of-bag observations across each input variable. The OOBPermutedVarDeltaError array stores the increase in MSE averaged over all trees in the ensemble and divided by the standard deviation taken over the trees, for each variable. The larger this value, the more important the variable. Imposing an arbitrary cutoff at 0.65, you can select the five most important features.

figure(3);
bar(b.OOBPermutedVarDeltaError);
xlabel('Feature Number');
ylabel('Out-Of-Bag Feature Importance');
idxvar = find(b.OOBPermutedVarDeltaError>0.65)

idxvar =

     2     4     8    16    19

The OOBIndices property of TreeBagger keeps track of which observations are out of bag for what trees. Using this property, you can monitor the fraction of observations in the training data that are in bag for all trees. The curve starts at approximately 2/3, the fraction of unique observations selected by one bootstrap replica, and goes down to zero at approximately 10 trees.

finbag = zeros(1,b.NTrees);
for t=1:b.NTrees
    finbag(t) = sum(all(~b.OOBIndices(:,1:t),2));
end
finbag = finbag / size(X,1);
figure(4);
plot(finbag);
xlabel('Number of Grown Trees');
ylabel('Fraction of in-Bag Observations');

Growing Trees on a Reduced Set of Features

Using just the five most powerful features selected in Estimating Feature Importance, find out if it is possible to obtain a similar predictive power. To begin, grow 100 trees on these features only. The first three of the five selected features are numeric and the last two are categorical.

b5v = TreeBagger(100,X(:,idxvar),Y,'method','r',...
'oobvarimp','on','cat',4:5,'minleaf',1);
figure(5);
plot(oobError(b5v));
xlabel('Number of Grown Trees');
ylabel('Out-of-Bag Mean Squared Error');
figure(6);
bar(b5v.OOBPermutedVarDeltaError);
xlabel('Feature Index');
ylabel('Out-of-Bag Feature Importance');

These five most powerful features give the same MSE as the full set, and the ensemble trained on the reduced set ranks these features similarly to each other. Features 1 and 2 from the reduced set perhaps could be removed without a significant loss in the predictive power.

Finding Outliers

To find outliers in the training data, compute the proximity matrix using fillProximities:

b5v = fillProximities(b5v);

The method normalizes this measure by subtracting the mean outlier measure for the entire sample, taking the magnitude of this difference and dividing the result by the median absolute deviation for the entire sample.

figure(7);
hist(b5v.OutlierMeasure);
xlabel('Outlier Measure');
ylabel('Number of Observations');

Discovering Clusters in the Data

By applying multidimensional scaling to the computed matrix of proximities, you can inspect the structure of the input data and look for possible clusters of observations. The mdsProx method returns scaled coordinates and eigenvalues for the computed proximity matrix. If run with the colors option, this method makes a scatter plot of two scaled coordinates, first and second by default.

figure(8);
[~,e] = mdsProx(b5v,'colors','k');
xlabel('1st Scaled Coordinate');
ylabel('2nd Scaled Coordinate');

Assess the relative importance of the scaled axes by plotting the first 20 eigenvalues:

figure(9);
bar(e(1:20));
xlabel('Scaled Coordinate Index');
ylabel('Eigenvalue');

Saving the Ensemble Configuration for Future Use

To use the trained ensemble for predicting the response on unseen data, store the ensemble to disk and retrieve it later. If you do not want to compute predictions for out-of-bag data or reuse training data in any other way, there is no need to store the ensemble object itself. Saving the compact version of the ensemble would be enough in this case. Extract the compact object from the ensemble:

c = compact(b5v)

c = 

Ensemble with 100 decision trees:
Method:           regression
Nvars:                    5

This object can be now saved into a *.mat file as usual.

Classifying Radar Returns for Ionosphere Data

You can also use ensembles of decision trees for classification. For this example, use ionosphere data with 351 observations and 34 real-valued predictors. The response variable is categorical with two levels:

The goal is to predict good or bad using a set of 34 measurements.

The workflow is similar to the one for Regression of Insurance Risk Rating for Car Imports. Again, fix the initial random seed, grow 50 trees, inspect how the ensemble error changes with accumulation of trees, and estimate feature importance. For classification, it is best to set the minimal leaf size to 1 and select the square root of the total number of features for each decision split at random. These are the default settings for a TreeBagger used for classification.

load ionosphere;
s = RandStream('mt19937ar','seed',1945);
RandStream.setDefaultStream(s);
b = TreeBagger(50,X,Y,'oobvarimp','on');
figure(10);
plot(oobError(b));
xlabel('Number of Grown Trees');
ylabel('Out-of-Bag Classification Error');

The method trains ensembles with few trees on observations that are in bag for all trees. For such observations, it is not possible to compute the true out-of-bag prediction and TreeBagger returns the most probable class for classification and the sample mean for regression. You can change the default value returned for in-bag observations using the DefaultYfit property. If you set the default value to an empty string for classification, the method excludes in-bag observations from computation of the out-of-bag error. In this case, the curve is more variable when the number of trees is small, either because some observations are never out of bag (and are therefore excluded) or because their predictions are based on few trees.

b.DefaultYfit = '';
figure(11);
plot(oobError(b));
xlabel('Number of Grown Trees');
ylabel('Out-of-Bag Error Excluding in-Bag Observations');

The OOBIndices property of TreeBagger keeps track of which observations are out of bag for what trees. Using this property, you can monitor the fraction of observations in the training data that are in bag for all trees. The curve starts at approximately 2/3, the fraction of unique observations selected by one bootstrap replica, and goes down to zero at approximately 10 trees.

finbag = zeros(1,b.NTrees);
for t=1:b.NTrees
    finbag(t) = sum(all(~b.OOBIndices(:,1:t),2));
end
finbag = finbag / size(X,1);
figure(12);
plot(finbag);
xlabel('Number of Grown Trees');
ylabel('Fraction of in-Bag Observations');

Now estimate feature importance:

figure(13);
bar(b.OOBPermutedVarDeltaError);
xlabel('Feature Index');
ylabel('Out-of-Bag Feature Importance');
idxvar = find(b.OOBPermutedVarDeltaError>0.8)

idxvar =

     3     4     5     7     8

Having selected the five most important features, grow a larger ensemble on the reduced feature set. Save time by not permuting out-of-bag observations to obtain new estimates of feature importance for the reduced feature set (set oobvarimp to 'off'). You would still be interested in obtaining out-of-bag estimates of classification error (set oobpred to 'on').

b5v = TreeBagger(100,X(:,idxvar),Y,'oobpred','on');
figure(14);
plot(oobError(b5v));
xlabel('Number of Grown Trees');
ylabel('Out-of-Bag Classification Error');

For classification ensembles, in addition to classification error (fraction of misclassified observations), you can also monitor the average classification margin. For each observation, the margin is defined as the difference between the score for the true class and the maximal score for other classes predicted by this tree. The cumulative classification margin uses the scores averaged over all trees and the mean cumulative classification margin is the cumulative margin averaged over all observations. The oobMeanMargin method with the 'mode' argument set to 'cumulative' (default) shows how the mean cumulative margin changes as the ensemble grows: every new element in the returned array represents the cumulative margin obtained by including a new tree in the ensemble. If training is successful, you would expect to see a gradual increase in the mean classification margin.

For decision trees, a classification score is the probability of observing an instance of this class in this tree leaf. For example, if the leaf of a grown decision tree has five 'good' and three 'bad' training observations in it, the scores returned by this decision tree for any observation fallen on this leaf are 5/8 for the 'good' class and 3/8 for the 'bad' class. These probabilities are referred to as 'scores' for consistency with other classifiers that may not have an obvious interpretation for numeric values of returned predictions.

figure(15);
plot(oobMeanMargin(b5v));
xlabel('Number of Grown Trees');
ylabel('Out-of-Bag Mean Classification Margin');

Again, compute the matrix of proximities and look at the distribution of outlier measures. Unlike regression, outlier measures for classification ensembles are computed within each class separately.

b5v = fillProximities(b5v);
figure(16);
hist(b5v.OutlierMeasure);
xlabel('Outlier Measure');
ylabel('Number of Observations');

All extreme outliers for this dataset come from the 'good' class:

b5v.Y(b5v.OutlierMeasure>40)

ans = 

    'g'
    'g'
    'g'
    'g'
    'g''

Just like for regression, you can plot scaled coordinates, displaying the two classes in different colors using the colors argument of mdsProx. This argument takes a string in which every character represents a color. To find out the order of classes used by the ensemble, look at the ClassNames property:

b5v.ClassNames

ans = 

    'g'
    'b'

The 'good' class is first and the 'bad' class is second. Display scaled coordinates using red for 'good' and blue for 'bad' observations.

figure(17);
[s,e] = mdsProx(b5v,'colors','rb');
xlabel('1st Scaled Coordinate');
ylabel('2nd Scaled Coordinate');
legend({'good' 'bad'},'Location','NorthWest');

Again, plot the first 20 eigenvalues obtained by scaling. The first eigenvalue in this case clearly dominates and the first scaled coordinate is most important.

figure(18);
bar(e(1:20));
xlabel('Scaled Coordinate Index');
ylabel('Eigenvalue');

Plotting a Performance Curve

Another way of exploring the performance of a classification ensemble is to plot its Receiver Operating Characteristic (ROC) curve or another performance curve suitable for the current problem. First, you need to obtain predictions for out-of-bag observations. For a classification ensemble, the oobPredict method returns a cell array of classification labels ('g' or 'b' for ionosphere data) as the first output argument and a numeric array of scores as the second output argument. The returned array of scores has two columns, one for each class. In this case, the first column is for the 'good' class and the second column is for the 'bad' class. One of the columns in the score matrix is redundant because the scores represent class probabilities in tree leaves and by definition add up to 1.

[Yfit,Sfit] = oobPredict(b5v);

You can now use the perfcurve utility (described in more detail in Performance Curves) to compute a performance curve. By default, perfcurve returns the standard ROC curve, which is the true positive rate versus false positive rate. perfcurve requires true class labels, scores, and the positive class label for input. In this case, choose the 'good' class as positive. The scores for this class are in the first column of Sfit.

[fpr,tpr] = perfcurve(b5v.Y,Sfit(:,1),'g');
figure(19);
plot(fpr,tpr);
xlabel('False Positive Rate');
ylabel('True Positive Rate');

Instead of the standard ROC curve, you might want to plot, for example, ensemble accuracy versus threshold on the score for the 'good' class. The ycrit input argument of perfcurve lets you specify the criterion for the y-axis, and the third output argument of perfcurve returns an array of thresholds for the positive class score. Accuracy is the fraction of correctly classified observations, or equivalently, one minus classification error.

[fpr,accu,thre] = perfcurve(b5v.Y,Sfit(:,1),'g','ycrit','accu');
figure(20);
plot(thre,accu);
xlabel('Threshold for ''good'' Returns');
ylabel('Classification Accuracy');

The curve shows a flat region indicating that any threshold from 0.2 to 0.6 is a reasonable choice. By default, the function assigns classification labels using 0.5 as the boundary between the two classes. You can find exactly what accuracy this corresponds to:

i50 = find(accu>=0.50,1,'first')
accu(abs(thre-0.5)<eps)

returns

i50 = 

    2

ans =

    0.9430

The maximal accuracy is a little higher than the default one.

[maxaccu,iaccu] = max(accu)

returns

maxaccu =

    0.9459


iaccu =

    92

The optimal threshold is therefore:

thre(iaccu)

ans =

    0.5056

See Also

bootstrp, classify, ClassNames, CompactTreeBagger, fillProximities, perfcurve, randstream, TreeBagger, oobError, OOBPermutedVarDeltaError, perfcurve

  


Recommended Products

Includes the most popular MATLAB recorded presentations with Q&A sessions led by MATLAB experts.

 © 1984-2009- The MathWorks, Inc.    -   Site Help   -   Patents   -   Trademarks   -   Privacy Policy   -   Preventing Piracy   -   RSS