Documentation |
You have several methods for melding results from many weak learners into one high-quality ensemble predictor. These methods closely follow the same syntax, so you can try different methods with minor changes in your commands.
Create an ensemble with the fitensemble function. Its syntax is
ens = fitensemble(X,Y,model,numberens,learners)
X is the matrix of data. Each row contains one observation, and each column contains one predictor variable.
Y is the vector of responses, with the same number of observations as the rows in X.
model is a string naming the type of ensemble.
numberens is the number of weak learners in ens from each element of learners. So the number of elements in ens is numberens times the number of elements in learners.
learners is either a string naming a weak learner, a weak learner template, or a cell array of such templates.
Pictorially, here is the information you need to create an ensemble:
For all classification or nonlinear regression problems, follow these steps to create an ensemble:
All supervised learning methods start with a data matrix, usually called X in this documentation. Each row of X represents one observation. Each column of X represents one variable, or predictor.
You can use a wide variety of data types for response data.
For regression ensembles, Y must be a numeric vector with the same number of elements as the number of rows of X.
For classification ensembles, Y can be any of the following data types. This table also contains the method of including missing entries.
Data Type | Missing Entry |
---|---|
Numeric vector | NaN |
Categorical vector | <undefined> |
Character array | Row of spaces |
Cell array of strings | '' |
Logical vector | (not possible to represent) |
fitensemble ignores missing values in Y when creating an ensemble.
For example, suppose your response data consists of three observations in the following order: true, false, true. You could express Y as:
[1;0;1] (numeric vector)
nominal({'true','false','true'}) (categorical vector)
[true;false;true] (logical vector)
['true ';'false';'true '] (character array, padded with spaces so each row has the same length)
{'true','false','true'} (cell array of strings)
Use whichever data type is most convenient. Because you cannot represent missing values with logical entries, do not use logical entries when you have missing values in Y.
fitensemble uses one of these algorithms to create an ensemble.
For classification with two classes:
'AdaBoostM1'
'LogitBoost'
'GentleBoost'
'RobustBoost' (requires an Optimization Toolbox™ license)
'LPBoost' (requires an Optimization Toolbox license)
'TotalBoost' (requires an Optimization Toolbox license)
'RUSBoost'
'Subspace'
'Bag'
For classification with three or more classes:
'AdaBoostM2'
'LPBoost' (requires an Optimization Toolbox license)
'TotalBoost' (requires an Optimization Toolbox license)
'RUSBoost'
'Subspace'
'Bag'
For regression:
'LSBoost'
'Bag'
'Bag' applies to all methods. When using 'Bag', indicate whether you want a classifier or regressor with the type name-value pair set to 'classification' or 'regression'.
For descriptions of the various algorithms, see Ensemble Algorithms.
See Suggestions for Choosing an Appropriate Ensemble Algorithm.
This table lists characteristics of the various algorithms. In the table titles:
Regress. — Regression
Classif. — Classification
Preds. — Predictors
Imbalance — Good for imbalanced data (one class has many more observations than the other)
Stop — Algorithm self-terminates
Sparse — Requires fewer weak learners than other ensemble algorithms
Algorithm | Regress. | Binary Classif. | Binary Classif. Multi- Level Preds. | Classif. 3+ Classes | Class Imbalance | Stop | Sparse |
---|---|---|---|---|---|---|---|
Bag | × | × | × | ||||
AdaBoostM1 | × | ||||||
AdaBoostM2 | × | ||||||
LogitBoost | × | × | |||||
GentleBoost | × | × | |||||
RobustBoost | × | ||||||
LPBoost | × | × | × | × | |||
TotalBoost | × | × | × | × | |||
RUSBoost | × | × | × | ||||
LSBoost | × | ||||||
Subspace | × | × |
RobustBoost, LPBoost, and TotalBoost require an Optimization Toolbox license. Try TotalBoost before LPBoost, as TotalBoost can be more robust.
Suggestions for Choosing an Appropriate Ensemble Algorithm.
Regression — Your choices are LSBoost or Bag. See General Characteristics of Ensemble Algorithms for the main differences between boosting and bagging.
Binary Classification — Try AdaBoostM1 first, with these modifications:
Data Characteristic | Recommended Algorithm |
---|---|
Many predictors | Subspace |
Skewed data (many more observations of one class) | RUSBoost |
Categorical predictors with over 31 levels | LogitBoost or GentleBoost |
Label noise (some training data has the wrong class) | RobustBoost |
Many observations | Avoid LPBoost, TotalBoost, and Bag |
Multiclass Classification — Try AdaBoostM2 first, with these modifications:
Data Characteristic | Recommended Algorithm |
---|---|
Many predictors | Subspace |
Skewed data (many more observations of one class) | RUSBoost |
Many observations | Avoid LPBoost, TotalBoost, and Bag |
For details of the algorithms, see Ensemble Algorithms.
General Characteristics of Ensemble Algorithms.
Bag generally constructs deep trees. This construction is both time consuming and memory-intensive. This also leads to relatively slow predictions.
Boost algorithms generally use very shallow trees. This construction uses relatively little time or memory. However, for effective predictions, boosted trees might need more ensemble members than bagged trees. Therefore it is not always clear which class of algorithms is superior.
Bag can estimate the generalization error without additional cross validation. See oobLoss.
Except for Subspace, all boosting and bagging algorithms are based on tree learners. Subspace can use either discriminant analysis or k-nearest neighbor learners.
For details of the characteristics of individual ensemble members, see Characteristics of Algorithms.
Choosing the size of an ensemble involves balancing speed and accuracy.
Larger ensembles take longer to train and to generate predictions.
Some ensemble algorithms can become overtrained (inaccurate) when too large.
To set an appropriate size, consider starting with several dozen to several hundred members in an ensemble, training the ensemble, and then checking the ensemble quality, as in Test Ensemble Quality. If it appears that you need more members, add them using the resume method (classification) or the resume method (regression). Repeat until adding more members does not improve ensemble quality.
Tip For classification, the LPBoost and TotalBoost algorithms are self-terminating, meaning you do not have to investigate the appropriate ensemble size. Try setting numberens to 500. The algorithms usually terminate with fewer members. |
Currently the weak learner types are:
'Discriminant' (recommended for Subspace ensemble)
'KNN' (only for Subspace ensemble)
'Tree' (for any ensemble except Subspace)
There are two ways to set the weak learner type in the ensemble:
To create an ensemble with default weak learner options, pass in the string as the weak learner. For example,
ens = fitensemble(X,Y,'AdaBoostM2',50,'Tree'); % or ens = fitensemble(X,Y,'Subspace',50,'KNN');
To create an ensemble with nondefault weak learner options, create a nondefault weak learner using the appropriate template method. For example, if you have missing data, and want to use trees with surrogate splits for better accuracy:
templ = templateTree('Surrogate','all'); ens = fitensemble(X,Y,'AdaBoostM2',50,templ);
To grow trees with roughly 10 leaf nodes per tree:
templ = templateTree('MinLeaf',size(X,1)/10); ens = fitensemble(X,Y,'AdaBoostM2',50,templ);
While you can give fitensemble a cell array of learner templates, the most common usage is to give just one weak learner template.
For examples using a template, see Example: Unequal Classification Costs and Surrogate Splits.
Decision trees can handle NaN values in X. Such values are called "missing." If you have some missing values in a row of X, a decision tree finds optimal splits using nonmissing values only. If an entire row consists of NaN, fitensemble ignores that row. If you have data with a large fraction of missing values in X, use surrogate decision splits. For examples of surrogate splits, see Example: Unequal Classification Costs and Surrogate Splits.
Common Settings for Tree Weak Learners.
The depth of a weak learner tree makes a difference for training time, memory usage, and predictive accuracy. You control the depth with two parameters:
MinLeaf — Each leaf has at least MinLeaf observations. Set small values of MinLeaf to get a deep tree.
MinParent — Each branch node in the tree has at least MinParent observations. Set small values of MinParent to get a deep tree.
If you supply both MinParent and MinLeaf, the learner uses the setting that gives larger leaves:
MinParent = max(MinParent,2*MinLeaf)
Surrogate — Grow decision trees with surrogate splits when Surrogate is 'on'. Use surrogate splits when your data has missing values.
The syntax of fitensemble is
ens = fitensemble(X,Y,model,numberens,learners)
X is the matrix of data. Each row contains one observation, and each column contains one predictor variable.
Y is the responses, with the same number of observations as rows in X.
model is a string naming the type of ensemble.
numberens is the number of weak learners in ens from each element of learners. So the number of elements in ens is numberens times the number of elements in learners.
learners is a string naming a weak learner, a weak learner template, or a cell array of such strings and templates.
The result of fitensemble is an ensemble object, suitable for making predictions on new data. For a basic example of creating a classification ensemble, see Create a Classification Ensemble. For a basic example of creating a regression ensemble, see Creating a Regression Ensemble.
Where to Set Name-Value Pairs. There are several name-value pairs you can pass to fitensemble, and several that apply to the weak learners (templateDiscriminant, templateKNN, templateTree and templateTree). To determine which option (name-value pair) is appropriate, the ensemble or the weak learner:
Use template name-value pairs to control the characteristics of the weak learners.
Use fitensemble name-value pairs to control the ensemble as a whole, either for algorithms or for structure.
For example, to have an ensemble of boosted classification trees with each tree deeper than the default, set the templateTree name-value pairs (MinLeaf and MinParent) to smaller values than the defaults. This causes the trees to be leafier (deeper).
To name the predictors in the ensemble (part of the structure of the ensemble), use the PredictorNames name-value pair in fitensemble.
Create a classification tree ensemble for the Fisher iris data, and use it to predict the classification of a flower with average measurements.
Load the data:
load fisheriris
The predictor data X is the meas matrix.
The response data Y is the species cell array.
For classification trees with three or more classes, Suggestions for Choosing an Appropriate Ensemble Algorithm suggests using an 'AdaBoostM2' ensemble.
For this example, arbitrarily take an ensemble of 100 trees.
Use a default tree template.
Create the ensemble:
ens = fitensemble(meas,species,'AdaBoostM2',100,'Tree')
ens = classreg.learning.classif.ClassificationEnsemble PredictorNames: {'x1' 'x2' 'x3' 'x4'} ResponseName: 'Y' ClassNames: {'setosa' 'versicolor' 'virginica'} ScoreTransform: 'none' NumObservations: 150 NumTrained: 100 Method: 'AdaBoostM2' LearnerNames: {'Tree'} ReasonForTermination: [1x77 char] FitInfo: [100x1 double] FitInfoDescription: {2x1 cell} Properties, Methods
Predict the classification of a flower with average measurements:
flower = predict(ens,mean(meas))
flower = 'versicolor'
Create a regression ensemble to predict mileage of cars based on their horsepower and weight, trained on the carsmall data. Use the resulting ensemble to predict the mileage of a car with 150 horsepower weighing 2750 lbs.
Load the data:
load carsmall
Prepare the input data.
X = [Horsepower Weight];
The response data Y is MPG.
The only boosted regression ensemble type is 'LSBoost'.
For this example, arbitrarily take an ensemble of 100 trees.
Use a default tree template.
Create the ensemble:
ens = fitensemble(X,MPG,'LSBoost',100,'Tree')
ens = classreg.learning.regr.RegressionEnsemble PredictorNames: {'x1' 'x2'} ResponseName: 'Y' ResponseTransform: 'none' NumObservations: 94 NumTrained: 100 Method: 'LSBoost' LearnerNames: {'Tree'} ReasonForTermination: [1x77 char] FitInfo: [100x1 double] FitInfoDescription: {2x1 cell} Regularization: [] Properties, Methods
Predict the mileage of a car with 150 horsepower weighing 2750 lbs:
mileage = ens.predict([150 2750])
mileage = 22.4180
Usually you cannot evaluate the predictive quality of an ensemble based on its performance on training data. Ensembles tend to "overtrain," meaning they produce overly optimistic estimates of their predictive power. This means the result of resubLoss for classification (resubLoss for regression) usually indicates lower error than you get on new data.
To obtain a better idea of the quality of an ensemble, use one of these methods:
Evaluate the ensemble on an independent test set (useful when you have a lot of training data).
Evaluate the ensemble by cross validation (useful when you don't have a lot of training data).
Evaluate the ensemble on out-of-bag data (useful when you create a bagged ensemble with fitensemble).
This example uses a bagged ensemble so it can use all three methods of evaluating ensemble quality.
Generate an artificial dataset with 20 predictors. Each entry is a random number from 0 to 1. The initial classification is if and otherwise.
rng(1,'twister') % for reproducibility X = rand(2000,20); Y = sum(X(:,1:5),2) > 2.5;
In addition, to add noise to the results, randomly switch 10% of the classifications:
idx = randsample(2000,200); Y(idx) = ~Y(idx);
Create independent training and test sets of data. Use 70% of the data for a training set by calling cvpartition using the holdout option:
cvpart = cvpartition(Y,'holdout',0.3);
Xtrain = X(training(cvpart),:);
Ytrain = Y(training(cvpart),:);
Xtest = X(test(cvpart),:);
Ytest = Y(test(cvpart),:);
Create a bagged classification ensemble of 200 trees from the training data:
bag = fitensemble(Xtrain,Ytrain,'Bag',200,'Tree',... 'Type','Classification')
bag = classreg.learning.classif.ClassificationBaggedEnsemble PredictorNames: {1x20 cell} ResponseName: 'Y' ClassNames: [0 1] ScoreTransform: 'none' NumObservations: 1400 NumTrained: 200 Method: 'Bag' LearnerNames: {'Tree'} ReasonForTermination: 'Terminated normally after completing the reques...' FitInfo: [] FitInfoDescription: 'None' FResample: 1 Replace: 1 UseObsForLearner: [1400x200 logical]
Plot the loss (misclassification) of the test data as a function of the number of trained trees in the ensemble:
figure; plot(loss(bag,Xtest,Ytest,'mode','cumulative')); xlabel('Number of trees'); ylabel('Test classification error');
Generate a five-fold cross-validated bagged ensemble:
cv = fitensemble(X,Y,'Bag',200,'Tree',... 'type','classification','kfold',5)
cv = classreg.learning.partition.ClassificationPartitionedEnsemble CrossValidatedModel: 'Bag' PredictorNames: {1x20 cell} ResponseName: 'Y' NumObservations: 2000 KFold: 5 Partition: [1x1 cvpartition] NumTrainedPerFold: [200 200 200 200 200] ClassNames: [0 1] ScoreTransform: 'none'
Examine the cross-validation loss as a function of the number of trees in the ensemble:
figure; plot(loss(bag,Xtest,Ytest,'mode','cumulative')); hold on; plot(kfoldLoss(cv,'mode','cumulative'),'r.'); hold off; xlabel('Number of trees'); ylabel('Classification error'); legend('Test','Cross-validation','Location','NE');
Cross validating gives comparable estimates to those of the independent set.
Generate the loss curve for out-of-bag estimates, and plot it along with the other curves:
figure; plot(loss(bag,Xtest,Ytest,'mode','cumulative')); hold on; plot(kfoldLoss(cv,'mode','cumulative'),'r.'); plot(oobLoss(bag,'mode','cumulative'),'k--'); hold off; xlabel('Number of trees'); ylabel('Classification error'); legend('Test','Cross-validation','Out of bag','Location','NE');
The out-of-bag estimates are again comparable to those of the other methods.
This example shows how to classify when one class has many more observations than another. Try the RUSBoost algorithm first, because it is designed to handle this case.
This example uses the "Cover type" data from the UCI machine learning archive, described in http://archive.ics.uci.edu/ml/datasets/Covertype. The data classifies types of forest (ground cover), based on predictors such as elevation, soil type, and distance to water. The data has over 500,000 observations and over 50 predictors, so training and using a classifier is time consuming.
Blackard and Dean [1] describe a neural net classification of this data. They quote a 70.6% classification accuracy. RUSBoost obtains over 76% classification accuracy; see steps 6 and 7.
Step 1. Obtain the data.
urlwrite('http://archive.ics.uci.edu/ml/machine-learning-databases/covtype/covtype.data.gz','forestcover.gz');
Then, extract the data from the forestcover.gz file. The data is in the covtype.data file.
Step 2. Import the data and prepare it for classification.
Import the data into your workspace. Extract the last data column into a variable named Y.
load covtype.data
Y = covtype(:,end);
covtype(:,end) = [];
Step 3. Examine the response data.
tabulate(Y)
Value Count Percent 1 211840 36.46% 2 283301 48.76% 3 35754 6.15% 4 2747 0.47% 5 9493 1.63% 6 17367 2.99% 7 20510 3.53%
There are hundreds of thousands of data points. Those of class 4 are less than 0.5% of the total. This imbalance indicates that RUSBoost is an appropriate algorithm.
Step 4. Partition the data for quality assessment.
Use half the data to fit a classifier, and half to examine the quality of the resulting classifier.
part = cvpartition(Y,'holdout',0.5); istrain = training(part); % data for fitting istest = test(part); % data for quality assessment tabulate(Y(istrain))
Value Count Percent 1 105920 36.46% 2 141651 48.76% 3 17877 6.15% 4 1374 0.47% 5 4746 1.63% 6 8683 2.99% 7 10255 3.53%
Step 5. Create the ensemble.
Use deep trees for higher ensemble accuracy. To do so, set the trees to have minimal leaf size of 5. Set LearnRate to 0.1 in order to achieve higher accuracy as well. The data is large, and, with deep trees, creating the ensemble is time consuming.
t = templateTree('minleaf',5); tic rusTree = fitensemble(covtype(istrain,:),Y(istrain),'RUSBoost',1000,t,... 'LearnRate',0.1,'nprint',100); toc
Training RUSBoost... Grown weak learners: 100 Grown weak learners: 200 Grown weak learners: 300 Grown weak learners: 400 Grown weak learners: 500 Grown weak learners: 600 Grown weak learners: 700 Grown weak learners: 800 Grown weak learners: 900 Grown weak learners: 1000 Elapsed time is 918.258401 seconds.
Step 6. Inspect the classification error.
Plot the classification error against the number of members in the ensemble.
figure; tic plot(loss(rusTree,covtype(istest,:),Y(istest),'mode','cumulative')); toc grid on; xlabel('Number of trees'); ylabel('Test classification error');
Elapsed time is 775.646935 seconds.
The ensemble achieves a classification error of under 24% using 150 or more trees. It achieves the lowest error for 400 or more trees.
Examine the confusion matrix for each class as a percentage of the true class.
tic Yfit = predict(rusTree,covtype(istest,:)); toc tab = tabulate(Y(istest)); bsxfun(@rdivide,confusionmat(Y(istest),Yfit),tab(:,2))*100
Elapsed time is 427.293168 seconds. ans = Columns 1 through 6 83.3771 7.4056 0.0736 0 1.7051 0.2681 18.3156 66.4652 2.1193 0.0162 9.3435 2.8239 0 0.0839 90.8038 2.3885 0.6545 6.0693 0 0 2.4763 95.8485 0 1.6752 0 0.2739 0.6530 0 98.6518 0.4213 0 0.1036 3.8346 1.1400 0.4030 94.5187 0.2340 0 0 0 0.0195 0 Column 7 7.1705 0.9163 0 0 0 0 99.7465
All classes except class 2 have over 80% classification accuracy, and classes 3 through 7 have over 90% accuracy. But class 2 makes up close to half the data, so the overall accuracy is not that high.
Step 7. Compact the ensemble.
The ensemble is large. Remove the data using the compact method.
cmpctRus = compact(rusTree); sz(1) = whos('rusTree'); sz(2) = whos('cmpctRus'); [sz(1).bytes sz(2).bytes]
ans = 1.0e+09 * 1.6947 0.9790
The compacted ensemble is about half the size of the original.
Remove half the trees from cmpctRus. This action is likely to have minimal effect on the predictive performance, based on the observation that 400 out of 1000 trees give nearly optimal accuracy.
cmpctRus = removeLearners(cmpctRus,[500:1000]);
sz(3) = whos('cmpctRus');
sz(3).bytes
ans = 475495669
The reduced compact ensemble takes about a quarter the memory of the full ensemble. Its overall loss rate is under 24%:
L = loss(cmpctRus,covtype(istest,:),Y(istest))
L = 0.2326
The predictive accuracy on new data might differ, because the ensemble accuracy might be biased. The bias arises because the same data used for assessing the ensemble was used for reducing the ensemble size. To obtain an unbiased estimate of requisite ensemble size, you should use cross validation. However, that procedure is time consuming.
In many real-world applications, you might prefer to treat classes in your data asymmetrically. For example, you might have data with many more observations of one class than of any other. Or you might work on a problem in which misclassifying observations of one class has more severe consequences than misclassifying observations of another class. In such situations, you can use two optional parameters for fitensemble: prior and cost.
By using prior, you set prior class probabilities (that is, class probabilities used for training). Use this option if some classes are under- or overrepresented in your training set. For example, you might obtain your training data by simulation. Because simulating class A is more expensive than class B, you opt to generate fewer observations of class A and more observations of class B. You expect, however, that class A and class B are mixed in a different proportion in the real world. In this case, set prior probabilities for class A and B approximately to the values you expect to observe in the real world. fitensemble normalizes prior probabilities to make them add up to 1; multiplying all prior probabilities by the same positive factor does not affect the result of classification.
If classes are adequately represented in the training data but you want to treat them asymmetrically, use the cost parameter. Suppose you want to classify benign and malignant tumors in cancer patients. Failure to identify a malignant tumor (false negative) has far more severe consequences than misidentifying benign as malignant (false positive). You should assign high cost to misidentifying malignant as benign and low cost to misidentifying benign as malignant.
You must pass misclassification costs as a square matrix with nonnegative elements. Element C(i,j) of this matrix is the cost of classifying an observation into class j if the true class is i. The diagonal elements C(i,i) of the cost matrix must be 0. For the previous example, you can choose malignant tumor to be class 1 and benign tumor to be class 2. Then you can set the cost matrix to
$$\left[\begin{array}{cc}0& c\\ 1& 0\end{array}\right]$$
where c > 1 is the cost of misidentifying a malignant tumor as benign. Costs are relative—multiplying all costs by the same positive factor does not affect the result of classification.
If you have only two classes, fitensemble adjusts their prior probabilities using ${\tilde{P}}_{i}={C}_{ij}{P}_{i}$for class i = 1,2 and j ≠ i. P_{i} are prior probabilities either passed into fitensemble or computed from class frequencies in the training data, and $${\tilde{P}}_{i}$$ are adjusted prior probabilities. Then fitensemble uses the default cost matrix
$$\left[\begin{array}{cc}0& 1\\ 1& 0\end{array}\right]$$
and these adjusted probabilities for training its weak learners. Manipulating the cost matrix is thus equivalent to manipulating the prior probabilities.
If you have three or more classes, fitensemble also converts input costs into adjusted prior probabilities. This conversion is more complex. First, fitensemble attempts to solve a matrix equation described in Zhou and Liu [20]. If it fails to find a solution, fitensemble applies the "average cost" adjustment described in Breiman et al. [6]. For more information, see Zadrozny, Langford, and Abe [19].
This example uses data on patients with hepatitis to see if they live or die as a result of the disease. The data set is described at http://archive.ics.uci.edu/ml/datasets/Hepatitis.
Read the hepatitis data set from the UCI repository as a character array. Then convert the result to a cell array of strings using textscan. Specify a cell array of strings containing the variable names.
hepatitis = textscan(urlread(['http://archive.ics.uci.edu/ml/' ... 'machine-learning-databases/hepatitis/hepatitis.data']),... '%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f','TreatAsEmpty','?',... 'Delimiter',','); size(hepatitis) VarNames = {'dieOrLive' 'age' 'sex' 'steroid' 'antivirals' 'fatigue' ... 'malaise' 'anorexia' 'liverBig' 'liverFirm' 'spleen' ... 'spiders' 'ascites' 'varices' 'bilirubin' 'alkPhosphate' 'sgot' ... 'albumin' 'protime' 'histology'};
ans = 1 20
hepatitis is a 1-by-20 cell array of strings. The cells correspond to the response (liveOrDie) and 19 heterogeneous predictors.
Specify a numeric matrix containing the predictors and a cell vector containing the strings 'Die' and 'Live', which are response categories. The response contains two values: 1 indicates that a patient died, and 2 indicates that a patient lived. Specify a cell vector of strings for the response using the response categories. The first variable in hepatitis contains the response.
X = cell2mat(hepatitis(2:end)); ClassNames = {'Die' 'Live'}; Y = ClassNames(hepatitis{:,1});
X is a numeric matrix containing the 19 predictors. Y is a cell array of strings containing the response.
Inspect the data for missing values.
figure; barh(sum(isnan(X),1)/size(X,1)); set(gca,'YTick',1:numel(VarNames)-1,'YTickLabel',VarNames(2:end)); ylabel('Predictor'); xlabel('Fraction of missing values');
Most predictors have missing values, and one has nearly 45% of the missing values. Therefore, use decision trees with surrogate splits for better accuracy. Because the data set is small, training time with surrogate splits should be tolerable.
Create a classification tree template that uses surrogate splits.
rng(0,'twister') % for reproducibility t = templateTree('surrogate','all');
Examine the data or the description of the data to see which predictors are categorical.
X(1:5,:)
ans = Columns 1 through 6 30.0000 2.0000 1.0000 2.0000 2.0000 2.0000 50.0000 1.0000 1.0000 2.0000 1.0000 2.0000 78.0000 1.0000 2.0000 2.0000 1.0000 2.0000 31.0000 1.0000 NaN 1.0000 2.0000 2.0000 34.0000 1.0000 2.0000 2.0000 2.0000 2.0000 Columns 7 through 12 2.0000 1.0000 2.0000 2.0000 2.0000 2.0000 2.0000 1.0000 2.0000 2.0000 2.0000 2.0000 2.0000 2.0000 2.0000 2.0000 2.0000 2.0000 2.0000 2.0000 2.0000 2.0000 2.0000 2.0000 2.0000 2.0000 2.0000 2.0000 2.0000 2.0000 Columns 13 through 18 2.0000 1.0000 85.0000 18.0000 4.0000 NaN 2.0000 0.9000 135.0000 42.0000 3.5000 NaN 2.0000 0.7000 96.0000 32.0000 4.0000 NaN 2.0000 0.7000 46.0000 52.0000 4.0000 80.0000 2.0000 1.0000 NaN 200.0000 4.0000 NaN Column 19 1.0000 1.0000 1.0000 1.0000 1.0000
It appears that predictors 2 through 13 are categorical, as well as predictor 19. You can confirm this inference using the data set description at http://archive.ics.uci.edu/ml/datasets/Hepatitis.
List the categorical variables.
catIdx = [2:13,19];
Create a cross-validated ensemble using 150 learners and the GentleBoost algorithm.
Ensemble = fitensemble(X,Y,'GentleBoost',150,t,... 'PredictorNames',VarNames(2:end),'LearnRate',0.1,... 'CategoricalPredictors',catIdx,'KFold',5); figure; plot(kfoldLoss(Ensemble,'Mode','cumulative','LossFun','exponential')); xlabel('Number of trees'); ylabel('Cross-validated exponential loss');
Inspect the confusion matrix to see which patients the ensemble predicts correctly.
[yFit,sFit] = kfoldPredict(Ensemble);
confusionmat(Y,yFit,'Order',ClassNames)
ans = 18 14 11 112
Of the 123 patient who live, the ensemble predicts correctly that 112 will live. But for the 32 patients who die of hepatitis, the ensemble only predicts correctly that about half will die of hepatitis.
There are two types of error in the predictions of the ensemble:
Predicting that the patient lives, but the patient dies
Predicting that the patient dies, but the patient lives
Suppose you believe that the first error is five times worse than the second. Create a new classification cost matrix that reflects this belief.
cost.ClassNames = ClassNames; cost.ClassificationCosts = [0 5; 1 0];
Create a new cross-validated ensemble using cost as the misclassification cost, and inspect the resulting confusion matrix.
EnsembleCost = fitensemble(X,Y,'GentleBoost',150,t,... 'PredictorNames',VarNames(2:end),'LearnRate',0.1,... 'CategoricalPredictors',catIdx,'KFold',5,... 'Cost',cost); [yFitCost,sFitCost] = kfoldPredict(EnsembleCost); confusionmat(Y,yFitCost,'Order',ClassNames)
ans = 19 13 8 115
As expected, the new ensemble does a better job classifying the who die. Somewhat surprisingly, the new ensemble also does a better job classifying the who live, though the result is not statistically significantly better. The results of the cross validation are random, so this result is simply a statistical fluctuation. The result seems to indicate that the classification of who live is not very sensitive to the cost.
Generally, you cannot use classification with more than 31 levels in any categorical predictor. However, two boosting algorithms can classify data with many categorical predictor levels and binary responses: LogitBoost and GentleBoost. For details, see LogitBoost and GentleBoost.
This example uses demographic data from the U.S. Census Bureau, available at http://archive.ics.uci.edu/ml/machine-learning-databases/adult/. The objective of the researchers who posted the data is predicting whether an individual makes over $50,000 a year, based on a set of characteristics. You can see details of the data, including predictor names, in the adult.names file at the site.
Load the 'adult.data' file from the UCI Machine Learning Repository. Specify a cell array of strings containing the variable names.
adult = urlread(['http://archive.ics.uci.edu/ml/'... 'machine-learning-databases/adult/adult.data']); VarNames = {'age' 'workclass' 'fnlwgt' 'education' 'educationNum'... 'maritalStatus' 'occupation' 'relationship' 'race'... 'sex' 'capitalGain' 'capitalLoss'... 'hoursPerWeek' 'nativeCountry' 'income'};
adult.data represents missing data as '?'. Replace instances of missing data with an empty string. Use textscan to put the data into a cell array of strings.
adult = strrep(adult,'?',''); adult = textscan(adult,'%f%s%f%s%f%s%s%s%s%s%f%f%f%s%s',... 'Delimiter',',','TreatAsEmpty','');
The name-value pair argument TreatAsEmpty converts all observations corresponding to numeric variables to NaN if the observation is an empty string.
Since the variables are heterogeneous, put the set into a MATLAB^{®} table.
adult = table(adult{:},'VariableNames',VarNames);
Some categorical variables have many levels. Plot the number of levels of each categorical predictor.
cat = varfun(@iscellstr,adult(:,1:end - 1),... 'OutputFormat','uniform'); % Logical flag for categorical variables catVars = find(cat); % Indices of categorical variables countCats = @(var)numel(categories(nominal(var))); numCat = varfun(@(var)countCats(var),adult(:,catVars),... 'OutputFormat','uniform'); figure; barh(numCat); set(gca,'YTickLabel',VarNames(catVars)) ylabel('Predictor'); xlabel('Number of categories');
The anonymous function countCats converts a predictor to a nominal array, then counts the unique, nonempty categories of the predictor. Predictor 14 ('nativeCountry') has more than 40 categorical levels. For binary classification, fitctree uses a computational shortcut to find an optimal split for categorical predictors with many categories. For classification with more than two classes, you can choose a heuristic algorithm to find a good split. For details, see Splitting Categorical Predictors.
Specify the predictor matrix using classreg.regr.modelutils.predictormatrix and the response vector.
X = classreg.regr.modelutils.predictormatrix(adult,'ResponseVar',... size(adult,2)); Y = nominal(adult.income);
X is a numeric matrix; predictormatrix converts all categorical variables into group indices. The name-value pair argument ResponseVar indicates that the last column is the response variable, and excludes it from the predictor matrix. Y is a nominal, categorical array.
Train classification ensembles using both LogitBoost and GentleBoost.
rng(1); % For reproducibility LBEnsemble = fitensemble(X,Y,'LogitBoost',300,'Tree',... 'CategoricalPredictors',cat,'PredictorNames',VarNames(1:end-1),... 'ResponseName','income'); GBEnsemble = fitensemble(X,Y,'GentleBoost',300,'Tree',... 'CategoricalPredictors',cat,'PredictorNames',VarNames(1:end-1),... 'ResponseName','income');
Examine the resubstitution error for both ensembles.
figure; plot(resubLoss(LBEnsemble,'Mode','cumulative')); hold on plot(resubLoss(GBEnsemble,'Mode','cumulative'),'r--'); hold off xlabel('Number of trees'); ylabel('Resubstitution error'); legend('LogitBoost','GentleBoost','Location','NE');
The GentleBoost algorithm has a slightly smaller resubstitution error.
Estimate the generalization error for both algorithms by cross validation.
CVLBEnsemble = crossval(LBEnsemble,'KFold',5); CVGBEnsemble = crossval(GBEnsemble,'KFold',5); figure; plot(kfoldLoss(CVLBEnsemble,'Mode','cumulative')); hold on plot(kfoldLoss(CVGBEnsemble,'Mode','cumulative'),'r--'); hold off xlabel('Number of trees'); ylabel('Cross-validated error'); legend('LogitBoost','GentleBoost','Location','NE');
The cross-validated loss is nearly the same as the resubstitution error.
When you have missing data, trees and ensembles of trees give better predictions when they include surrogate splits. Furthermore, estimates of predictor importance are often different with surrogate splits. Eliminating unimportant predictors can save time and memory for predictions, and can make predictions easier to understand.
This example shows the effects of surrogate splits for predictions for data containing missing entries in the test set.
Load sample data. Partition it into a training and test set.
load ionosphere; rng(10) % for reproducibility cv = cvpartition(Y,'Holdout',0.3); Xtrain = X(training(cv),:); Ytrain = Y(training(cv)); Xtest = X(test(cv),:); Ytest = Y(test(cv));
Bag decision trees with and without surrogate splits.
b = fitensemble(Xtrain,Ytrain,'Bag',50,'Tree',... 'Type','Class'); templS = templateTree('Surrogate','On'); bs = fitensemble(Xtrain,Ytrain,'Bag',50,templS,... 'Type','Class');
Suppose half of the values in the test set are missing.
Xtest(rand(size(Xtest))>0.5) = NaN;
Test accuracy with and without surrogate splits.
figure; plot(loss(b,Xtest,Ytest,'Mode','Cumulative')); hold on; plot(loss(bs,Xtest,Ytest,'Mode','Cumulative'),'r--'); legend('Regular trees','Trees with surrogate splits'); xlabel('Number of trees'); ylabel('Test classification error');
Check the statistical significance of the difference in results with the McNemar test. Convert the labels to a nominal data type to make it easier to check for equality.
Yfit = nominal(predict(b,Xtest)); YfitS = nominal(predict(bs,Xtest)); N10 = sum(Yfit==nominal(Ytest) & YfitS~=nominal(Ytest)); N01 = sum(Yfit~=nominal(Ytest) & YfitS==nominal(Ytest)); mcnemar = (abs(N10-N01) - 1)^2/(N10+N01); pval = 1 - chi2cdf(mcnemar,1)
pval = 1.7683e-04
The extremely low p-value indicates that the ensemble with surrogate splits is better in a statistically significant manner.
This example shows how to obtain the benefits of the LPBoost and TotalBoost algorithms. These algorithms share two beneficial characteristics:
They are self-terminating, so you don't have to guess how many members to include.
They produce ensembles with some very small weights, so you can safely remove ensemble members.
Note that the algorithms in this example require an Optimization Toolbox™ license.
Load the ionosphere data set.
load ionosphere
Create the classification ensembles
Create ensembles for classifying the ionosphere data using the LPBoost, TotalBoost, and, for comparison, AdaBoostM1 algorithms. It is hard to know how many members to include in an ensemble. For LPBoost and TotalBoost, try using 500. For comparison, also use 500 for AdaBoostM1.
rng default % For reproducibility T = 500; adaStump = fitensemble(X,Y,'AdaBoostM1',T,'Tree'); totalStump = fitensemble(X,Y,'TotalBoost',T,'Tree'); lpStump = fitensemble(X,Y,'LPBoost',T,'Tree'); figure; plot(resubLoss(adaStump,'Mode','Cumulative')); hold on plot(resubLoss(totalStump,'Mode','Cumulative'),'r'); plot(resubLoss(lpStump,'Mode','Cumulative'),'g'); hold off xlabel('Number of stumps'); ylabel('Training error'); legend('AdaBoost','TotalBoost','LPBoost','Location','NE');
All three algorithms achieve perfect prediction on the training data after a while.
Examine the number of members in all three ensembles.
[adaStump.NTrained totalStump.NTrained lpStump.NTrained]
ans = 500 52 68
AdaBoostM1 trained all 500 members. The other two algorithms stopped training early.
Cross validate the ensembles to better determine ensemble accuracy.
cvlp = crossval(lpStump,'KFold',5); cvtotal = crossval(totalStump,'KFold',5); cvada = crossval(adaStump,'KFold',5); figure; plot(kfoldLoss(cvada,'Mode','Cumulative')); hold on plot(kfoldLoss(cvtotal,'Mode','Cumulative'),'r'); plot(kfoldLoss(cvlp,'Mode','Cumulative'),'g'); hold off xlabel('Ensemble size'); ylabel('Cross-validated error'); legend('AdaBoost','TotalBoost','LPBoost','Location','NE');
It appears that each boosting algorithms achieves 10% or lower loss with 50 ensemble members, and AdaBoostM1 achieves near 6% error with 150 or more ensemble members.
Compact and remove ensemble members
To reduce the ensemble sizes, compact them, and then use removeLearners. The question is, how many learners should you remove? The cross-validated loss curves give you one measure. For another, examine the learner weights for LPBoost and TotalBoost after compacting.
cada = compact(adaStump); clp = compact(lpStump); ctotal = compact(totalStump); figure subplot(2,1,1) plot(clp.TrainedWeights) title('LPBoost weights') subplot(2,1,2) plot(ctotal.TrainedWeights) title('TotalBoost weights')
Both LPBoost and TotalBoost show clear points where the ensemble member weights become negligible.
Remove the unimportant ensemble members.
cada = removeLearners(cada,150:cada.NTrained); clp = removeLearners(clp,60:clp.NTrained); ctotal = removeLearners(ctotal,40:ctotal.NTrained);
Check that removing these learners does not affect ensemble accuracy on the training data.
[loss(cada,X,Y) loss(clp,X,Y) loss(ctotal,X,Y)]
ans = 0 0 0
Check the resulting compact ensemble sizes.
s(1) = whos('cada'); s(2) = whos('clp'); s(3) = whos('ctotal'); s.bytes
ans = 543067 ans = 216603 ans = 144063
The sizes of the compact ensembles are approximately proportional to the number of members in each.
Regularization is a process of choosing fewer weak learners for an ensemble in a way that does not diminish predictive performance. Currently you can regularize regression ensembles. (You can also regularize a discriminant analysis classifier in a non-ensemble context; see Regularize a Discriminant Analysis Classifier.)
The regularize method finds an optimal set of learner weights α_{t} that minimize
$\sum}_{n=1}^{N}{w}_{n}g\left(\left({\displaystyle \sum}_{t=1}^{T}{\alpha}_{t}{h}_{t}\left({x}_{n}\right)\right),{y}_{n}\right)+\lambda {\displaystyle \sum}_{t=1}^{T}\left|{\alpha}_{t}\right|.$
Here
λ ≥ 0 is a parameter you provide, called the lasso parameter.
h_{t} is a weak learner in the ensemble trained on N observations with predictors x_{n}, responses y_{n}, and weights w_{n}.
g(f,y) = (f – y)^{2} is the squared error.
The ensemble is regularized on the same (x_{n},y_{n},w_{n}) data used for training, so
$\sum}_{n=1}^{N}{w}_{n}g\left(\left({\displaystyle \sum}_{t=1}^{T}{\alpha}_{t}{h}_{t}\left({x}_{n}\right)\right),{y}_{n}\right)$
is the ensemble resubstitution error. The error is measured by mean squared error (MSE).
If you use λ = 0, regularize finds the weak learner weights by minimizing the resubstitution MSE. Ensembles tend to overtrain. In other words, the resubstitution error is typically smaller than the true generalization error. By making the resubstitution error even smaller, you are likely to make the ensemble accuracy worse instead of improving it. On the other hand, positive values of λ push the magnitude of the α_{t} coefficients to 0. This often improves the generalization error. Of course, if you choose λ too large, all the optimal coefficients are 0, and the ensemble does not have any accuracy. Usually you can find an optimal range for λ in which the accuracy of the regularized ensemble is better or comparable to that of the full ensemble without regularization.
A nice feature of lasso regularization is its ability to drive the optimized coefficients precisely to 0. If a learner's weight α_{t} is 0, this learner can be excluded from the regularized ensemble. In the end, you get an ensemble with improved accuracy and fewer learners.
This example uses data for predicting the insurance risk of a car based on its many attributes.
Load the imports-85 data into the MATLAB workspace:
load imports-85;
Look at a description of the data to find the categorical variables and predictor names:
Description
Description = 1985 Auto Imports Database from the UCI repository http://archive.ics.uci.edu/ml/machine-learning-databases/autos/imports-85.names Variables have been reordered to place variables with numeric values (referred to as "continuous" on the UCI site) to the left and categorical values to the right. Specifically, variables 1:16 are: symboling, normalized-losses, wheel-base, length, width, height, curb-weight, engine-size, bore, stroke, compression-ratio, horsepower, peak-rpm, city-mpg, highway-mpg, and price. Variables 17:26 are: make, fuel-type, aspiration, num-of-doors, body-style, drive-wheels, engine-location, engine-type, num-of-cylinders, and fuel-system.
The objective of this process is to predict the "symboling," the first variable in the data, from the other predictors. "symboling" is an integer from -3 (good insurance risk) to 3 (poor insurance risk). You could use a classification ensemble to predict this risk instead of a regression ensemble. When you have a choice between regression and classification, you should try regression first.
Prepare the data for ensemble fitting:
Y = X(:,1); X(:,1) = []; VarNames = {'normalized-losses' 'wheel-base' 'length' 'width' 'height' ... 'curb-weight' 'engine-size' 'bore' 'stroke' 'compression-ratio' ... 'horsepower' 'peak-rpm' 'city-mpg' 'highway-mpg' 'price' 'make' ... 'fuel-type' 'aspiration' 'num-of-doors' 'body-style' 'drive-wheels' ... 'engine-location' 'engine-type' 'num-of-cylinders' 'fuel-system'}; catidx = 16:25; % indices of categorical predictors
Create a regression ensemble from the data using 300 default trees:
ls = fitensemble(X,Y,'LSBoost',300,'Tree','LearnRate',0.1,... 'PredictorNames',VarNames,'ResponseName','Symboling',... 'CategoricalPredictors',catidx)
ls = classreg.learning.regr.RegressionEnsemble PredictorNames: {1x25 cell} ResponseName: 'Symboling' ResponseTransform: 'none' NumObservations: 205 NumTrained: 300 Method: 'LSBoost' LearnerNames: {'Tree'} ReasonForTermination: 'Terminated normally after completing the reques...' FitInfo: [300x1 double] FitInfoDescription: {2x1 cell} Regularization: []
The final line, Regularization, is empty ([]). To regularize the ensemble, you have to use the regularize method.
cv = crossval(ls,'KFold',5); figure; plot(kfoldLoss(cv,'Mode','Cumulative')); xlabel('Number of trees'); ylabel('Cross-validated MSE'); ylim([0.2,2])
It appears you might obtain satisfactory performance from a smaller ensemble, perhaps one containing from 50 to 100 trees.
6.Call the regularize method to try to find trees that you can remove from the ensemble. By default, regularize examines 10 values of the lasso (Lambda) parameter spaced exponentially.
ls = regularize(ls)
ls = classreg.learning.regr.RegressionEnsemble PredictorNames: {1x25 cell} ResponseName: 'Symboling' ResponseTransform: 'none' NumObservations: 205 NumTrained: 300 Method: 'LSBoost' LearnerNames: {'Tree'} ReasonForTermination: 'Terminated normally after completing the reques...' FitInfo: [300x1 double] FitInfoDescription: {2x1 cell} Regularization: [1x1 struct]
The Regularization property is no longer empty.
Plot the resubstitution mean-squared error (MSE) and number of learners with nonzero weights against the lasso parameter. Separately plot the value at Lambda = 0. Use a logarithmic scale because the values of Lambda are exponentially spaced.
figure; semilogx(ls.Regularization.Lambda,ls.Regularization.ResubstitutionMSE); line([1e-3 1e-3],[ls.Regularization.ResubstitutionMSE(1) ... ls.Regularization.ResubstitutionMSE(1)],... 'Marker','x','Markersize',12,'Color','b'); r0 = resubLoss(ls); line([ls.Regularization.Lambda(2) ls.Regularization.Lambda(end)],... [r0 r0],'Color','r','LineStyle','--'); xlabel('Lambda'); ylabel('Resubstitution MSE'); annotation('textbox',[0.5 0.22 0.5 0.05],'String','unregularized ensemble',... 'Color','r','FontSize',14,'LineStyle','none'); figure; loglog(ls.Regularization.Lambda,sum(ls.Regularization.TrainedWeights>0,1)); line([1e-3 1e-3],... [sum(ls.Regularization.TrainedWeights(:,1)>0) ... sum(ls.Regularization.TrainedWeights(:,1)>0)],... 'marker','x','markersize',12,'color','b'); line([ls.Regularization.Lambda(2) ls.Regularization.Lambda(end)],... [ls.NTrained ls.NTrained],... 'color','r','LineStyle','--'); xlabel('Lambda'); ylabel('Number of learners'); annotation('textbox',[0.3 0.8 0.5 0.05],'String','unregularized ensemble',... 'color','r','FontSize',14,'LineStyle','none');
The resubstitution MSE values are likely to be overly optimistic. To obtain more reliable estimates of the error associated with various values of Lambda, cross validate the ensemble using cvshrink. Plot the resulting cross-validation loss (MSE) and number of learners against Lambda.
rng(0,'Twister') % for reproducibility [mse,nlearn] = cvshrink(ls,'Lambda',ls.Regularization.Lambda,'KFold',5); figure; semilogx(ls.Regularization.Lambda,ls.Regularization.ResubstitutionMSE); hold; semilogx(ls.Regularization.Lambda,mse,'r--'); hold off; xlabel('Lambda'); ylabel('Mean squared error'); legend('resubstitution','cross-validation','Location','NW'); line([1e-3 1e-3],[ls.Regularization.ResubstitutionMSE(1) ... ls.Regularization.ResubstitutionMSE(1)],... 'Marker','x','Markersize',12,'Color','b'); line([1e-3 1e-3],[mse(1) mse(1)],'Marker','o',... 'Markersize',12,'Color','r','LineStyle','--'); figure; loglog(ls.Regularization.Lambda,sum(ls.Regularization.TrainedWeights>0,1)); hold; loglog(ls.Regularization.Lambda,nlearn,'r--'); hold off; xlabel('Lambda'); ylabel('Number of learners'); legend('resubstitution','cross-validation','Location','NE'); line([1e-3 1e-3],... [sum(ls.Regularization.TrainedWeights(:,1)>0) ... sum(ls.Regularization.TrainedWeights(:,1)>0)],... 'Marker','x','Markersize',12,'Color','b'); line([1e-3 1e-3],[nlearn(1) nlearn(1)],'marker','o',... 'Markersize',12,'Color','r','LineStyle','--');
Warning: Some folds do not have any trained weak learners. Warning: Some folds do not have any trained weak learners. Warning: Some folds do not have any trained weak learners. Current plot held Current plot held
Examining the cross-validated error shows that the cross-validation MSE is almost flat for Lambda up to a bit over 1e-2.
Examine ls.Regularization.Lambda to find the highest value that gives MSE in the flat region (up to a bit over 1e-2):
jj = 1:length(ls.Regularization.Lambda); [jj;ls.Regularization.Lambda]
ans = Columns 1 through 7 1.0000 2.0000 3.0000 4.0000 5.0000 6.0000 7.0000 0 0.0014 0.0033 0.0077 0.0183 0.0435 0.1031 Columns 8 through 10 8.0000 9.0000 10.0000 0.2446 0.5800 1.3754
Element 5 of ls.Regularization.Lambda has value 0.0183, the largest in the flat range.
Reduce the ensemble size using the shrink method. shrink returns a compact ensemble with no training data. The generalization error for the new compact ensemble was already estimated by cross validation in mse(5).
cmp = shrink(ls,'weightcolumn',5)
cmp = classreg.learning.regr.CompactRegressionEnsemble PredictorNames: {1x25 cell} ResponseName: 'Symboling' ResponseTransform: 'none' NumTrained: 15
There are only 15 trees in the new ensemble, notably reduced from the 300 in ls.
Compare the sizes of the ensembles:
sz(1) = whos('cmp'); sz(2) = whos('ls'); [sz(1).bytes sz(2).bytes]
ans = 84544 1775691
The reduced ensemble is about 5% the size of the original.
Compare the MSE of the reduced ensemble to that of the original ensemble:
figure; plot(kfoldLoss(cv,'mode','cumulative')); hold on plot(cmp.NTrained,mse(5),'ro','MarkerSize',12); xlabel('Number of trees'); ylabel('Cross-validated MSE'); legend('unregularized ensemble','regularized ensemble',... 'Location','NE'); hold off
The reduced ensemble gives low loss while using many fewer trees.
The RobustBoost algorithm can make good classification predictions even when the training data has noise. However, the default RobustBoost parameters can produce an ensemble that does not predict well. This example shows one way of tuning the parameters for better predictive accuracy.
Note that RobustBoost requires an Optimization Toolbox™ license.
Generate data with label noise. This example has twenty uniform random numbers per observation, and classifies the observation as 1 if the sum of the first five numbers exceeds 2.5 (so is larger than average), and 0 otherwise:
rng(0,'twister') % for reproducibility Xtrain = rand(2000,20); Ytrain = sum(Xtrain(:,1:5),2) > 2.5;
To add noise, randomly switch 10% of the classifications:
idx = randsample(2000,200); Ytrain(idx) = ~Ytrain(idx);
Create an ensemble with AdaBoostM1 for comparison purposes:
ada = fitensemble(Xtrain,Ytrain,'AdaBoostM1',... 300,'Tree','LearnRate',0.1);
Create an ensemble with RobustBoost. Because the data has 10% incorrect classification, perhaps an error goal of 15% is reasonable.
rb1 = fitensemble(Xtrain,Ytrain,'RobustBoost',300,... 'Tree','RobustErrorGoal',0.15,'RobustMaxMargin',1);
Note that if you set the error goal to a high enough value, then the software returns an error.
Create an ensemble with very optimistic error goal, 0.01:
rb2 = fitensemble(Xtrain,Ytrain,'RobustBoost',300,... 'Tree','RobustErrorGoal',0.01);
Compare the resubstitution error of the four ensembles:
figure plot(resubLoss(rb1,'Mode','Cumulative')); hold on plot(resubLoss(rb2,'Mode','Cumulative'),'r--'); plot(resubLoss(ada,'Mode','Cumulative'),'g.'); hold off; xlabel('Number of trees'); ylabel('Resubstitution error'); legend('ErrorGoal=0.15','ErrorGoal=0.01',... 'AdaBoostM1','Location','NE');
All the RobustBoost curves show lower resubstitution error than the AdaBoostM1 curve. The error goal of 0.15 curve shows the lowest resubstitution error over most of the range.
Xtest = rand(2000,20); Ytest = sum(Xtest(:,1:5),2) > 2.5; idx = randsample(2000,200); Ytest(idx) = ~Ytest(idx); figure; plot(loss(rb1,Xtest,Ytest,'Mode','Cumulative')); hold on plot(loss(rb2,Xtest,Ytest,'Mode','Cumulative'),'r--'); plot(loss(ada,Xtest,Ytest,'Mode','Cumulative'),'g.'); hold off; xlabel('Number of trees'); ylabel('Test error'); legend('ErrorGoal=0.15','ErrorGoal=0.01',... 'AdaBoostM1','Location','NE');
The error curve for error goal 0.15 is lowest (best) in the plotted range. AdaBoostM1 has higher error than the curve for error goal 0.15. The curve for the too-optimistic error goal 0.01 remains substantially higher (worse) than the other algorithms for most of the plotted range.
This example shows how to use a random subspace ensemble to increase the accuracy of classification. It also shows how to use cross validation to determine good parameters for both the weak learner template and the ensemble.
Load the ionosphere data. This data has 351 binary responses to 34 predictors.
load ionosphere;
[N,D] = size(X)
resp = unique(Y)
N = 351 D = 34 resp = 'b' 'g'
Choose the number of nearest neighbors
Find a good choice for k, the number of nearest neighbors in the classifier, by cross validation. Choose the number of neighbors approximately evenly spaced on a logarithmic scale.
rng(8000,'twister') % for reproducibility K = round(logspace(0,log10(N),10)); % number of neighbors cvloss = zeros(numel(K),1); for k=1:numel(K) knn = fitcknn(X,Y,... 'NumNeighbors',K(k),'CrossVal','On'); cvloss(k) = kfoldLoss(knn); end figure; % Plot the accuracy versus k semilogx(K,cvloss); xlabel('Number of nearest neighbors'); ylabel('10 fold classification error'); title('k-NN classification');
The lowest cross-validation error occurs for k = 2.
Create ensembles for 2-nearest neighbor classification with various numbers of dimensions, and examine the cross-validated loss of the resulting ensembles.
This step takes a long time. To keep track of the progress, print a message as each dimension finishes.
NPredToSample = round(linspace(1,D,10)); % linear spacing of dimensions cvloss = zeros(numel(NPredToSample),1); learner = templateKNN('NumNeighbors',2); for npred=1:numel(NPredToSample) subspace = fitensemble(X,Y,'Subspace',200,learner,... 'NPredToSample',NPredToSample(npred),'CrossVal','On'); cvloss(npred) = kfoldLoss(subspace); fprintf('Random Subspace %i done.\n',npred); end figure; % plot the accuracy versus dimension plot(NPredToSample,cvloss); xlabel('Number of predictors selected at random'); ylabel('10 fold classification error'); title('k-NN classification with Random Subspace');
Random Subspace 1 done. Random Subspace 2 done. Random Subspace 3 done. Random Subspace 4 done. Random Subspace 5 done. Random Subspace 6 done. Random Subspace 7 done. Random Subspace 8 done. Random Subspace 9 done. Random Subspace 10 done.
The ensemble with five predictors per learner has the lowest cross-validated error. The error rate, about 0.06, is about half the amount for the individual learners, which was over 0.10.
Find the smallest number of learners in the ensemble that still give good classification.
ens = fitensemble(X,Y,'Subspace',200,learner,... 'NPredToSample',5,'CrossVal','on'); figure; % Plot the accuracy versus number in ensemble plot(kfoldLoss(ens,'Mode','Cumulative')) xlabel('Number of learners in ensemble'); ylabel('10 fold classification error'); title('k-NN classification with Random Subspace');
There seems to be no advantage in an ensemble with more than 150 or so learners. It is not clear whether there is any advantage with more than 50 learners. It is possible that 25 learners gives good predictions.
Construct a final ensemble with 50 learners. Compact the ensemble and see if the compacted version saves an appreciable amount of memory.
ens = fitensemble(X,Y,'Subspace',50,learner,... 'NPredToSample',5); cens = compact(ens); s1 = whos('ens');s2 = whos('cens'); [s1.bytes s2.bytes] % si.bytes = size in bytes
ans = 1756230 1525678
The compact ensemble is about 10% smaller than the full ensemble. Both give the same predictions.
Regression of Insurance Risk Rating for Car Imports Using TreeBagger
Classifying Radar Returns for Ionosphere Data Using TreeBagger
TreeBagger ensembles have more functionality than those constructed with fitensemble; see TreeBagger Features Not in fitensemble. Also, some property and method names differ from their fitensemble counterparts. This section contains examples of workflow for regression and classification that use this extra TreeBagger functionality.
In this example, use a database of 1985 car imports with 205 observations, 25 predictors, and 1 response, which is insurance risk rating, or "symboling." The first 15 variables are numeric and the last 10 are categorical. The symboling index takes integer values from -3 to 3.
Load the data set and split it into predictor and response arrays.
load imports-85; Y = X(:,1); X = X(:,2:end); isCategorical = [zeros(15,1);ones(size(X,2)-15,1)]; % Categorical variable flag
Because bagging uses randomized data drawings, its exact outcome depends on the initial random seed. To reproduce the results in this example, use the random stream settings.
rng(1945,'twister')
For regression, the general rule is to the set leaf size to 5 and select one third of the input features for decision splits at random. In the following step, verify the optimal leaf size by comparing mean squared errors obtained by regression for various leaf sizes. oobError computes MSE versus the number of grown trees. You must set OOBPred to 'On' to obtain out-of-bag predictions later.
leaf = [5 10 20 50 100]; col = 'rbcmy'; figure for i=1:length(leaf) b = TreeBagger(50,X,Y,'Method','R','OOBPred','On',... 'CategoricalPredictors',find(isCategorical == 1),'MinLeaf',leaf(i)); plot(oobError(b),col(i)); hold on; end xlabel 'Number of Grown Trees'; ylabel 'Mean Squared Error' ; legend({'5' '10' '20' '50' '100'},'Location','NorthEast'); hold off;
The red curve (leaf size 5) yields the lowest MSE values.
In practical applications, you typically grow ensembles with hundreds of trees. For example, the previous code block uses 50 trees for faster processing. Now that you have estimated the optimal leaf size, grow a larger ensemble with 100 trees and use it to estimate feature importance.
b = TreeBagger(100,X,Y,'Method','R','OOBVarImp','On',... 'CategoricalPredictors',find(isCategorical == 1),... 'MinLeaf',5);
Inspect the error curve again to make sure nothing went wrong during training.
figure plot(oobError(b)); xlabel 'Number of Grown Trees'; ylabel 'Out-of-Bag Mean Squared Error';
Prediction ability should depend more on important features than unimportant features. You can use this idea to measure feature importance.
For each feature, permute the values of this feature across every observation in the data set and measure how much worse the MSE becomes after the permutation. You can repeat this for each feature.
Using the following code, plot the increase in MSE due to permuting out-of-bag observations across each input variable. The OOBPermutedVarDeltaError array stores the increase in MSE averaged over all trees in the ensemble and divided by the standard deviation taken over the trees, for each variable. The larger this value, the more important the variable. Imposing an arbitrary cutoff at 0.7, you can select the five most important features.
figure bar(b.OOBPermutedVarDeltaError); xlabel 'Feature Number' ; ylabel 'Out-of-Bag Feature Importance'; idxvar = find(b.OOBPermutedVarDeltaError>0.7) idxCategorical = find(isCategorical(idxvar)==1);
idxvar = 1 2 16 19
The OOBIndices property of TreeBagger tracks which observations are out of bag for what trees. Using this property, you can monitor the fraction of observations in the training data that are in bag for all trees. The curve starts at approximately 2/3, which is the fraction of unique observations selected by one bootstrap replica, and goes down to 0 at approximately 10 trees.
finbag = zeros(1,b.NTrees); for t=1:b.NTrees finbag(t) = sum(all(~b.OOBIndices(:,1:t),2)); end finbag = finbag / size(X,1); figure plot(finbag); xlabel 'Number of Grown Trees'; ylabel 'Fraction of In-Bag Observations';
Growing Trees on a Reduced Set of Features
Using just the five most powerful features, determine if it is possible to obtain a similar predictive power. To begin, grow 100 trees on these features only. The first three of the five selected features are numeric and the last two are categorical.
b5v = TreeBagger(100,X(:,idxvar),Y,'Method','R',... 'OOBVarImp','On','CategoricalPredictors',idxCategorical,... 'MinLeaf',5); figure plot(oobError(b5v)); xlabel 'Number of Grown Trees'; ylabel 'Out-of-Bag Mean Squared Error'; figure bar(b5v.OOBPermutedVarDeltaError); xlabel 'Feature Index'; ylabel 'Out-of-Bag Feature Importance';
These five most powerful features give the same MSE as the full set, and the ensemble trained on the reduced set ranks these features similarly to each other. If you remove features 1 and 2 from the reduced set, then the predictive power of the algorithm might not decrease significantly.
To find outliers in the training data, compute the proximity matrix using fillProximities.
b5v = fillProximities(b5v);
The method normalizes this measure by subtracting the mean outlier measure for the entire sample. Then it takes the magnitude of this difference and divides the result by the median absolute deviation for the entire sample.
figure hist(b5v.OutlierMeasure); xlabel 'Outlier Measure'; ylabel 'Number of Observations';
Discovering Clusters in the Data
By applying multidimensional scaling to the computed matrix of proximities, you can inspect the structure of the input data and look for possible clusters of observations. The mdsProx method returns scaled coordinates and eigenvalues for the computed proximity matrix. If you run it with the Colors name-value-pair argument, then this method creates a scatter plot of two scaled coordinates.
figure(8); [~,e] = mdsProx(b5v,'Colors','K'); xlabel 'First Scaled Coordinate'; ylabel 'Second Scaled Coordinate';
Assess the relative importance of the scaled axes by plotting the first 20 eigenvalues.
figure bar(e(1:20)); xlabel 'Scaled Coordinate Index'; ylabel 'Eigenvalue';
Saving the Ensemble Configuration for Future Use
To use the trained ensemble for predicting the response on unseen data, store the ensemble to disk and retrieve it later. If you do not want to compute predictions for out-of-bag data or reuse training data in any other way, there is no need to store the ensemble object itself. Saving the compact version of the ensemble is enough in this case. Extract the compact object from the ensemble.
c = compact(b5v)
c = CompactTreeBagger Ensemble with 100 bagged decision trees: Method: regression Nvars: 4
You can save the resulting CompactTreeBagger model in a *.mat file.
You can also use ensembles of decision trees for classification. For this example, use ionosphere data with 351 observations and 34 real-valued predictors. The response variable is categorical with two levels:
'g' represents good radar returns.
'b' represents bad radar returns.
The goal is to predict good or bad returns using a set of 34 measurements.
Fix the initial random seed, grow 50 trees, inspect how the ensemble error changes with accumulation of trees, and estimate feature importance. For classification, it is best to set the minimal leaf size to 1 and select the square root of the total number of features for each decision split at random. These settings are defaults for TreeBagger used for classification.
load ionosphere; rng(1945,'twister') b = TreeBagger(50,X,Y,'OOBVarImp','On'); figure plot(oobError(b)); xlabel('Number of Grown Trees'); ylabel('Out-of-Bag Classification Error');
The method trains ensembles with few trees on observations that are in bag for all trees. For such observations, it is impossible to compute the true out-of-bag prediction, and TreeBagger returns the most probable class for classification and the sample mean for regression. You can change the default value returned for in-bag observations using the DefaultYfit property. If you set the default value to an empty string for classification, the method excludes in-bag observations from computation of the out-of-bag error. In this case, the curve is more variable when the number of trees is small, either because some observations are never out of bag (and are therefore excluded) or because their predictions are based on few trees.
b.DefaultYfit = ''; figure plot(oobError(b)); xlabel('Number of Grown Trees'); ylabel('Out-of-Bag Error Excluding In-Bag Observations');
The OOBIndices property of TreeBagger tracks which observations are out of bag for what trees. Using this property, you can monitor the fraction of observations in the training data that are in bag for all trees. The curve starts at approximately 2/3, which is the fraction of unique observations selected by one bootstrap replica, and goes down to 0 at approximately 10 trees.
finbag = zeros(1,b.NTrees); for t=1:b.NTrees finbag(t) = sum(all(~b.OOBIndices(:,1:t),2)); end finbag = finbag / size(X,1); figure plot(finbag); xlabel('Number of Grown Trees'); ylabel('Fraction of In-Bag Observations');
Estimate feature importance.
figure bar(b.OOBPermutedVarDeltaError); xlabel('Feature Index'); ylabel('Out-of-Bag Feature Importance'); idxvar = find(b.OOBPermutedVarDeltaError>0.75)
idxvar = 3 5 8 27
Having selected the five most important features, grow a larger ensemble on the reduced feature set. Save time by not permuting out-of-bag observations to obtain new estimates of feature importance for the reduced feature set (set OOBVarImp to 'off'). You would still be interested in obtaining out-of-bag estimates of classification error (set OOBPred to 'on').
b5v = TreeBagger(100,X(:,idxvar),Y,'OOBVarImp','off','OOBPred','on'); figure plot(oobError(b5v)); xlabel('Number of Grown Trees'); ylabel('Out-of-Bag Classification Error');
For classification ensembles, in addition to classification error (fraction of misclassified observations), you can also monitor the average classification margin. For each observation, the margin is defined as the difference between the score for the true class and the maximal score for other classes predicted by this tree. The cumulative classification margin uses the scores averaged over all trees and the mean cumulative classification margin is the cumulative margin averaged over all observations. The oobMeanMargin method with the 'mode' argument set to 'cumulative' (default) shows how the mean cumulative margin changes as the ensemble grows: every new element in the returned array represents the cumulative margin obtained by including a new tree in the ensemble. If training is successful, you would expect to see a gradual increase in the mean classification margin.
The method trains ensembles with few trees on observations that are in bag for all trees. For such observations, it is impossible to compute the true out-of-bag prediction, and TreeBagger returns the most probable class for classification and the sample mean for regression.
For decision trees, a classification score is the probability of observing an instance of this class in this tree leaf. For example, if the leaf of a grown decision tree has five 'good' and three 'bad' training observations in it, the scores returned by this decision tree for any observation fallen on this leaf are 5/8 for the 'good' class and 3/8 for the 'bad' class. These probabilities are called 'scores' for consistency with other classifiers that might not have an obvious interpretation for numeric values of returned predictions.
figure plot(oobMeanMargin(b5v)); xlabel('Number of Grown Trees'); ylabel('Out-of-Bag Mean Classification Margin');
Compute the matrix of proximities and examine the distribution of outlier measures. Unlike regression, outlier measures for classification ensembles are computed within each class separately.
b5v = fillProximities(b5v); figure hist(b5v.OutlierMeasure); xlabel('Outlier Measure'); ylabel('Number of Observations');
Slightly more than half of the extreme outliers are labeled 'bad'.
extremeOutliers = b5v.Y(b5v.OutlierMeasure>40)
percentBad = 100*sum(strcmp(extremeOutliers,'b'))/numel(extremeOutliers)
extremeOutliers = 'g' 'g' 'g' 'g' 'g' 'g' percentBad = 0
As for regression, you can plot scaled coordinates, displaying the two classes in different colors using the 'Colors' name-value pair argument of mdsProx. This argument takes a string in which every character represents a color. The software does not rank class names. Therefore, it is best practice to determine the position of the classes in the ClassNames property of the ensemble.
gPosition = find(strcmp('g',b5v.ClassNames))
gPosition = 2
The 'bad' class is first and the 'good' class is second. Display scaled coordinates using red for the 'bad' class and blue for the 'good' class observations.
figure [s,e] = mdsProx(b5v,'Colors','rb'); xlabel('First Scaled Coordinate'); ylabel('Second Scaled Coordinate');
Plot the first 20 eigenvalues obtained by scaling. The first eigenvalue clearly dominates and the first scaled coordinate is most important.
figure bar(e(1:20)); xlabel('Scaled Coordinate Index'); ylabel('Eigenvalue');
Another way of exploring the performance of a classification ensemble is to plot its Receiver Operating Characteristic (ROC) curve or another performance curve suitable for the current problem. Obtain predictions for out-of-bag observations. For a classification ensemble, the oobPredict method returns a cell array of classification labels as the first output argument and a numeric array of scores as the second output argument. The returned array of scores has two columns, one for each class. In this case, the first column is for the 'bad' class and the second column is for the 'good' class. One column in the score matrix is redundant because the scores represent class probabilities in tree leaves and by definition add up to 1.
[Yfit,Sfit] = oobPredict(b5v);
Use perfcurve to compute a performance curve. By default, perfcurve returns the standard ROC curve, which is the true positive rate versus the false positive rate. perfcurve requires true class labels, scores, and the positive class label for input. In this case, choose the 'good' class as positive.
[fpr,tpr] = perfcurve(b5v.Y,Sfit(:,gPosition),'g'); figure plot(fpr,tpr); xlabel('False Positive Rate'); ylabel('True Positive Rate');
Instead of the standard ROC curve, you might want to plot, for example, ensemble accuracy versus threshold on the score for the 'good' class. The ycrit input argument of perfcurve lets you specify the criterion for the y-axis, and the third output argument of perfcurve returns an array of thresholds for the positive class score. Accuracy is the fraction of correctly classified observations, or equivalently, 1 minus the classification error.
[fpr,accu,thre] = perfcurve(b5v.Y,Sfit(:,gPosition),'g','YCrit','Accu'); figure(20); plot(thre,accu); xlabel('Threshold for ''good'' Returns'); ylabel('Classification Accuracy');
The curve shows a flat region indicating that any threshold from 0.2 to 0.6 is a reasonable choice. By default, the perfcurve assigns classification labels using 0.5 as the boundary between the two classes. You can find exactly what accuracy this corresponds to.
accu(abs(thre-0.5)<eps)
ans = Empty matrix: 0-by-1
The maximal accuracy is a little higher than the default one.
[maxaccu,iaccu] = max(accu)
maxaccu = 0.9316 iaccu = 104
The optimal threshold is therefore.
thre(iaccu)
ans = 0.4570
AdaBoostM1 is a very popular boosting algorithm for binary classification. The algorithm trains learners sequentially. For every learner with index t, AdaBoostM1 computes the weighted classification error
$${\epsilon}_{t}={\displaystyle \sum}_{n=1}^{N}{d}_{n}^{\left(t\right)}\mathbb{I}\left({y}_{n}\ne {h}_{t}\left({x}_{n}\right)\right),$$
where
x_{n} is a vector of predictor values for observation n.
y_{n} is the true class label.
h_{t} is the prediction of learner (hypothesis) with index t.
$$\mathbb{I}$$ is the indicator function.
$${d}_{n}^{\left(t\right)}$$ is the weight of observation n at step t.
AdaBoostM1 then increases weights for observations misclassified by learner t and reduces weights for observations correctly classified by learner t. The next learner t + 1 is then trained on the data with updated weights ${d}_{n}^{\left(t+1\right)}$.
After training finishes, AdaBoostM1 computes prediction for new data using
$f\left(x\right)={\displaystyle \sum}_{t=1}^{T}{\alpha}_{t}{h}_{t}\left(x\right),$
where
${\alpha}_{t}=\frac{1}{2}\mathrm{log}\frac{1-{\epsilon}_{t}}{{\epsilon}_{t}}$
are weights of the weak hypotheses in the ensemble.
Training by AdaBoostM1 can be viewed as stagewise minimization of the exponential loss
$\sum}_{n=1}^{N}{w}_{n}\mathrm{exp}\left(-{y}_{n}f\left({x}_{n}\right)\right),$
where
y_{n} ∊ {–1,+1} is the true class label.
w_{n} are observation weights normalized to add up to 1.
f(x_{n}) ∊ (–∞,+∞) is the predicted classification score.
The observation weights w_{n} are the original observation weights you passed to fitensemble.
The second output from the predict method of an AdaBoostM1 classification ensemble is an N-by-2 matrix of classification scores for the two classes and N observations. The second column in this matrix is always equal to minus the first column. predict returns two scores to be consistent with multiclass models, though this is redundant because the second column is always the negative of the first.
Most often AdaBoostM1 is used with decision stumps (default) or shallow trees. If boosted stumps give poor performance, try setting the minimal parent node size to one quarter of the training data.
By default, the learning rate for boosting algorithms is 1. If you set the learning rate to a lower number, the ensemble learns at a slower rate, but can converge to a better solution. 0.1 is a popular choice for the learning rate. Learning at a rate less than 1 is often called "shrinkage".
For examples using AdaBoostM1, see Tune RobustBoost.
For references related to AdaBoostM1, see Freund and Schapire [9], Schapire et al. [15], Friedman, Hastie, and Tibshirani [11], and Friedman [10].
AdaBoostM2 is an extension of AdaBoostM1 for multiple classes. Instead of weighted classification error, AdaBoostM2 uses weighted pseudo-loss for N observations and K classes
${\epsilon}_{t}=\frac{1}{2}{\displaystyle \sum}_{n=1}^{N}{\displaystyle \sum}_{k\ne {y}_{n}}{d}_{n,k}^{\left(t\right)}\left(1-{h}_{t}\left({x}_{n},{y}_{n}\right)+{h}_{t}\left({x}_{n},k\right)\right),$
where
h_{t}(x_{n},k) is the confidence of prediction by learner at step t into class k ranging from 0 (not at all confident) to 1 (highly confident).
${d}_{n,k}^{\left(t\right)}$ are observation weights at step t for class k.
y_{n} is the true class label taking one of the K values.
The second sum is over all classes other than the true class y_{n}.
Interpreting the pseudo-loss is harder than classification error, but the idea is the same. Pseudo-loss can be used as a measure of the classification accuracy from any learner in an ensemble. Pseudo-loss typically exhibits the same behavior as a weighted classification error for AdaBoostM1: the first few learners in a boosted ensemble give low pseudo-loss values. After the first few training steps, the ensemble begins to learn at a slower pace, and the pseudo-loss value approaches 0.5 from below.
For examples using AdaBoostM2, see Create a Classification Ensemble.
For references related to AdaBoostM2, see Freund and Schapire [9].
Bagging, which stands for "bootstrap aggregation," is a type of ensemble learning. To bag a weak learner such as a decision tree on a dataset, generate many bootstrap replicas of this dataset and grow decision trees on these replicas. Obtain each bootstrap replica by randomly selecting N observations out of N with replacement, where N is the dataset size. To find the predicted response of a trained ensemble, take an average over predictions from individual trees.
Bagged decision trees were introduced in MATLAB R2009a as TreeBagger. The fitensemble function lets you bag in a manner consistent with boosting. An ensemble of bagged trees, either ClassificationBaggedEnsemble or RegressionBaggedEnsemble, returned by fitensemble offers almost the same functionally as TreeBagger. Discrepancies between TreeBagger and the new framework are described in detail in TreeBagger Features Not in fitensemble.
Bagging works by training learners on resampled versions of the data. This resampling is usually done by bootstrapping observations, that is, selecting N out of N observations with replacement for every new learner. In addition, every tree in the ensemble can randomly select predictors for decision splits—a technique known to improve the accuracy of bagged trees.
By default, the minimal leaf sizes for bagged trees are set to 1 for classification and 5 for regression. Trees grown with the default leaf size are usually very deep. These settings are close to optimal for the predictive power of an ensemble. Often you can grow trees with larger leaves without losing predictive power. Doing so reduces training and prediction time, as well as memory usage for the trained ensemble.
Another important parameter is the number of predictors selected at random for every decision split. This random selection is made for every split, and every deep tree involves many splits. By default, this parameter is set to a square root of the number of predictors for classification, and one third of predictors for regression.
Several features of bagged decision trees make them a unique algorithm. Drawing N out of N observations with replacement omits on average 37% of observations for each decision tree. These are "out-of-bag" observations. You can use them to estimate the predictive power and feature importance. For each observation, you can estimate the out-of-bag prediction by averaging over predictions from all trees in the ensemble for which this observation is out of bag. You can then compare the computed prediction against the observed response for this observation. By comparing the out-of-bag predicted responses against the observed responses for all observations used for training, you can estimate the average out-of-bag error. This out-of-bag average is an unbiased estimator of the true ensemble error. You can also obtain out-of-bag estimates of feature importance by randomly permuting out-of-bag data across one variable or column at a time and estimating the increase in the out-of-bag error due to this permutation. The larger the increase, the more important the feature. Thus, you need not supply test data for bagged ensembles because you obtain reliable estimates of the predictive power and feature importance in the process of training, which is an attractive feature of bagging.
Another attractive feature of bagged decision trees is the proximity matrix. Every time two observations land on the same leaf of a tree, their proximity increases by 1. For normalization, sum these proximities over all trees in the ensemble and divide by the number of trees. The resulting matrix is symmetric with diagonal elements equal to 1 and off-diagonal elements ranging from 0 to 1. You can use this matrix for finding outlier observations and discovering clusters in the data through multidimensional scaling.
For examples using bagging, see:
Regression of Insurance Risk Rating for Car Imports Using TreeBagger
Classifying Radar Returns for Ionosphere Data Using TreeBagger
For references related to bagging, see Breiman [3], [4], and [5].
Comparison of TreeBagger and Bagged Ensembles. fitensemble produces bagged ensembles that have most, but not all, of the functionality of TreeBagger objects. Additionally, some functionality has different names in the new bagged ensembles.
TreeBagger Features Not in fitensemble
Feature | TreeBagger Property | TreeBagger Method |
---|---|---|
Computation of proximity matrix | Proximity | fillProximities, mdsProx |
Computation of outliers | OutlierMeasure | N/A |
Out-of-bag estimates of predictor importance | OOBPermutedVarDeltaError, OOBPermutedVarDeltaMeanMargin, OOBPermutedVarCountRaiseMargin | N/A |
Merging two ensembles trained separately | N/A | append |
Parallel computation for creating ensemble | Set the UseParallel name-value pair to true | N/A |
Differing Names Between TreeBagger and Bagged Ensembles
Feature | TreeBagger | Bagged Ensembles |
---|---|---|
Split criterion contributions for each predictor | DeltaCritDecisionSplit property | First output of predictorImportance (classification) or predictorImportance (regression) |
Predictor associations | VarAssoc property | Second output of predictorImportance (classification) or predictorImportance (regression) |
Error (misclassification probability or mean-squared error) | error and oobError methods | loss and oobLoss methods (classification); loss and oobLoss methods (regression) |
Train additional trees and add to ensemble | growTrees method | resume method (classification); resume method (regression) |
Mean classification margin per tree | meanMargin and oobMeanMargin methods | edge and oobEdge methods (classification) |
In addition, two important changes were made to training and prediction for bagged classification ensembles:
If you pass a misclassification cost matrix to TreeBagger, it passes this matrix along to the trees. If you pass a misclassification cost matrix to fitensemble, it uses this matrix to adjust the class prior probabilities. fitensemble then passes the adjusted prior probabilities and the default cost matrix to the trees. The default cost matrix is ones(K)-eye(K) for K classes.
Unlike the loss and edge methods in the new framework, the TreeBagger error and meanMargin methods do not normalize input observation weights of the prior probabilities in the respective class.
GentleBoost (also known as Gentle AdaBoost) combines features of AdaBoostM1 and LogitBoost. Like AdaBoostM1, GentleBoost minimizes the exponential loss. But its numeric optimization is set up differently. Like LogitBoost, every weak learner fits a regression model to response values y_{n} ∊ {–1,+1}. This makes GentleBoost another good candidate for binary classification of data with multilevel categorical predictors.
fitensemble computes and stores the mean-squared error in the FitInfo property of the ensemble object. The mean-squared error is
$\sum}_{n=1}^{N}{d}_{n}^{\left(t\right)}{\left({\tilde{y}}_{n}-{h}_{t}\left({x}_{n}\right)\right)}^{2},$
where
${d}_{n}^{\left(t\right)}$ are observation weights at step t (the weights add up to 1).
h_{t}(x_{n}) are predictions of the regression model h_{t} fitted to response values y_{n}.
As the strength of individual learners weakens, the weighted mean-squared error approaches 1.
For examples using GentleBoost, see Example: Unequal Classification Costs and Classification with Many Categorical Levels.
For references related to GentleBoost, see Friedman, Hastie, and Tibshirani [11].
LogitBoost is another popular algorithm for binary classification. LogitBoost works similarly to AdaBoostM1, except it minimizes binomial deviance
$\sum}_{n=1}^{N}{w}_{n}\mathrm{log}\left(1+\mathrm{exp}\left(-2{y}_{n}f\left({x}_{n}\right)\right)\right),$
where
y_{n} ∊ {–1,+1} is the true class label.
w_{n} are observation weights normalized to add up to 1.
f(x_{n}) ∊ (–∞,+∞) is the predicted classification score.
Binomial deviance assigns less weight to badly misclassified observations (observations with large negative values of y_{n}f(x_{n})). LogitBoost can give better average accuracy than AdaBoostM1 for data with poorly separable classes.
Learner t in a LogitBoost ensemble fits a regression model to response values
${\tilde{y}}_{n}=\frac{{y}_{n}^{*}-{p}_{t}\left({x}_{n}\right)}{{p}_{t}\left({x}_{n}\right)\left(1-{p}_{t}\left({x}_{n}\right)\right)},$
where
y*_{n} ∊ {0,+1} are relabeled classes (0 instead of –1).
p_{t}(x_{n}) is the current ensemble estimate of the probability for observation x_{n} to be of class 1.
Fitting a regression model at each boosting step turns into a great computational advantage for data with multilevel categorical predictors. Take a categorical predictor with L levels. To find the optimal decision split on such a predictor, classification tree needs to consider 2^{L–1} – 1 splits. A regression tree needs to consider only L – 1 splits, so the processing time can be much shorter. LogitBoost is recommended for categorical predictors with many levels.
fitensemble computes and stores the mean-squared error in the FitInfo property of the ensemble object. The mean-squared error is
$\sum}_{n=1}^{N}{d}_{n}^{\left(t\right)}{\left({\tilde{y}}_{n}-{h}_{t}\left({x}_{n}\right)\right)}^{2},$
where
${d}_{n}^{\left(t\right)}$ are observation weights at step t (the weights add up to 1).
h_{t}(x_{n}) are predictions of the regression model h_{t} fitted to response values ${\tilde{y}}_{n}$.
Values y_{n} can range from –∞ to +∞, so the mean-squared error does not have well-defined bounds.
For examples using LogitBoost, see Classification with Many Categorical Levels.
For references related to LogitBoost, see Friedman, Hastie, and Tibshirani [11].
LPBoost (linear programming boost), like TotalBoost, performs multiclass classification by attempting to maximize the minimal margin in the training set. This attempt uses optimization algorithms, namely linear programming for LPBoost. So you need an Optimization Toolbox license to use LPBoost or TotalBoost.
The margin of a classification is the difference between the predicted soft classification score for the true class, and the largest score for the false classes. For trees, the score of a classification of a leaf node is the posterior probability of the classification at that node. The posterior probability of the classification at a node is the number of training sequences that lead to that node with the classification, divided by the number of training sequences that lead to that node. For more information, see Definitions in margin.
Why maximize the minimal margin? For one thing, the generalization error (the error on new data) is the probability of obtaining a negative margin. Schapire and Singer [16] establish this inequality on the probability of obtaining a negative margin:
$${P}_{\text{test}}\left(m\le 0\right)\le {P}_{\text{train}}\left(m\le \theta \right)+O\left(\frac{1}{\sqrt{N}}\sqrt{\frac{V{\mathrm{log}}^{2}(N/V)}{{\theta}^{2}}+\mathrm{log}(1/\delta )}\right).$$
Here m is the margin, θ is any positive number, V is the Vapnik-Chervonenkis dimension of the classifier space, N is the size of the training set, and δ is a small positive number. The inequality holds with probability 1–δ over many i.i.d. training and test sets. This inequality says: To obtain a low generalization error, minimize the number of observations below margin θ in the training set.
LPBoost iteratively maximizes the minimal margin through a sequence of linear programming problems. Equivalently, by duality, LPBoost minimizes the maximal edge, where edge is the weighted mean margin (see Definitions). At each iteration, there are more constraints in the problem. So, for large problems, the optimization problem becomes increasingly constrained, and slow to solve.
LPBoost typically creates ensembles with many learners having weights that are orders of magnitude smaller than those of other learners. Therefore, to better enable you to remove the unimportant ensemble members, the compact method reorders the members of an LPBoost ensemble from largest weight to smallest. Therefore, you can easily remove the least important members of the ensemble using the removeLearners method.
For examples using LPBoost, see LPBoost and TotalBoost for Small Ensembles.
For references related to LPBoost, see Warmuth, Liao, and Ratsch [18].
LSBoost (least squares boosting) fits regression ensembles. At every step, the ensemble fits a new learner to the difference between the observed response and the aggregated prediction of all learners grown previously. The ensemble fits to minimize mean-squared error.
You can use LSBoost with shrinkage by passing in the LearnRate parameter. By default this parameter is set to 1, and the ensemble learns at the maximal speed. If you set LearnRate to a value from 0 to 1, the ensemble fits every new learner to y_{n} – ηf(x_{n}), where
y_{n} is the observed response.
f(x_{n}) is the aggregated prediction from all weak learners grown so far for observation x_{n}.
η is the learning rate.
For examples using LSBoost, see Creating a Regression Ensemble and Regularize a Regression Ensemble.
For references related to LSBoost, see Hastie, Tibshirani, and Friedman [12], Chapters 7 (Model Assessment and Selection) and 15 (Random Forests).
Boosting algorithms such as AdaBoostM1 and LogitBoost increase weights for misclassified observations at every boosting step. These weights can become very large. If this happens, the boosting algorithm sometimes concentrates on a few misclassified observations and neglects the majority of training data. Consequently the average classification accuracy suffers.
In this situation, you can try using RobustBoost. This algorithm does not assign almost the entire data weight to badly misclassified observations. It can produce better average classification accuracy.
Unlike AdaBoostM1 and LogitBoost, RobustBoost does not minimize a specific loss function. Instead, it maximizes the number of observations with the classification margin above a certain threshold.
RobustBoost trains based on time evolution. The algorithm starts at t = 0. At every step, RobustBoost solves an optimization problem to find a positive step in time Δt and a corresponding positive change in the average margin for training data Δm. RobustBoost stops training and exits if at least one of these three conditions is true:
Time t reaches 1.
RobustBoost cannot find a solution to the optimization problem with positive updates Δt and Δm.
RobustBoost grows as many learners as you requested.
Results from RobustBoost can be usable for any termination condition. Estimate the classification accuracy by cross validation or by using an independent test set.
To get better classification accuracy from RobustBoost, you can adjust three parameters in fitensemble: RobustErrorGoal, RobustMaxMargin, and RobustMarginSigma. Start by varying values for RobustErrorGoal from 0 to 1. The maximal allowed value for RobustErrorGoal depends on the two other parameters. If you pass a value that is too high, fitensemble produces an error message showing the allowed range for RobustErrorGoal.
For examples using RobustBoost, see Tune RobustBoost.
For references related to RobustBoost, see Freund [8].
RUSBoost is especially effective at classifying imbalanced data, meaning some class in the training data has many fewer members than another. RUS stands for Random Under Sampling. The algorithm takes N, the number of members in the class with the fewest members in the training data, as the basic unit for sampling. Classes with more members are under sampled by taking only N observations of every class. In other words, if there are K classes, then, for each weak learner in the ensemble, RUSBoost takes a subset of the data with N observations from each of the K classes. The boosting procedure follows the procedure in AdaBoostM2 for reweighting and constructing the ensemble.
When you construct a RUSBoost ensemble, there is an optional name-value pair called RatioToSmallest. Give a vector of K values, each value representing the multiple of N to sample for the associated class. For example, if the smallest class has N = 100 members, then RatioToSmallest = [2,3,4] means each weak learner has 200 members in class 1, 300 in class 2, and 400 in class 3. If RatioToSmallest leads to a value that is larger than the number of members in a particular class, then RUSBoost samples the members with replacement. Otherwise, RUSBoost samples the members without replacement.
For examples using RUSBoost, see Classification with Imbalanced Data.
For references related to RUSBoost, see Seiffert et al. [17].
Use random subspace ensembles (Subspace) to improve the accuracy of discriminant analysis (ClassificationDiscriminant) or k-nearest neighbor (ClassificationKNN) classifiers. Subspace ensembles also have the advantage of using less memory than ensembles with all predictors, and can handle missing values (NaNs).
The basic random subspace algorithm uses these parameters.
m is the number of dimensions (variables) to sample in each learner. Set m using the NPredToSample name-value pair.
d is the number of dimensions in the data, which is the number of columns (predictors) in the data matrix X.
n is the number of learners in the ensemble. Set n using the NLearn input.
The basic random subspace algorithm performs the following steps:
Choose without replacement a random set of m predictors from the d possible values.
Train a weak learner using just the m chosen predictors.
Repeat steps 1 and 2 until there are n weak learners.
Predict by taking an average of the score prediction of the weak learners, and classify the category with the highest average score.
You can choose to create a weak learner for every possible set of m predictors from the d dimensions. To do so, set n, the number of learners, to 'AllPredictorCombinations'. In this case, there are nchoosek(size(X,2),NPredToSample) weak learners in the ensemble.
fitensemble downweights predictors after choosing them for a learner, so subsequent learners have a lower chance of using a predictor that was previously used. This weighting tends to make predictors more evenly distributed among learners than in uniform weighting.
For examples using Subspace, see Random Subspace Classification.
For references related to random subspace ensembles, see Ho [13].
TotalBoost, like linear programming boost (LPBoost), performs multiclass classification by attempting to maximize the minimal margin in the training set. This attempt uses optimization algorithms, namely quadratic programming for TotalBoost. So you need an Optimization Toolbox license to use LPBoost or TotalBoost.
The margin of a classification is the difference between the predicted soft classification score for the true class, and the largest score for the false classes. For trees, the score of a classification of a leaf node is the posterior probability of the classification at that node. The posterior probability of the classification at a node is the number of training sequences that lead to that node with the classification, divided by the number of training sequences that lead to that node. For more information, see Definitions in margin.
Why maximize the minimal margin? For one thing, the generalization error (the error on new data) is the probability of obtaining a negative margin. Schapire and Singer [16] establish this inequality on the probability of obtaining a negative margin:
$${P}_{\text{test}}\left(m\le 0\right)\le {P}_{\text{train}}\left(m\le \theta \right)+O\left(\frac{1}{\sqrt{N}}\sqrt{\frac{V{\mathrm{log}}^{2}(N/V)}{{\theta}^{2}}+\mathrm{log}(1/\delta )}\right).$$
Here m is the margin, θ is any positive number, V is the Vapnik-Chervonenkis dimension of the classifier space, N is the size of the training set, and δ is a small positive number. The inequality holds with probability 1–δ over many i.i.d. training and test sets. This inequality says: To obtain a low generalization error, minimize the number of observations below margin θ in the training set.
TotalBoost minimizes a proxy of the Kullback-Leibler divergence between the current weight distribution and the initial weight distribution, subject to the constraint that the edge (the weighted margin) is below a certain value. The proxy is a quadratic expansion of the divergence:
$$D(W,{W}_{0})={\displaystyle \sum _{n=1}^{N}\mathrm{log}\frac{W(n)}{{W}_{0}(n)}}\approx {\displaystyle \sum _{n=1}^{N}\left(1+\frac{W(n)}{{W}_{0}(n)}\right)\Delta +\frac{1}{2W(n)}{\Delta}^{2}},$$
where Δ is the difference between W(n), the weights at the current and next iteration, and W_{0}, the initial weight distribution, which is uniform. This optimization formulation keeps weights from becoming zero. At each iteration, there are more constraints in the problem. So, for large problems, the optimization problem becomes increasingly constrained, and slow to solve.
TotalBoost typically creates ensembles with many learners having weights that are orders of magnitude smaller than those of other learners. Therefore, to better enable you to remove the unimportant ensemble members, the compact method reorders the members of a TotalBoost ensemble from largest weight to smallest. Therefore you can easily remove the least important members of the ensemble using the removeLearners method.
For examples using TotalBoost, see LPBoost and TotalBoost for Small Ensembles.
For references related to TotalBoost, see Warmuth, Liao, and Ratsch [18].