fitensemble

Fitted ensemble for classification or regression

Syntax

Ensemble = fitensemble(X,Y,Method,NLearn,Learners)
Ensemble = fitensemble(X,Y,Method,NLearn,Learners,Name,Value)

Description

Ensemble = fitensemble(X,Y,Method,NLearn,Learners) creates an ensemble model that predicts responses to data. The ensemble consists of weak learners of the types listed in Learners, trained on predictors X and response Y for NLearn cycles using the aggregation method Method.

Ensemble = fitensemble(X,Y,Method,NLearn,Learners,Name,Value) creates an ensemble model with additional options specified by one or more Name,Value pair arguments. You can specify several name-value pair arguments in any order as Name1,Value1,…,NameN,ValueN.
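
For example, a minimal sketch of both syntaxes, assuming a predictor matrix X and a two-class label array Y already exist in the workspace:

% Positional arguments only
ens1 = fitensemble(X,Y,'AdaBoostM1',100,'Tree');
% Same call with an added name-value pair (print progress every 25 cycles)
ens2 = fitensemble(X,Y,'AdaBoostM1',100,'Tree','NPrint',25);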

Input Arguments

X

Matrix of predictor values. Each column of X represents one variable, and each row represents one observation.

Y

For classification, Y is a categorical variable, character array, or cell array of strings. Each row of Y represents the classification of the corresponding row of X.

For regression, Y is a numeric column vector with the same number of rows as X. Each entry in Y is the response to the data in the corresponding row of X.

Method

Case-insensitive string specifying the ensemble learning method, one of the following.

  • For classification with two classes:

    • 'AdaBoostM1'

    • 'LogitBoost'

    • 'GentleBoost'

    • 'RobustBoost' (requires an Optimization Toolbox™ license)

    • 'LPBoost' (requires an Optimization Toolbox license)

    • 'TotalBoost' (requires an Optimization Toolbox license)

    • 'RUSBoost'

    • 'Subspace'

    • 'Bag'

  • For classification with three or more classes:

    • 'AdaBoostM2'

    • 'LPBoost' (requires an Optimization Toolbox license)

    • 'TotalBoost' (requires an Optimization Toolbox license)

    • 'RUSBoost'

    • 'Subspace'

    • 'Bag'

  • For regression:

    • 'LSBoost'

    • 'Bag'

'Bag' applies to both classification and regression. So when you use 'Bag', indicate whether you want a classifier or regressor with the Type name-value pair set to 'classification' or 'regression'.
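
For example, a sketch of a bagged classification ensemble (X and Y assumed to exist in the workspace):

% 'Bag' works for both problem types, so Type is required
ens = fitensemble(X,Y,'Bag',100,'Tree','Type','classification');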

NLearn

Number of ensemble learning cycles, a positive integer (or the string 'AllPredictorCombinations', see the next paragraph). At every training cycle, fitensemble loops over all learner templates in Learners and trains one weak learner for every template. The total number of trained learners in Ensemble is NLearn*numel(Learners).

If you set Method to 'Subspace', you can set NLearn to 'AllPredictorCombinations'. With this setting, fitensemble constructs learners for all possible combinations of predictors taken NPredToSample at a time. This gives a total of nchoosek(size(X,2),NPredToSample) learners in the ensemble. You can use only one learner template for this setting.
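
As an illustrative sketch, an exhaustive subspace ensemble of KNN learners with two predictors per learner:

% Trains nchoosek(size(X,2),2) learners, one per pair of predictors
ens = fitensemble(X,Y,'Subspace','AllPredictorCombinations','KNN',...
    'NPredToSample',2);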

NLearn for ensembles can vary from a few dozen to a few thousand. Usually, an ensemble with good predictive power needs from a few hundred to a few thousand weak learners. You do not have to train an ensemble for that many cycles at once. You can start by growing a few dozen learners, inspect the ensemble performance, and, if necessary, train more weak learners using the resume method of the ensemble.
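
A sketch of this incremental workflow (assumes X and a two-class Y in the workspace):

ens = fitensemble(X,Y,'LogitBoost',50,'Tree');  % start with 50 learners
loss = resubLoss(ens);                          % inspect performance so far
ens = resume(ens,50);                           % if needed, train 50 more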

Learners

One of the following:

  • A string with the name of a weak learner:

    • 'Discriminant' (applies only to 'Subspace')

    • 'KNN' (applies only to 'Subspace')

    • 'Tree' (applies to all methods except 'Subspace')

  • A single weak learner template you create with templateTree, templateKNN, or templateDiscriminant.

  • A cell array of weak learner templates. Usually you should supply only one weak learner template.

Ensemble performance depends on the parameters of the weak learners, and you can get poor performance using weak learners with default parameters. Specify the parameters for the weak learners in the template. Specify parameters for the ensemble in the fitensemble name-value pairs.
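
For instance, a hedged sketch that tunes the weak learner through its template and the ensemble through fitensemble name-value pairs; the MinLeaf and LearnRate values are illustrative, not recommendations, and Y is assumed to have two classes:

t = templateTree('MinLeaf',5);          % weak-learner parameter in the template
ens = fitensemble(X,Y,'GentleBoost',200,t,...
    'LearnRate',0.1);                   % ensemble parameter in fitensemble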

Name-Value Pair Arguments

Specify optional comma-separated pairs of Name,Value arguments. Name is the argument name and Value is the corresponding value. Name must appear inside single quotes (' '). You can specify several name and value pair arguments in any order as Name1,Value1,...,NameN,ValueN.

All Ensembles:

'CategoricalPredictors'

List of categorical predictors. Pass CategoricalPredictors as one of:

  • A numeric vector with indices from 1 to p, where p is the number of columns of X.

  • A logical vector of length p, where a true entry means that the corresponding column of X is a categorical variable.

  • 'All', meaning all predictors are categorical.

  • A cell array of strings, where each element in the array is the name of a predictor variable. The names must match entries in the PredictorNames property.

  • A character matrix, where each row of the matrix is the name of a predictor variable. The names must match entries in the PredictorNames property. Pad the names with extra blanks so each row of the character matrix has the same length.

You can set CategoricalPredictors for these learners:

  • 'Tree'

  • 'KNN', when all predictors are categorical

Default: []

'CrossVal'

If 'On', grows a cross-validated learner with 10 folds. You can use 'KFold', 'Holdout', 'Leaveout', or 'CVPartition' parameters to override this cross-validation setting. You can only use one of these four parameters ('KFold', 'Holdout', 'Leaveout', or 'CVPartition') at a time when creating a cross-validated learner.

Default: 'Off'

'CVPartition'

Partition created with cvpartition to use in a cross-validated learner. You can only use one of these four options at a time: 'KFold', 'Holdout', 'Leaveout', or 'CVPartition'.

'FResample'

Fraction of the training set to be selected by resampling for every weak learner. A numeric scalar from 0 to 1. This parameter has no effect unless you grow an ensemble by bagging or set 'Resample' to 'On'. The default setting is the one used most often for an ensemble grown by resampling.

Default: 1

'Holdout'

Holdout validation tests the specified fraction of the data, and uses the remaining data for training. Specify a numeric scalar from 0 to 1. You can only use one of these four options at a time for creating a cross-validated learner: 'KFold', 'Holdout', 'Leaveout', or 'CVPartition'.

'KFold'

Number of folds to use in a cross-validated learner, a positive integer. You can only use one of these four options at a time: 'KFold', 'Holdout', 'Leaveout', or 'CVPartition'.

Default: 10

'Leaveout'

Use leave-one-out cross validation by setting to 'on'. You can only use one of these four options at a time: 'KFold', 'Holdout', 'Leaveout', or 'CVPartition'.
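
A sketch of the mutually exclusive cross-validation options described above (use exactly one per call; X and Y assumed to exist):

cvens1 = fitensemble(X,Y,'AdaBoostM1',50,'Tree','KFold',5);      % 5 folds
cvens2 = fitensemble(X,Y,'AdaBoostM1',50,'Tree','Holdout',0.3);  % 30% held out
cvp = cvpartition(Y,'KFold',5);                 % explicit partition object
cvens3 = fitensemble(X,Y,'AdaBoostM1',50,'Tree','CVPartition',cvp);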

'NPredToSample'

Number of predictors in each random subspace learner, a positive integer from 1 to size(X,2).

Default: 1

'NPrint'

Printout frequency, a positive integer scalar. Set to 'Off' for no printout. Use this parameter to track how many weak learners have been trained so far. This is useful when you train ensembles with many learners on large datasets. If you use one of the cross-validation options, this parameter defines the printout frequency per number of cross-validation folds.

Default: 'Off'

'PredictorNames'

Cell array of names for the predictor variables, in the order in which they appear in X.

Default: {'x1','x2',...}

'Replace'

'On' or 'Off'. If 'On', sample with replacement. If 'Off', sample without replacement. This parameter has no effect unless you grow an ensemble by bagging or set Resample to 'On'. If you set Resample to 'On' and Replace to 'Off', fitensemble samples training observations assuming uniform weights, and boosts by reweighting observations.

Default: 'On'

'Resample'

'On' or 'Off'. If 'On', grow an ensemble by resampling, with the resampling fraction given by FResample, and sampling with or without replacement given by Replace.

  • Boosting — When 'Off', the boosting algorithm reweights observations at every learning iteration. When 'On', the algorithm samples training observations using updated weights as the multinomial sampling probabilities.

  • Bagging — You can use only the default value of this parameter ('On').

Default: 'Off' for boosting, 'On' for bagging
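
For example, a sketch of boosting by resampling, with illustrative fraction and replacement settings:

% Each weak learner sees a 50% sample drawn without replacement
ens = fitensemble(X,Y,'AdaBoostM1',100,'Tree',...
    'Resample','On','FResample',0.5,'Replace','Off');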

'ResponseName'

Name of the response variable Y, a string.

Default: 'Y'

'Type'

String, either 'Classification' or 'Regression'. Specify Type when Method is 'Bag'.

'Weights'

Vector of observation weights. The length of Weights is the number of rows in X.

Default: ones(size(X,1),1)

Classification Ensembles:

'ClassNames'

Array of class names. Specify using the same data type as Y.

Default: Class names that exist in Y

'Cost'

Square matrix C, where C(i,j) is the cost of classifying a point into class j if its true class is i. Alternatively, Cost can be a structure S having two fields:

  • S.ClassNames containing the group names as a categorical variable, character array, or cell array of strings

  • S.ClassificationCosts containing the cost matrix C

If Method is 'Bag', Type is 'Classification', and Cost is highly skewed, then, for in-bag samples, the software oversamples unique observations from the class that has a large penalty. For smaller sample sizes, this might cause a very low relative frequency of out-of-bag observations from the class that has a large penalty. Therefore, the estimated out-of-bag error is highly variable, and might be difficult to interpret.

Default: C(i,j) = 1 if i ~= j, and C(i,j) = 0 if i = j
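
A sketch of the matrix form for a two-class problem; the cost values are illustrative, and row i, column j follow the order of the class names:

C = [0 1; 5 0];   % misclassifying true class 2 costs five times as much
ens = fitensemble(X,Y,'AdaBoostM1',100,'Tree','Cost',C);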

'Prior'

Prior probabilities for each class. Specify as one of:

  • A string:

    • 'Empirical' determines class probabilities from class frequencies in Y. If you pass observation weights, they are used to compute the class probabilities.

    • 'Uniform' sets all class probabilities equal.

  • A vector (one scalar value for each class)

  • A structure S with two fields:

    • S.ClassNames containing the class names as a categorical variable, character array, or cell array of strings

    • S.ClassProbs containing a vector of corresponding probabilities

If you set values for both Weights and Prior, the weights are renormalized to add up to the value of the prior probability in the respective class.

If Method is 'Bag', Type is 'Classification', and Prior is highly skewed, then, for in-bag samples, the software oversamples unique observations from the class that has a large prior probability. For smaller sample sizes, this might cause a very low relative frequency of out-of-bag observations from the class that has a large prior probability. Therefore, the estimated out-of-bag error is highly variable, and might be difficult to interpret.

Default: 'Empirical'
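
A sketch of the structure form of Prior; the class names and probabilities are hypothetical and must match the classes that appear in Y:

S.ClassNames = {'b','g'};   % hypothetical class labels
S.ClassProbs = [0.4 0.6];   % corresponding prior probabilities
ens = fitensemble(X,Y,'AdaBoostM1',100,'Tree','Prior',S);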

AdaBoostM1, AdaBoostM2, LogitBoost, GentleBoost, RUSBoost, and LSBoost:

'LearnRate'

Learning rate for shrinkage, a numeric scalar from 0 to 1. If you set the learning rate to less than 1, the ensemble requires more learning iterations but often achieves a better accuracy. 0.1 is a popular choice for an ensemble grown with shrinkage.

Default: 1
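
For example, a sketch of shrinkage at the popular 0.1 rate, compensated with more learning cycles (assumes a numeric response Y, as LSBoost requires):

ens = fitensemble(X,Y,'LSBoost',500,'Tree','LearnRate',0.1);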

RUSBoost:

'RatioToSmallest'

Either a numeric scalar or a vector with K elements when there are K classes. Every element of this vector is the sampling proportion for the corresponding class with respect to the class with the fewest observations in Y. If you pass a scalar, fitensemble uses this sampling proportion for all classes. For example, suppose you have class A with 100 observations and class B with 10 observations. If you pass [2 1] for 'RatioToSmallest', every learner in the ensemble is trained on 20 observations of class A and 10 observations of class B. If you pass 2 or [2 2], every learner is trained on 20 observations of class A and 20 observations of class B. If you also pass 'ClassNames', fitensemble matches elements in the array of class names to elements in this vector.

Default: ones(K,1)
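
A sketch matching the example in the text, assuming Y is a cell array of strings with classes 'A' and 'B':

% Sample class A at twice the size of the smallest class, B at its own size
ens = fitensemble(X,Y,'RUSBoost',100,'Tree',...
    'RatioToSmallest',[2 1],'ClassNames',{'A','B'});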

LPBoost and TotalBoost:

'MarginPrecision'

Margin precision, a numeric scalar between 0 and 1. MarginPrecision affects the number of boosting iterations required for convergence. Use a small value to grow an ensemble with many learners, and use a large value to grow an ensemble with few learners.

Default: 0.01

RobustBoost:

'RobustErrorGoal'

Target classification error for RobustBoost, a numeric scalar from 0 to 1. Usually there is an optimal range for this parameter for your training data. If you set the error goal too low or too high, RobustBoost can produce a model with poor classification accuracy.

Default: 0.1

'RobustMarginSigma'

Spread of the distribution of classification margins over the training set for RobustBoost, a positive numeric scalar. You should consult the literature on RobustBoost before setting this parameter.

Default: 0.1

'RobustMaxMargin'

Maximal classification margin for RobustBoost in the training set, a nonnegative numeric scalar. RobustBoost minimizes the number of observations in the training set with classification margins below RobustMaxMargin.

Default: 0

Output Arguments

Ensemble

Trained ensemble object for predicting responses. The class of Ensemble depends on your settings, as shown in the following list. The cross-validation name-value pair arguments are 'CrossVal', 'KFold', 'Holdout', 'Leaveout', and 'CVPartition'.

Settings and resulting class:

  • Method is a classification method, the Resample name-value pair is 'Off', and you do not set a cross-validation name-value pair argument: ClassificationEnsemble

  • Method is a regression method, the Resample name-value pair is 'Off', and you do not set a cross-validation name-value pair argument: RegressionEnsemble

  • The Resample name-value pair is 'On', Type is 'classification', and you do not set a cross-validation name-value pair argument: ClassificationBaggedEnsemble

  • The Resample name-value pair is 'On', Type is 'regression', and you do not set a cross-validation name-value pair argument: RegressionBaggedEnsemble

  • Method is a classification method, and you set a cross-validation name-value pair argument: ClassificationPartitionedEnsemble

  • Method is a regression method, and you set a cross-validation name-value pair argument: RegressionPartitionedEnsemble
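
For instance, a sketch confirming the returned class for a cross-validated classification call (X and Y assumed):

cvens = fitensemble(X,Y,'AdaBoostM1',50,'Tree','KFold',5);
class(cvens)   % returns 'ClassificationPartitionedEnsemble'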

Examples

Estimate the Resubstitution Loss of a Boosting Ensemble

Estimate the resubstitution loss of a trained, boosting classification ensemble of decision trees.

Load the ionosphere data set.

load ionosphere;

Train a decision tree ensemble using AdaBoost, 100 learning cycles, and the entire data set.

ClassTreeEns = fitensemble(X,Y,'AdaBoostM1',100,'Tree');

ClassTreeEns is a trained ClassificationEnsemble ensemble classifier.

Determine the cumulative resubstitution losses (i.e., the cumulative misclassification error of the labels in the training data).

rsLoss = resubLoss(ClassTreeEns,'Mode','Cumulative');

rsLoss is a 100-by-1 vector, where element k contains the resubstitution loss after the first k learning cycles.

Plot the cumulative resubstitution loss over the number of learning cycles.

plot(rsLoss);
xlabel('Number of Learning Cycles');
ylabel('Resubstitution Loss');

In general, as the number of decision trees in the trained classification ensemble increases, the resubstitution loss decreases.

A decrease in resubstitution loss might indicate that the software trained the ensemble sensibly. However, you cannot infer the predictive power of the ensemble by this decrease. To measure the predictive power of an ensemble, estimate the generalization error by:

  1. Randomly partitioning the data into training and cross-validation sets. Do this by specifying 'Holdout',holdoutProportion when you train the ensemble using fitensemble.

  2. Passing the trained ensemble to kfoldLoss, which estimates the generalization error.

Train a Regression Ensemble

Use a trained, boosted regression tree ensemble to predict the fuel economy of a car. Choose the number of cylinders, volume displaced by the cylinders, horsepower, and weight as predictors.

Load the carsmall data set. Set the predictors to X.

load carsmall
X = [Cylinders,Displacement,Horsepower,Weight];
xnames = {'Cylinders','Displacement','Horsepower','Weight'};

Specify a regression tree template that uses surrogate splits to improve predictive accuracy in the presence of NaN values.

RegTreeTemp = templateTree('Surrogate','On');

Train the regression tree ensemble using LSBoost and 100 learning cycles.

RegTreeEns = fitensemble(X,MPG,'LSBoost',100,RegTreeTemp,...
    'PredictorNames',xnames);

RegTreeEns is a trained RegressionEnsemble regression ensemble.

Use the trained regression ensemble to predict the fuel economy of a four-cylinder car with a 200-cubic-inch displacement, 150 horsepower, and a weight of 3000 lbs.

predMPG = predict(RegTreeEns,[4 200 150 3000])
predMPG =

   21.7781

The predicted average fuel economy of a car with these specifications is 21.78 mpg.

Estimate the Generalization Error of a Boosting Ensemble

Estimate the generalization error of a trained, boosting classification ensemble of decision trees.

Load the ionosphere data set.

load ionosphere;

Train a decision tree ensemble using AdaBoostM1, 100 learning cycles, and half of the data chosen randomly. The software validates the algorithm using the remaining half.

rng(2); % For reproducibility
ClassTreeEns = fitensemble(X,Y,'AdaBoostM1',100,'Tree',...
    'Holdout',0.5);

ClassTreeEns is a trained ClassificationPartitionedEnsemble ensemble classifier, because the call sets the 'Holdout' cross-validation name-value pair.

Determine the cumulative generalization error (i.e., the cumulative misclassification error of the labels in the validation data).

genError = kfoldLoss(ClassTreeEns,'Mode','Cumulative');

genError is a 100-by-1 vector, where element k contains the generalization error after the first k learning cycles.

Plot the generalization error over the number of learning cycles.

plot(genError);
xlabel('Number of Learning Cycles');
ylabel('Generalization Error');

The cumulative generalization error decreases to approximately 7% when 25 weak learners compose the ensemble classifier.

More About

Tips

Avoid large estimated out-of-bag error variances by setting a more balanced misclassification cost matrix or a less skewed prior probability vector. This is particularly important if you train using a small sample size.

Algorithms

  • For details of boosting and bagging algorithms, see Ensemble Algorithms.

  • fitensemble generates in-bag samples by oversampling classes with large misclassification costs and undersampling classes with small misclassification costs. Consequently, out-of-bag samples have fewer observations from classes with large misclassification costs and more observations from classes with small misclassification costs. If you train a classification ensemble using a small data set and a highly skewed cost matrix, then the number of out-of-bag observations per class might be very low. Therefore, the estimated out-of-bag error might have a large variance and might be difficult to interpret. The same phenomenon can occur for classes with large prior probabilities.
