MATLAB Examples

Classification with Many Categorical Levels

This example shows how to train an ensemble of classification trees using data containing predictors with many categorical levels.

Generally, you cannot use classification with more than 31 levels in any categorical predictor. However, two boosting algorithms can classify data with many categorical predictor levels and binary responses: LogitBoost and GentleBoost. For details, see LogitBoost and GentleBoost.
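A categorical predictor with L levels admits 2^(L-1) - 1 distinct binary partitions, so an exhaustive split search grows exponentially with the number of levels. As a quick illustration of the scale just past the limit (this check is not part of the original example):

L = 32;
numPartitions = 2^(L-1) - 1  % 2,147,483,647 candidate binary splits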

The example uses the census1994 data set, which contains demographic data from the U.S. Census Bureau, to predict whether an individual makes more than $50,000 per year.

Load the census1994 file, and create a cell array of character vectors containing the variable names.

load census1994
VarNames = adultdata.Properties.VariableNames;
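Before cleaning up the names, you can preview the raw table and check the class balance of the response. This optional step assumes salary is the response variable, as used later in the example:

head(adultdata,3)           % Display the first three rows of the table
tabulate(adultdata.salary)  % Frequency of each response class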

Some variable names in the adultdata table contain the _ character. Replace instances of _ with a space.

VarNames = strrep(VarNames,'_',' ');

Some categorical variables have many levels. Plot the number of levels of each categorical predictor.

isCat = ~varfun(@isnumeric,adultdata(:,1:end - 1),...
    'OutputFormat','uniform'); % Logical flags for categorical variables
catVars = find(isCat);         % Indices of categorical variables

countCats = @(var)numel(categories(nominal(var)));
numCat = varfun(countCats,adultdata(:,catVars),...
    'OutputFormat','uniform'); % Number of levels per categorical predictor

figure
barh(numCat);
h = gca;
h.YTickLabel = VarNames(catVars);
ylabel 'Predictor'
xlabel 'Number of categories'

The anonymous function countCats converts a predictor to a nominal array and then counts the unique, nonempty categories of the predictor. Predictor 14 ('native country') has more than 40 categorical levels. For binary classification, fitctree uses a computational shortcut to find an optimal split for categorical predictors with many categories. For classification with more than two classes, you can choose a heuristic algorithm to find a good split. For details, see Splitting Categorical Predictors in Classification Trees.
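To inspect the widest predictor directly, you can list its levels using the same nominal conversion as countCats. This optional check assumes the underlying table variable is named native_country (its name before the underscore replacement):

countryCats = categories(nominal(adultdata.native_country));
numel(countryCats)  % Number of levels, more than 40
countryCats(1:5)    % A few of the level names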

Create the predictor matrix using classreg.regr.modelutils.predictormatrix, and create the response vector.

X = classreg.regr.modelutils.predictormatrix(adultdata,'ResponseVar',...
    size(adultdata,2));
Y = nominal(adultdata.salary);

X is a numeric matrix; predictormatrix converts all categorical variables into group indices. The ResponseVar name-value pair argument indicates that the last column of adultdata is the response variable and excludes that column from the predictor matrix. Y is a nominal array.
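Note that classreg.regr.modelutils.predictormatrix is an internal utility, so its behavior can change between releases. Because fitcensemble also accepts tables and detects categorical predictors itself, a minimal table-based alternative looks like this (a sketch, not a verbatim replacement for the workflow below):

% Sketch: train directly on the table; fitcensemble treats
% nonnumeric variables as categorical predictors.
TblEnsemble = fitcensemble(adultdata,'salary','Method','LogitBoost',...
    'NumLearningCycles',300,'Learners','tree');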

Train classification ensembles using both LogitBoost and GentleBoost.

rng(1); % For reproducibility
LBEnsemble = fitcensemble(X,Y,'Method','LogitBoost', ...
    'NumLearningCycles',300,'Learners','tree',...
    'CategoricalPredictors',isCat,'PredictorNames',VarNames(1:end-1),...
    'ResponseName','income');
GBEnsemble = fitcensemble(X,Y,'Method','GentleBoost', ...
    'NumLearningCycles',300,'Learners','tree',...
    'CategoricalPredictors',isCat,'PredictorNames',VarNames(1:end-1),...
    'ResponseName','income');
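By default, boosting grows shallow trees. If you want explicit control over the depth of the weak learners, you can pass a tree template instead of the 'tree' keyword. In this sketch, the MaxNumSplits value of 10 is an arbitrary choice for illustration:

t = templateTree('MaxNumSplits',10); % Limit each weak learner to 10 splits
LBEnsembleShallow = fitcensemble(X,Y,'Method','LogitBoost',...
    'NumLearningCycles',300,'Learners',t,...
    'CategoricalPredictors',isCat,'PredictorNames',VarNames(1:end-1),...
    'ResponseName','income');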

Examine the resubstitution error for both ensembles.

figure
plot(resubLoss(LBEnsemble,'Mode','cumulative'))
hold on
plot(resubLoss(GBEnsemble,'Mode','cumulative'),'r--')
hold off
xlabel('Number of trees')
ylabel('Resubstitution error')
legend('LogitBoost','GentleBoost','Location','NE')

For both ensembles, the resubstitution error continues to decrease as the number of trees increases.
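To compare the ensembles numerically rather than visually, you can also evaluate the resubstitution error of the full 300-tree ensembles (an optional check):

finalLossLB = resubLoss(LBEnsemble) % Resubstitution error using all trees
finalLossGB = resubLoss(GBEnsemble)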

Estimate the generalization error for both algorithms by using 5-fold cross-validation.

CVLBEnsemble = crossval(LBEnsemble,'KFold',5);
CVGBEnsemble = crossval(GBEnsemble,'KFold',5);
figure
plot(kfoldLoss(CVLBEnsemble,'Mode','cumulative'))
hold on
plot(kfoldLoss(CVGBEnsemble,'Mode','cumulative'),'r--')
hold off
xlabel('Number of trees')
ylabel('Cross-validated error')
legend('LogitBoost','GentleBoost','Location','NE')

As the number of trees increases toward 300, the cross-validated loss decreases up to a point and then begins to increase. A practical guideline for choosing the ensemble size is to stop adding trees when the validation error stops improving and starts to increase, which happens after about 50 trees for LogitBoost and about 25 trees for GentleBoost.
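One way to apply this guideline programmatically is to find the minimum of the cumulative cross-validated loss and keep only that many trees. The following sketch shrinks the LogitBoost ensemble; removeLearners operates on the compact version of the ensemble:

lossLB = kfoldLoss(CVLBEnsemble,'Mode','cumulative');
[minLossLB,bestNumTrees] = min(lossLB); % Trees at the minimum CV loss

% Keep only the first bestNumTrees weak learners
CompactLB = removeLearners(compact(LBEnsemble),...
    bestNumTrees+1:LBEnsemble.NumTrained);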