MATLAB Examples

# Train Ensemble With Unequal Classification Costs

This example shows how to train an ensemble of classification trees with unequal classification costs. This example uses data on patients with hepatitis to see if they live or die as a result of the disease. The data set is described at UCI Machine Learning Data Repository.

Read the hepatitis data set from the UCI repository as a character array. Then convert the result to a cell array of character vectors using textscan. Specify a cell array of character vectors containing the variable names.

```hepatitis = textscan(urlread(['http://archive.ics.uci.edu/ml/' ... 'machine-learning-databases/hepatitis/hepatitis.data']),... '%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f','TreatAsEmpty','?',... 'Delimiter',','); size(hepatitis) VarNames = {'dieOrLive' 'age' 'sex' 'steroid' 'antivirals' 'fatigue' ... 'malaise' 'anorexia' 'liverBig' 'liverFirm' 'spleen' ... 'spiders' 'ascites' 'varices' 'bilirubin' 'alkPhosphate' 'sgot' ... 'albumin' 'protime' 'histology'}; ```
```ans = 1 20 ```

hepatitis is a 1-by-20 cell array of character vectors. The cells correspond to the response (liveOrDie) and 19 heterogeneous predictors.

Specify a numeric matrix containing the predictors and a cell vector containing 'Die' and 'Live', which are response categories. The response contains two values: 1 indicates that a patient died, and 2 indicates that a patient lived. Specify a cell array of character vectors for the response using the response categories. The first variable in hepatitis contains the response.

```X = cell2mat(hepatitis(2:end)); ClassNames = {'Die' 'Live'}; Y = ClassNames(hepatitis{:,1}); ```

X is a numeric matrix containing the 19 predictors. Y is a cell array of character vectors containing the response.

Inspect the data for missing values.

```figure; barh(sum(isnan(X),1)/size(X,1)); h = gca; h.YTick = 1:numel(VarNames) - 1; h.YTickLabel = VarNames(2:end); ylabel 'Predictor'; xlabel 'Fraction of missing values'; ```

Most predictors have missing values, and one has nearly 45% of the missing values. Therefore, use decision trees with surrogate splits for better accuracy. Because the data set is small, training time with surrogate splits should be tolerable.

Create a classification tree template that uses surrogate splits.

```rng(0,'twister') % for reproducibility t = templateTree('surrogate','all'); ```

Examine the data or the description of the data to see which predictors are categorical.

```X(1:5,:) ```
```ans = Columns 1 through 7 30.0000 2.0000 1.0000 2.0000 2.0000 2.0000 2.0000 50.0000 1.0000 1.0000 2.0000 1.0000 2.0000 2.0000 78.0000 1.0000 2.0000 2.0000 1.0000 2.0000 2.0000 31.0000 1.0000 NaN 1.0000 2.0000 2.0000 2.0000 34.0000 1.0000 2.0000 2.0000 2.0000 2.0000 2.0000 Columns 8 through 14 1.0000 2.0000 2.0000 2.0000 2.0000 2.0000 1.0000 1.0000 2.0000 2.0000 2.0000 2.0000 2.0000 0.9000 2.0000 2.0000 2.0000 2.0000 2.0000 2.0000 0.7000 2.0000 2.0000 2.0000 2.0000 2.0000 2.0000 0.7000 2.0000 2.0000 2.0000 2.0000 2.0000 2.0000 1.0000 Columns 15 through 19 85.0000 18.0000 4.0000 NaN 1.0000 135.0000 42.0000 3.5000 NaN 1.0000 96.0000 32.0000 4.0000 NaN 1.0000 46.0000 52.0000 4.0000 80.0000 1.0000 NaN 200.0000 4.0000 NaN 1.0000 ```

It appears that predictors 2 through 13 are categorical, as well as predictor 19. You can confirm this inference using the data set description at UCI Machine Learning Data Repository.

List the categorical variables.

```catIdx = [2:13,19]; ```

Create a cross-validated ensemble using 150 learners and the GentleBoost algorithm.

```Ensemble = fitcensemble(X,Y,'Method','GentleBoost', ... 'NumLearningCycles',150,'Learners',t,'PredictorNames',VarNames(2:end), ... 'LearnRate',0.1,'CategoricalPredictors',catIdx,'KFold',5); figure; plot(kfoldLoss(Ensemble,'Mode','cumulative','LossFun','exponential')); xlabel('Number of trees'); ylabel('Cross-validated exponential loss'); ```

Inspect the confusion matrix to see which patients the ensemble predicts correctly.

```[yFit,sFit] = kfoldPredict(Ensemble); confusionmat(Y,yFit,'Order',ClassNames) ```
```ans = 19 13 11 112 ```

Of the 123 patient who live, the ensemble predicts correctly that 112 will live. But for the 32 patients who die of hepatitis, the ensemble only predicts correctly that about half will die of hepatitis.

There are two types of error in the predictions of the ensemble:

• Predicting that the patient lives, but the patient dies
• Predicting that the patient dies, but the patient lives

Suppose you believe that the first error is five times worse than the second. Create a new classification cost matrix that reflects this belief.

```cost.ClassNames = ClassNames; cost.ClassificationCosts = [0 5; 1 0]; ```

Create a new cross-validated ensemble using cost as the misclassification cost, and inspect the resulting confusion matrix.

```EnsembleCost = fitcensemble(X,Y,'Method','GentleBoost', ... 'NumLearningCycles',150,'Learners',t,'PredictorNames',VarNames(2:end), ... 'LearnRate',0.1,'CategoricalPredictors',catIdx,'KFold',5,'Cost',cost); [yFitCost,sFitCost] = kfoldPredict(EnsembleCost); confusionmat(Y,yFitCost,'Order',ClassNames) ```
```ans = 20 12 8 115 ```

As expected, the new ensemble does a better job classifying the patients who die. Somewhat surprisingly, the new ensemble also does a better job classifying the patients who live, though the result is not statistically significantly better. The results of the cross validation are random, so this result is simply a statistical fluctuation. The result seems to indicate that the classification of patients who live is not very sensitive to the cost.