# loss

Classification error

## Syntax

```
L = loss(tree,X,Y)
L = loss(tree,X,Y,Name,Value)
L = loss(tree,X,Y,'Subtrees',subtreevector)
[L,se] = loss(tree,X,Y,'Subtrees',subtreevector)
[L,se,NLeaf] = loss(tree,X,Y,'Subtrees',subtreevector)
[L,se,NLeaf,bestlevel] = loss(tree,X,Y,'Subtrees',subtreevector)
[L,...] = loss(tree,X,Y,'Subtrees',subtreevector,Name,Value)
```

## Description

`L = loss(tree,X,Y)` returns a scalar representing how well `tree` classifies the data in `X`, when `Y` contains the true classifications.

When computing the loss, `loss` normalizes the class probabilities in `Y` to the class probabilities used for training, stored in the `Prior` property of `tree`.

`L = loss(tree,X,Y,Name,Value)` returns the loss with additional options specified by one or more `Name,Value` pair arguments.

`L = loss(tree,X,Y,'Subtrees',subtreevector)` returns a vector of classification errors for the trees in the pruning sequence `subtreevector`.

`[L,se] = loss(tree,X,Y,'Subtrees',subtreevector)` returns the vector of standard errors of the classification errors.

Note: `loss` returns `se` and further outputs only when the `LossFun` name-value pair is the default `'classiferror'`.

`[L,se,NLeaf] = loss(tree,X,Y,'Subtrees',subtreevector)` returns the vector of numbers of leaf nodes in the trees of the pruning sequence.

`[L,se,NLeaf,bestlevel] = loss(tree,X,Y,'Subtrees',subtreevector)` returns the best pruning level as defined in the `TreeSize` name-value pair. By default, `bestlevel` is the pruning level that gives loss within one standard deviation of minimal loss.

`[L,...] = loss(tree,X,Y,'Subtrees',subtreevector,Name,Value)` returns loss statistics with additional options specified by one or more `Name,Value` pair arguments.
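As a rough illustration of the multi-output syntax, the following sketch evaluates every pruning level of a tree grown on the `ionosphere` data set. It assumes that `fitctree` computes the pruning sequence by default, and it sets `'LossFun','classiferror'` explicitly because the note above ties the extra outputs to that loss. The variable name `summary` is arbitrary.

```matlab
load ionosphere                 % provides predictors X and labels Y
tree = fitctree(X,Y);           % pruning sequence assumed to be computed by default

% Evaluate the loss on every subtree in the pruning sequence
[L,se,NLeaf,bestlevel] = loss(tree,X,Y, ...
    'Subtrees','all','LossFun','classiferror');

% One row per pruning level: level, loss, standard error, number of leaves
summary = [(0:numel(L)-1)' L(:) se(:) NLeaf(:)]
```

Each row of `summary` corresponds to one pruning level, starting at level 0 (the unpruned tree), and `bestlevel` is a single level chosen by the `TreeSize` rule.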

## Input Arguments

• `tree`: A classification tree or compact classification tree constructed by `fitctree` or `compact`.

• `X`: Matrix of data to classify. Each row of `X` represents one observation, and each column represents one predictor. `X` must have the same number of columns as the data used to train `tree`. `X` should have the same number of rows as the number of elements in `Y`.

• `Y`: Classification of `X`. `Y` should be of the same type as the classification used to train `tree`, and its number of elements should equal the number of rows of `X`.

### Name-Value Pair Arguments

Specify optional comma-separated pairs of `Name,Value` arguments. `Name` is the argument name and `Value` is the corresponding value. `Name` must appear inside single quotes (`' '`). You can specify several name and value pair arguments in any order as `Name1,Value1,...,NameN,ValueN`.

• `'LossFun'`: Function handle or string representing a loss function. Built-in loss functions:

  • `'binodeviance'`: See Loss Functions.

  • `'classiferror'`: Fraction of misclassified observations. See Loss Functions.

  • `'exponential'`: See Loss Functions.

  • `'hinge'`: See Loss Functions.

  • `'mincost'`: Smallest misclassification cost as given by the `tree.Cost` matrix. See Loss Functions.

  You can write your own loss function in the syntax described in Loss Functions.

  Default: `'mincost'`

• `'Weights'`: A numeric vector of length `N`, where `N` is the number of rows of `X`. `Weights` are nonnegative. `loss` normalizes the weights so that observation weights in each class sum to the prior probability of that class. When you supply `Weights`, `loss` computes weighted classification loss.

  Default: `ones(N,1)`

`Name,Value` arguments associated with pruning subtrees:

• `'Subtrees'`: A vector of nonnegative integers in ascending order or `'all'`.

  If you specify a vector, then all elements must be at least `0` and at most `max(tree.PruneList)`. `0` indicates the full, unpruned tree and `max(tree.PruneList)` indicates the completely pruned tree (i.e., just the root node).

  If you specify `'all'`, then `CompactClassificationTree.loss` operates on all subtrees (i.e., the entire pruning sequence). This specification is equivalent to using `0:max(tree.PruneList)`.

  `CompactClassificationTree.loss` prunes `tree` to each level indicated in `Subtrees`, and then estimates the corresponding output arguments. The size of `Subtrees` determines the size of some output arguments.

  To invoke `Subtrees`, the properties `PruneList` and `PruneAlpha` of `tree` must be nonempty. In other words, grow `tree` by setting `'Prune','on'`, or by pruning `tree` using `prune`.

  Default: `0`

• `'TreeSize'`: One of the following strings:

  • `'se'`: `loss` returns the highest pruning level with loss within one standard deviation of the minimum (`L`+`se`, where `L` and `se` relate to the smallest value in `Subtrees`).

  • `'min'`: `loss` returns the element of `Subtrees` with smallest loss, usually the smallest element of `Subtrees`.

  Default: `'se'`
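The name-value pairs can be combined in a single call. The sketch below is only an illustration: the weight vector `w` is a hypothetical choice (the first half of the rows counts twice as much as the rest), and `'LossFun','classiferror'` is passed explicitly wherever more than one output is requested.

```matlab
load ionosphere                 % provides predictors X and labels Y
tree = fitctree(X,Y);
N = size(X,1);

% Hypothetical weights: the first half of the rows counts double
w = ones(N,1);
w(1:floor(N/2)) = 2;

% Weighted fraction of misclassified observations
Lw = loss(tree,X,Y,'LossFun','classiferror','Weights',w);

% Pick the pruning level with the smallest loss instead of using the 'se' rule
[~,~,~,bestMin] = loss(tree,X,Y,'Subtrees','all', ...
    'LossFun','classiferror','TreeSize','min');
```

Here `Lw` is a scalar because `Subtrees` keeps its default of `0`, while `bestMin` is one of the levels in `0:max(tree.PruneList)`.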

## Output Arguments

• `L`: Classification error, a vector the length of `Subtrees`. The meaning of the error depends on the values in `Weights` and `LossFun`; see Classification Error.

• `se`: Standard error of loss, a vector the length of `Subtrees`.

• `NLeaf`: Number of leaves (terminal nodes) in the pruned subtrees, a vector the length of `Subtrees`.

• `bestlevel`: A scalar whose value depends on `TreeSize`:

  • `TreeSize` = `'se'`: `loss` returns the highest pruning level with loss within one standard deviation of the minimum (`L`+`se`, where `L` and `se` relate to the smallest value in `Subtrees`).

  • `TreeSize` = `'min'`: `loss` returns the element of `Subtrees` with smallest loss, usually the smallest element of `Subtrees`.

## Definitions

### Classification Error

The default classification error is the fraction of data `X` that `tree` misclassifies, where `Y` represents the true classifications.

Weighted classification error is the sum of weight i times the Boolean value that is `1` when `tree` misclassifies the ith row of `X`, divided by the sum of the weights.
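Written as a formula, the weighted classification error described above is as follows. The symbols are introduced here only for illustration: $w_i$ is the weight of observation i, $y_i$ is its true class, $\hat{y}_i$ is the class that `tree` predicts for the ith row of `X`, and $I\{\cdot\}$ equals 1 when its argument is true and 0 otherwise.

$L=\frac{\sum_{i=1}^{N} w_i\, I\left\{\hat{y}_i \ne y_i\right\}}{\sum_{i=1}^{N} w_i}.$

With equal weights this reduces to the plain fraction of misclassified observations.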

### Loss Functions

The built-in loss functions are:

• `'binodeviance'` — For binary classification, assume the classes yn are `-1` and `1`. With weight vector w normalized to have sum `1`, and predictions of row n of data X as f(Xn), the binomial deviance is

$\sum {w}_{n}\mathrm{log}\left(1+\mathrm{exp}\left(-2{y}_{n}f\left({X}_{n}\right)\right)\right).$

• `'exponential'` — With the same definitions as for `'binodeviance'`, the exponential loss is

$\sum {w}_{n}\mathrm{exp}\left(-{y}_{n}f\left({X}_{n}\right)\right).$

• `'classiferror'` — Predict the label with the largest posterior probability. The loss is then the fraction of misclassified observations.

• `'hinge'` — Classification error measure that has the form

$L=\frac{\sum _{j=1}^{n}{w}_{j}\mathrm{max}\left\{0,1-{y}_{j}\prime f\left({X}_{j}\right)\right\}}{\sum _{j=1}^{n}{w}_{j}},$

where:

• wj is weight j.

• For binary classification, yj = 1 for the positive class and -1 for the negative class. For problems where the number of classes K ≥ 3, yj is a vector of 0s, but with a 1 in the position corresponding to the true class, e.g., if the second observation is in the third class and K = 4, then y2 = [0 0 1 0]′.

• $f\left({X}_{j}\right)$ is, for binary classification, the posterior probability or, for K ≥ 3, a vector of posterior probabilities for each class given observation j.

• `'mincost'` — Predict the label with the smallest expected misclassification cost, with expectation taken over the posterior probability, and cost as given by the `Cost` property of the classifier (a matrix). The loss is then the true misclassification cost averaged over the observations.

To write your own loss function, create a function file in this form:

`function loss = lossfun(C,S,W,COST)`

• `N` is the number of rows of `X`.

• `K` is the number of classes in the classifier, represented in the `ClassNames` property.

• `C` is an `N`-by-`K` logical matrix, with one `true` per row for the true class. The index for each class is its position in the `ClassNames` property.

• `S` is an `N`-by-`K` numeric matrix. `S` is a matrix of posterior probabilities for classes with one row per observation, similar to the `posterior` output from `predict`.

• `W` is a numeric vector with `N` elements, the observation weights. If you pass `W`, the elements are normalized to sum to the prior probabilities in the respective classes.

• `COST` is a `K`-by-`K` numeric matrix of misclassification costs. For example, you can use `COST = ones(K) - eye(K)`, which means a cost of `0` for correct classification, and `1` for misclassification.

• The output `loss` should be a scalar.

Pass the function handle `@lossfun` as the value of the `LossFun` name-value pair.
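A minimal sketch of such a function file follows. It is a hypothetical example rather than a built-in: it computes the weighted fraction of misclassified observations and ignores `COST`. Save it as `lossfun.m` on the MATLAB path.

```matlab
function loss = lossfun(C,S,W,COST) %#ok<INUSD>
% LOSSFUN  Hypothetical custom loss: weighted misclassification rate.
%   C, S, W, and COST follow the definitions listed above. COST is unused.
[~,predicted] = max(S,[],2);   % column index of the largest posterior
[~,actual]    = max(C,[],2);   % column index of the single true entry
W = W(:);                      % force a column vector
loss = sum(W .* (predicted ~= actual)) / sum(W);
end
```

With this file in place, a call such as `L = loss(tree,X,Y,'LossFun',@lossfun)` evaluates the custom loss on `tree`.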

### True Misclassification Cost

There are two costs associated with classification: the true misclassification cost per class, and the expected misclassification cost per observation.

You can set the true misclassification cost per class in the `Cost` name-value pair when you create the classifier using the `fitctree` method. `Cost(i,j)` is the cost of classifying an observation into class `j` if its true class is `i`. By default, `Cost(i,j)=1` if `i~=j`, and `Cost(i,j)=0` if `i=j`. In other words, the cost is `0` for correct classification, and `1` for incorrect classification.
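For instance, the following sketch trains a tree on the `ionosphere` data with an asymmetric cost matrix. The cost values are hypothetical and chosen only to show the layout: rows index the true class and columns index the predicted class, both in the order given by `'ClassNames'`.

```matlab
load ionosphere                       % class labels are 'b' and 'g'

% Hypothetical costs: classifying a true 'b' as 'g' costs 2,
% classifying a true 'g' as 'b' costs 1
C = [0 2; 1 0];
tree = fitctree(X,Y,'ClassNames',{'b','g'},'Cost',C);

tree.Cost                             % cost matrix stored with the classifier
L = loss(tree,X,Y,'LossFun','mincost')
```

Because the costs are no longer 0 and 1, the `'mincost'` loss generally differs from the plain fraction of misclassified observations.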

### Expected Misclassification Cost

Suppose you have `Nobs` observations that you want to classify with a trained classifier. Suppose you have `K` classes. You place the observations into a matrix `Xnew` with one observation per row.

The expected cost matrix `CE` has size `Nobs`-by-`K`. Each row of `CE` contains the expected (average) cost of classifying the observation into each of the `K` classes. `CE(n,k)` is

$\sum _{i=1}^{K}\hat{P}\left(i|Xnew\left(n\right)\right)C\left(k|i\right),$

where

• K is the number of classes.

• $\hat{P}\left(i|Xnew\left(n\right)\right)$ is the posterior probability of class i for observation Xnew(n).

• $C\left(k|i\right)$ is the true misclassification cost of classifying an observation as k when its true class is i.
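The sum above is a matrix product of the posterior probabilities and the cost matrix, since C(k|i) corresponds to `Cost(i,k)`. The following sketch assumes a tree trained on Fisher's iris data with the default cost. The choice of rows for `Xnew` is arbitrary.

```matlab
load fisheriris
tree = fitctree(meas,species);
Xnew = meas([1 51 101],:);            % three observations, one per row

[~,posterior] = predict(tree,Xnew);   % Nobs-by-K posterior probabilities
CE = posterior * tree.Cost;           % CE(n,k) = sum over i of P(i|Xnew(n))*Cost(i,k)

[~,k] = min(CE,[],2);                 % class with the smallest expected cost
labels = tree.ClassNames(k)
```

With the default cost matrix, `CE(n,k)` equals one minus the posterior probability of class k, so the class with the smallest expected cost is also the class with the largest posterior probability.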

### Score (tree)

For trees, the score of a classification of a leaf node is the posterior probability of the classification at that node. The posterior probability of the classification at a node is the number of training sequences that lead to that node with the classification, divided by the number of training sequences that lead to that node.

For example, consider classifying a predictor `X` as `true` when `X` < `0.15` or `X` > `0.95`, and as `false` otherwise.

Generate 100 random points and classify them:

```
rng(0,'twister') % for reproducibility
X = rand(100,1);
Y = (abs(X - .55) > .4);
tree = fitctree(X,Y);
view(tree,'Mode','Graph')
```

Prune the tree:

```
tree1 = prune(tree,'Level',1);
view(tree1,'Mode','Graph')
```

The pruned tree correctly classifies observations that are less than 0.15 as `true`. It also correctly classifies observations from .15 to .94 as `false`. However, it incorrectly classifies observations that are greater than .94 as `false`. Therefore, the score for observations that are greater than .15 should be about .05/.85=.06 for `true`, and about .8/.85=.94 for `false`.

Compute the prediction scores for the first 10 rows of `X`:

```
[~,score] = predict(tree1,X(1:10));
[score X(1:10,:)]
```

```
ans =

    0.9059    0.0941    0.8147
    0.9059    0.0941    0.9058
         0    1.0000    0.1270
    0.9059    0.0941    0.9134
    0.9059    0.0941    0.6324
         0    1.0000    0.0975
    0.9059    0.0941    0.2785
    0.9059    0.0941    0.5469
    0.9059    0.0941    0.9575
    0.9059    0.0941    0.9649
```

Indeed, every value of `X` (the right-most column) that is less than 0.15 has associated scores (the left and center columns) of `0` and `1`, while the other values of `X` have associated scores of `0.91` and `0.09`. The difference (score `0.09` instead of the expected `.06`) is due to a statistical fluctuation: there are `8` observations in `X` in the range `(.95,1)` instead of the expected `5` observations.

## Examples

### Compute the In-sample Classification Error

Compute the resubstituted classification error for the `ionosphere` data set.

```
load ionosphere
tree = fitctree(X,Y);
L = loss(tree,X,Y)
```

```
L =

    0.0114
```

### Examine the Classification Error for Each Subtree

Unpruned decision trees tend to overfit. One way to balance model complexity and out-of-sample performance is to prune a tree (or restrict its growth) so that in-sample and out-of-sample performance are satisfactory.

Load Fisher's iris data set. Partition the data into training (50%) and validation (50%) sets.

```
load fisheriris
n = size(meas,1);
rng(1) % For reproducibility
idxTrn = false(n,1);
idxTrn(randsample(n,round(0.5*n))) = true; % Training set logical indices
idxVal = idxTrn == false;                  % Validation set logical indices
```

Grow a classification tree using the training set.

```Mdl = fitctree(meas(idxTrn,:),species(idxTrn)); ```

View the classification tree.

```view(Mdl,'Mode','graph'); ```

The classification tree has four pruning levels. Level 0 is the full, unpruned tree (as displayed). Level 3 is just the root node (i.e., no splits).

Examine the training sample classification error for each subtree (or pruning level) excluding the highest level.

```
m = max(Mdl.PruneList) - 1;
trnLoss = resubLoss(Mdl,'SubTrees',0:m)
```

```
trnLoss =

    0.0267
    0.0533
    0.3067
```

• The full, unpruned tree misclassifies about 2.7% of the training observations.

• The tree pruned to level 1 misclassifies about 5.3% of the training observations.

• The tree pruned to level 2 (i.e., a stump) misclassifies about 30.7% of the training observations.

Examine the validation sample classification error at each level excluding the highest level.

```
valLoss = loss(Mdl,meas(idxVal,:),species(idxVal),'SubTrees',0:m)
```

```
valLoss =

    0.0369
    0.0237
    0.3067
```

• The full, unpruned tree misclassifies about 3.7% of the validation observations.

• The tree pruned to level 1 misclassifies about 2.4% of the validation observations.

• The tree pruned to level 2 (i.e., a stump) misclassifies about 30.7% of the validation observations.

To balance model complexity and out-of-sample performance, consider pruning `Mdl` to level 1.

```
pruneMdl = prune(Mdl,'Level',1);
view(pruneMdl,'Mode','graph')
```
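Instead of reading the level off the printed losses, the `bestlevel` output can select it. The sketch below repeats the setup from this example so that it runs on its own. It requests `'LossFun','classiferror'` explicitly and relies on the default `'TreeSize','se'` rule, so the selected level may differ from the one chosen by inspection above.

```matlab
load fisheriris
n = size(meas,1);
rng(1)                                       % same seed as in the example
idxTrn = false(n,1);
idxTrn(randsample(n,round(0.5*n))) = true;   % training set logical indices
idxVal = ~idxTrn;                            % validation set logical indices
Mdl = fitctree(meas(idxTrn,:),species(idxTrn));

% Validation loss, standard error, and leaf count for every pruning level
[valLoss,valSE,nLeaf,best] = loss(Mdl,meas(idxVal,:),species(idxVal), ...
    'Subtrees','all','LossFun','classiferror');

% Prune to the level suggested by the validation data
pruneMdl = prune(Mdl,'Level',best);
```

Passing `'TreeSize','min'` in the same call would instead pick the level with the smallest validation loss.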