Classification loss functions measure the predictive
inaccuracy of classification models. When you compare the same type of loss among many
models, a lower loss indicates a better predictive model.
Consider the following scenario.
L is the weighted average classification loss.
n is the sample size.
For binary classification:
y_{j} is the observed class label. The software codes it as –1 or 1, indicating the negative or positive class (or the first or second class in the ClassNames property), respectively.
f(X_{j}) is the positive-class classification score for observation (row) j of the predictor data X.
m_{j} = y_{j}f(X_{j}) is the classification score for classifying observation j into the class corresponding to y_{j}. Positive values of m_{j} indicate correct classification and do not contribute much to the average loss. Negative values of m_{j} indicate incorrect classification and contribute significantly to the average loss.
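As a minimal sketch of the binary-classification definitions above, the following Python snippet computes m_{j} = y_{j}f(X_{j}) for a few observations. The labels and scores are made-up illustration values, and the variable names are hypothetical, not part of the software.

```python
# Observed labels, coded as -1 (negative class) or +1 (positive class).
y = [1, -1, 1, -1]

# Hypothetical positive-class scores f(X_j), one per observation.
f = [2.0, -0.5, -1.5, 0.25]

# m_j = y_j * f(X_j): the score for the class that was actually observed.
m = [yj * fj for yj, fj in zip(y, f)]

# A positive m_j means the sign of the score agrees with the observed label.
correct = [mj > 0 for mj in m]

print(m)        # [2.0, 0.5, -1.5, -0.25]
print(correct)  # [True, True, False, False]
```

The last two observations have negative m_{j}, so they are the ones that contribute most to any of the margin-based losses.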
For algorithms that support multiclass classification (that is, K ≥ 3):
y_{j}^{*} is a vector of length K that contains K – 1 zeros and a 1 in the position corresponding to the true, observed class y_{j}. For example, if the true class of the second observation is the third class and K = 4, then y_{2}^{*} = [0 0 1 0]′. The order of the classes corresponds to the order in the ClassNames property of the input model.
f(X_{j}) is the length K vector of class scores for observation j of the predictor data X. The order of the scores corresponds to the order of the classes in the ClassNames property of the input model.
m_{j} = y_{j}^{*}′f(X_{j}). Therefore, m_{j} is the scalar classification score that the model predicts for the true, observed class.
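The multiclass definitions can be illustrated with a short Python sketch. The helper names `one_hot` and `margin` and the score values are hypothetical. Python indexes from 0, so the third class is index 2.

```python
def one_hot(true_class_index, num_classes):
    """Return y*: K - 1 zeros and a single 1 at the true class (0-based index)."""
    return [1 if k == true_class_index else 0 for k in range(num_classes)]


def margin(y_star, scores):
    """Return m_j = y*' f(X_j), which picks out the score of the true class."""
    return sum(yk * fk for yk, fk in zip(y_star, scores))


# K = 4 and the true class is the third class, as in the example above.
y_star = one_hot(2, 4)

# Hypothetical class scores f(X_j), in the same order as the classes.
scores = [0.1, 0.2, 0.6, 0.1]

m = margin(y_star, scores)
print(y_star)  # [0, 0, 1, 0]
print(m)       # 0.6
```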
The weight for observation j is w_{j}. The software normalizes the observation weights so that they sum to the corresponding prior class probability. The software also normalizes the prior probabilities so that they sum to 1. Therefore,
$$\sum _{j=1}^{n}{w}_{j}=1.$$
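One way to picture this normalization is the Python sketch below. It is not the software's implementation. The function name `normalize_weights` and the data are hypothetical, and the sketch assumes every class that has a prior also appears at least once in the data.

```python
def normalize_weights(weights, labels, priors):
    """Rescale weights so that, within each class, they sum to that class's
    normalized prior probability (the priors are first rescaled to sum to 1)."""
    prior_total = sum(priors.values())
    norm_priors = {c: p / prior_total for c, p in priors.items()}

    # Total raw weight observed in each class.
    class_totals = {}
    for w, c in zip(weights, labels):
        class_totals[c] = class_totals.get(c, 0.0) + w

    return [w * norm_priors[c] / class_totals[c] for w, c in zip(weights, labels)]


labels = [1, 1, -1, -1, -1]
weights = [1.0, 3.0, 2.0, 2.0, 4.0]
priors = {1: 2.0, -1: 6.0}  # unnormalized; they become 0.25 and 0.75

w = normalize_weights(weights, labels, priors)
print(w)       # [0.0625, 0.1875, 0.1875, 0.1875, 0.375]
print(sum(w))  # 1.0
```

Because each class's weights sum to its prior and the priors sum to 1, the weights over all n observations sum to 1.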
Given this scenario, the following table describes the supported loss functions that you can specify by using the 'LossFun' name-value pair argument.
Loss Function | Value of LossFun | Equation |
---|---|---|
Binomial deviance | 'binodeviance' | $$L={\displaystyle \sum _{j=1}^{n}{w}_{j}\mathrm{log}\left\{1+\mathrm{exp}\left[-2{m}_{j}\right]\right\}}.$$ |
Misclassified rate in decimal | 'classiferror' | $$L={\displaystyle \sum _{j=1}^{n}{w}_{j}}I\left\{{\widehat{y}}_{j}\ne {y}_{j}\right\}.$$ $${\widehat{y}}_{j}$$ is the class label corresponding to the class with the maximal score. I{·} is the indicator function. |
Cross-entropy loss | 'crossentropy' | 'crossentropy' is appropriate only for neural network models. The weighted cross-entropy loss is $$L=-{\displaystyle \sum _{j=1}^{n}\frac{{\tilde{w}}_{j}\mathrm{log}\left({m}_{j}\right)}{Kn}},$$ where the weights $${\tilde{w}}_{j}$$ are normalized to sum to n instead of 1. |
Exponential loss | 'exponential' | $$L={\displaystyle \sum _{j=1}^{n}{w}_{j}\mathrm{exp}\left(-{m}_{j}\right)}.$$ |
Hinge loss | 'hinge' | $$L={\displaystyle \sum _{j=1}^{n}{w}_{j}\mathrm{max}\left\{0,1-{m}_{j}\right\}}.$$ |
Logit loss | 'logit' | $$L={\displaystyle \sum _{j=1}^{n}{w}_{j}\mathrm{log}\left(1+\mathrm{exp}\left(-{m}_{j}\right)\right)}.$$ |
Minimal expected misclassification cost | 'mincost' | 'mincost' is appropriate only if classification scores are posterior probabilities. The software computes the weighted minimal expected classification cost using this procedure for observations j = 1,...,n. (1) Estimate the expected misclassification cost of classifying the observation X_{j} into the class k: $${\gamma }_{jk}={\left(f{\left({X}_{j}\right)}^{\prime }C\right)}_{k}.$$ f(X_{j}) is the column vector of class posterior probabilities for binary and multiclass classification for the observation X_{j}. C is the cost matrix stored in the Cost property of the model. (2) For observation j, predict the class label corresponding to the minimal expected misclassification cost: $${\widehat{y}}_{j}=\underset{k=1,...,K}{\mathrm{argmin}}\,{\gamma }_{jk}.$$ (3) Using C, identify the cost incurred (c_{j}) for making the prediction. The weighted average of the minimal expected misclassification cost loss is $$L={\displaystyle \sum _{j=1}^{n}{w}_{j}{c}_{j}}.$$ If you use the default cost matrix (whose element value is 0 for correct classification and 1 for incorrect classification), then the 'mincost' loss is equivalent to the 'classiferror' loss. |
Quadratic loss | 'quadratic' | $$L={\displaystyle \sum _{j=1}^{n}{w}_{j}{\left(1-{m}_{j}\right)}^{2}}.$$ |
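The 'mincost' procedure in the table can be sketched in Python as follows. This is an illustration under stated assumptions, not the software's implementation. The function names and sample data are hypothetical, classes are 0-based indices, and the cost matrix is assumed to have rows indexed by the true class and columns indexed by the predicted class.

```python
def expected_costs(posteriors, cost):
    """Expected cost of predicting each class k: sum_i posteriors[i] * cost[i][k].
    Assumes cost[i][k] is the cost of predicting class k when the truth is i."""
    K = len(posteriors)
    return [sum(posteriors[i] * cost[i][k] for i in range(K)) for k in range(K)]


def mincost_loss(posterior_rows, true_classes, weights, cost):
    """Weighted average of the cost incurred by minimum-expected-cost predictions."""
    loss = 0.0
    for f, y, w in zip(posterior_rows, true_classes, weights):
        gamma = expected_costs(f, cost)
        y_hat = min(range(len(gamma)), key=gamma.__getitem__)  # argmin over classes
        loss += w * cost[y][y_hat]                             # cost actually incurred
    return loss


def classiferror_loss(posterior_rows, true_classes, weights):
    """Weighted misclassification rate using the maximal-score class."""
    loss = 0.0
    for f, y, w in zip(posterior_rows, true_classes, weights):
        y_hat = max(range(len(f)), key=f.__getitem__)
        loss += w * (y_hat != y)
    return loss


# Hypothetical posterior probabilities for four observations and K = 3 classes.
posterior_rows = [
    [0.70, 0.20, 0.10],
    [0.20, 0.50, 0.30],
    [0.10, 0.30, 0.60],
    [0.40, 0.35, 0.25],
]
true_classes = [0, 2, 2, 1]
weights = [0.25, 0.25, 0.25, 0.25]  # already normalized to sum to 1

# Default cost matrix: 0 on the diagonal, 1 elsewhere.
default_cost = [[0, 1, 1], [1, 0, 1], [1, 1, 0]]

print(mincost_loss(posterior_rows, true_classes, weights, default_cost))  # 0.5
print(classiferror_loss(posterior_rows, true_classes, weights))           # 0.5
```

With the default cost matrix, the expected cost of predicting class k is 1 minus the posterior probability of class k, so the minimum-cost class is the maximum-posterior class and the two losses agree in this example.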
This figure compares the loss functions (except 'crossentropy' and 'mincost') over the score m for one observation. Some functions are normalized to pass through the point (0,1).
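To reproduce the kind of comparison the figure makes, the following Python sketch evaluates the margin-based losses from the table for a single observation. The dictionary `LOSSES` and the function `weighted_loss` are hypothetical names. Dividing by log 2 is only one plausible way to make a curve pass through (0,1), and it is an assumption about how the figure was drawn.

```python
import math

# Per-observation losses as functions of the score m, taken from the table.
LOSSES = {
    "binodeviance": lambda m: math.log(1 + math.exp(-2 * m)),
    "exponential":  lambda m: math.exp(-m),
    "hinge":        lambda m: max(0.0, 1 - m),
    "logit":        lambda m: math.log(1 + math.exp(-m)),
    "quadratic":    lambda m: (1 - m) ** 2,
}


def weighted_loss(name, margins, weights):
    """L = sum_j w_j * loss(m_j) for weights that already sum to 1."""
    return sum(w * LOSSES[name](m) for m, w in zip(margins, weights))


# Value of each loss at m = 0.
at_zero = {name: fn(0.0) for name, fn in LOSSES.items()}
for name, value in at_zero.items():
    print(f"{name:13s} {value:.4f}")

# Assumed normalization: rescale a loss so that it equals 1 at m = 0.
logit_normalized_at_zero = LOSSES["logit"](0.0) / math.log(2)
```

The exponential, hinge, and quadratic losses already equal 1 at m = 0, whereas the binomial deviance and logit losses equal log 2 there, which is why a rescaling is needed for those two curves to pass through (0,1).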