Classification loss functions
measure the predictive inaccuracy of classification models. When comparing
the same type of loss among many models, lower loss indicates a better
predictive model.
Suppose that:
L is the weighted average classification
loss.
n is the sample size.
For binary classification:
y_{j} is the
observed class label. The software codes it as –1 or 1 indicating
the negative or positive class, respectively.
f(X_{j})
is the raw classification score for observation (row) j of
the predictor data X.
m_{j} = y_{j}f(X_{j})
is the classification score for classifying observation j into
the class corresponding to y_{j}.
Positive values of m_{j} indicate
correct classification and do not contribute much to the average loss.
Negative values of m_{j} indicate
incorrect classification and contribute significantly to the average loss.
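The binary margin defined above can be illustrated with a minimal Python sketch. The function name `binary_margins` and the sample data are hypothetical and are not part of the software; the sketch only assumes labels coded as –1 or 1 and one raw score per observation.

```python
def binary_margins(labels, scores):
    """Return m_j = y_j * f(X_j) for labels coded as -1 or +1."""
    if len(labels) != len(scores):
        raise ValueError("labels and scores must have the same length")
    for y in labels:
        if y not in (-1, 1):
            raise ValueError("labels must be coded as -1 or 1")
    return [y * f for y, f in zip(labels, scores)]


# Hypothetical data: four observations with raw scores f(X_j).
labels = [1, -1, 1, -1]
scores = [2.0, -0.5, -1.5, 0.25]

margins = binary_margins(labels, scores)
# A positive margin means the sign of the score agrees with the label.
correct = [m > 0 for m in margins]
print(margins)  # [2.0, 0.5, -1.5, -0.25]
print(correct)  # [True, True, False, False]
```

The last two observations have negative margins, so they are the ones that the loss functions below penalize most.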
For algorithms that support multiclass classification
(that is, K ≥ 3):
y_{j}^{*} is
a vector of K – 1 zeros, with a 1 in the
position corresponding to the true, observed class y_{j}.
For example, if the true class of the second observation is the third
class and K = 4, then y^{*}_{2} =
[0 0 1 0]′. The order of the classes corresponds to the order
in the ClassNames
property of the input model.
f(X_{j})
is the length K vector of class scores for observation j of
the predictor data X. The order of the scores corresponds
to the order of the classes in the ClassNames
property
of the input model.
m_{j} = y_{j}^{*}′f(X_{j}).
Therefore, m_{j} is the scalar
classification score that the model predicts for the true, observed
class.
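As a rough illustration of the multiclass margin, the following Python sketch builds y_{j}^{*} and takes its inner product with the score vector. The helper names `one_hot` and `multiclass_margin` and the score values are hypothetical. Python indexes from zero, so the third class has index 2.

```python
def one_hot(true_index, num_classes):
    """Return y*: K - 1 zeros and a 1 at the position of the true class."""
    return [1 if k == true_index else 0 for k in range(num_classes)]


def multiclass_margin(true_index, class_scores):
    """Return m_j = y*' f(X_j), the score predicted for the true class."""
    y_star = one_hot(true_index, len(class_scores))
    return sum(y * s for y, s in zip(y_star, class_scores))


# The example from the text: true class is the third of K = 4 classes.
y_star_2 = one_hot(2, 4)
# Hypothetical class scores, ordered like the class names.
scores_2 = [0.1, 0.2, 0.6, 0.1]
m_2 = multiclass_margin(2, scores_2)
print(y_star_2)  # [0, 0, 1, 0]
print(m_2)       # 0.6
```

Because y_{j}^{*} contains a single 1, the inner product simply selects the score of the true class.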
The weight for observation j is w_{j}.
The software normalizes the observation weights so that they sum to
the corresponding prior class probability. The software also normalizes
the prior probabilities so they sum to 1. Therefore,
$$\sum_{j=1}^{n} w_{j} = 1.$$
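The normalization described above can be sketched as follows. This is an illustrative Python version, not the software's implementation. The function name `normalize_weights` and the data are hypothetical, and the sketch assumes that every class listed in the priors has at least one observation with positive weight.

```python
def normalize_weights(weights, labels, priors):
    """Rescale weights so that, within each class, they sum to the class prior.

    priors maps each class label to a (possibly unnormalized) prior.
    Assumes every class in priors occurs in labels with positive total weight.
    """
    prior_total = sum(priors.values())
    norm_priors = {c: p / prior_total for c, p in priors.items()}

    # Total raw weight observed in each class.
    class_totals = {}
    for w, y in zip(weights, labels):
        class_totals[y] = class_totals.get(y, 0.0) + w

    return [w * norm_priors[y] / class_totals[y] for w, y in zip(weights, labels)]


# Hypothetical data: two classes with unnormalized priors 1 and 3.
labels = ["a", "a", "b", "b", "b"]
raw_weights = [1.0, 3.0, 2.0, 2.0, 4.0]
priors = {"a": 1.0, "b": 3.0}

w = normalize_weights(raw_weights, labels, priors)
print(w)       # [0.0625, 0.1875, 0.1875, 0.1875, 0.375]
print(sum(w))  # 1.0
```

The weights of class "a" sum to 0.25 and those of class "b" sum to 0.75, so the overall sum is 1.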
The supported loss functions are:
Binomial deviance, specified using 'LossFun','binodeviance'.
Its equation is
$$L = \sum_{j=1}^{n} w_{j} \log\left\{1 + \exp\left[-2m_{j}\right]\right\}.$$
Exponential loss, specified using 'LossFun','exponential'.
Its equation is
$$L = \sum_{j=1}^{n} w_{j} \exp\left(-m_{j}\right).$$
Classification error, specified using 'LossFun','classiferror'.
It is the weighted fraction of misclassified observations, with equation
$$L = \sum_{j=1}^{n} w_{j} I\left\{{\widehat{y}}_{j} \ne y_{j}\right\}.$$
$${\widehat{y}}_{j}$$ is
the class label corresponding to the class with the maximal posterior
probability. I{x} is the indicator
function.
Hinge loss, specified using 'LossFun','hinge'.
Its equation is
$$L = \sum_{j=1}^{n} w_{j} \max\left\{0, 1 - m_{j}\right\}.$$
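The weighted classification error described above reduces to summing the weights of the misclassified observations. A minimal Python sketch, with a hypothetical function name and made-up data, assuming the weights already sum to 1:

```python
def classification_error(weights, true_labels, predicted_labels):
    """Sum the weights of observations whose predicted label is wrong."""
    return sum(
        w
        for w, y, y_hat in zip(weights, true_labels, predicted_labels)
        if y_hat != y
    )


# Hypothetical data: equal weights, one of four observations misclassified.
weights = [0.25, 0.25, 0.25, 0.25]
true_labels = [1, -1, 1, -1]
predicted_labels = [1, 1, 1, -1]

err = classification_error(weights, true_labels, predicted_labels)
print(err)  # 0.25
```

With equal weights the result is the ordinary misclassification rate. Unequal weights shift the value toward the observations that carry more weight.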
Logit loss, specified using 'LossFun','logit'.
Its equation is
$$L = \sum_{j=1}^{n} w_{j} \log\left(1 + \exp\left(-m_{j}\right)\right).$$
Minimal cost, specified using 'LossFun','mincost'.
The software computes the weighted minimal cost using this procedure
for observations j = 1,...,n:
1. Estimate the 1-by-K vector of expected
classification costs for observation j:
$$\gamma_{j} = f\left(X_{j}\right)' C.$$
f(X_{j})
is the column vector of class posterior probabilities for binary and
multiclass classification. C is the cost matrix
that the input model stores in the property Cost.
2. For observation j, predict the
class label corresponding to the minimum expected classification
cost:
$${\widehat{y}}_{j} = \underset{k=1,\ldots,K}{\arg\min}\; \gamma_{jk}.$$
3. Using C, identify the cost incurred
(c_{j}) for making the prediction.
The weighted, average, minimum cost loss is
$$L = \sum_{j=1}^{n} w_{j} c_{j}.$$
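The minimal cost procedure above can be sketched in Python as follows. This is an illustration under stated assumptions, not the software's code. The function name `min_cost_loss` and all data are hypothetical, classes are identified by zero-based index, and `cost[i][k]` is assumed to be the cost of predicting class k when the true class is i.

```python
def min_cost_loss(weights, true_indices, posteriors, cost):
    """Weighted minimal cost loss: sum of w_j * c_j over all observations.

    posteriors[j] is the list of class posterior probabilities for
    observation j. cost[i][k] is assumed to be the cost of predicting
    class k when the true class is i.
    """
    total = 0.0
    for w, y, p in zip(weights, true_indices, posteriors):
        num_classes = len(p)
        # Step 1: expected cost of predicting each class k.
        gamma = [
            sum(p[i] * cost[i][k] for i in range(num_classes))
            for k in range(num_classes)
        ]
        # Step 2: predict the class with the minimum expected cost.
        y_hat = min(range(num_classes), key=gamma.__getitem__)
        # Step 3: look up the cost actually incurred by that prediction.
        total += w * cost[y][y_hat]
    return total


# Hypothetical two-class problem in which missing class 1 is expensive.
cost = [[0.0, 1.0],
        [5.0, 0.0]]
posteriors = [[0.9, 0.1], [0.7, 0.3], [0.2, 0.8]]
true_indices = [0, 0, 1]
weights = [0.25, 0.25, 0.5]

loss = min_cost_loss(weights, true_indices, posteriors, cost)
print(loss)  # 0.25
```

The second observation shows why this loss differs from classification error: class 0 has the larger posterior (0.7), but its expected cost (1.5) exceeds that of class 1 (0.7), so class 1 is predicted and a cost of 1 is incurred.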
Quadratic loss, specified using 'LossFun','quadratic'.
Its equation is
$$L = \sum_{j=1}^{n} w_{j} \left(1 - m_{j}\right)^{2}.$$
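The margin-based losses (binomial deviance, exponential, hinge, logit, and quadratic) all take the form of a weighted sum of a per-observation function of m_{j}. The Python sketch below assumes the usual definitions of these functions, which are stated in the comments. The names `LOSSES` and `weighted_loss` are hypothetical. Only the dictionary keys mirror the 'LossFun' values.

```python
import math


def _log1pexp(x):
    """Compute log(1 + exp(x)) without overflow for large |x|."""
    if x > 0:
        return x + math.log1p(math.exp(-x))
    return math.log1p(math.exp(x))


# Per-observation loss as a function of the margin m (assumed definitions).
LOSSES = {
    "binodeviance": lambda m: _log1pexp(-2.0 * m),  # log(1 + exp(-2m))
    "exponential": lambda m: math.exp(-m),          # exp(-m)
    "hinge": lambda m: max(0.0, 1.0 - m),           # max(0, 1 - m)
    "logit": lambda m: _log1pexp(-m),               # log(1 + exp(-m))
    "quadratic": lambda m: (1.0 - m) ** 2,          # (1 - m)^2
}


def weighted_loss(name, weights, margins):
    """Return the sum over j of w_j * loss(m_j); weights should sum to 1."""
    per_observation = LOSSES[name]
    return sum(w * per_observation(m) for w, m in zip(weights, margins))


# Hypothetical data: one confident correct prediction, one on the boundary.
weights = [0.5, 0.5]
margins = [1.0, 0.0]

for name in LOSSES:
    print(name, weighted_loss(name, weights, margins))
```

For this data the hinge and quadratic losses both equal 0.5, because an observation with m = 1 contributes nothing to either and an observation with m = 0 contributes 1.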
This figure compares some of the loss functions for one observation
over m (some functions are normalized to pass through
[0,1]).