After a classification algorithm such as ClassificationNaiveBayes or TreeBagger has been trained on data, you may want to examine its performance on a specific test dataset. One common way of doing this is to compute a gross measure of performance, such as quadratic loss or accuracy, averaged over the entire test dataset.

You may want to inspect the classifier performance more closely, for example, by plotting a Receiver Operating Characteristic (ROC) curve. By definition, a ROC curve [1,2] shows the true positive rate versus the false positive rate (equivalently, sensitivity versus 1–specificity) for different thresholds of the classifier output. You can use it, for example, to find the threshold that maximizes the classification accuracy or to assess, in broader terms, how the classifier performs in the regions of high sensitivity and high specificity.
perfcurve computes measures for a plot of classifier performance. You can use this utility to evaluate classifier performance on test data after you train the classifier. Various measures such as mean squared error, classification error, or exponential loss can summarize the predictive power of a classifier in a single number. However, a performance curve offers more information, as it lets you explore the classifier performance across a range of thresholds on its output.
You can use perfcurve with any classifier or, more broadly, with any method that returns a numeric score for an instance of input data. By the convention adopted here:

A high score returned by a classifier for a given instance signifies that the instance is likely from the positive class.
A low score signifies that the instance is likely from the negative classes.

For some classifiers, you can interpret the score as the posterior probability of observing an instance of the positive class at point X. An example of such a score is the fraction of positive observations in a leaf of a decision tree. In this case, scores fall into the range from 0 to 1, and scores from the positive and negative classes add up to unity. Other methods can return scores ranging from minus to plus infinity, without any obvious mapping from the score to the posterior class probability.
perfcurve does not impose any requirements on the input score range. Because of this lack of normalization, you can use perfcurve to process scores returned by any classification, regression, or fit method. perfcurve does not make any assumptions about the nature of the input scores or the relationships between the scores for different classes. As an example, consider a problem with three classes, A, B, and C, and assume that the scores returned by some classifier for two instances are as follows:
 | A | B | C |
---|---|---|---|
instance 1 | 0.4 | 0.5 | 0.1 |
instance 2 | 0.4 | 0.1 | 0.5 |
If you want to compute a performance curve for separation of classes A and B, with C ignored, you need to address the ambiguity in selecting A over B. You could opt to use the score ratio, s(A)/s(B), or the score difference, s(A)-s(B); this choice could depend on the nature of these scores and their normalization. perfcurve always takes one score per instance. If you supply only scores for class A, perfcurve does not distinguish between instances 1 and 2, and the performance curve in this case may not be optimal. One way to handle this is shown in the sketch below.
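For example, a minimal sketch of the score-difference approach, assuming labels is an N-by-1 cell array of class names and scores is an N-by-3 matrix of classifier scores with columns ordered A, B, C (these variable names are illustrative, not perfcurve requirements):

```matlab
% Sketch: separate class A from class B, ignoring class C.
% labels is an N-by-1 cell array of class names; scores is an N-by-3 matrix
% of classifier scores with columns ordered A, B, C.
keep = ismember(labels, {'A','B'});            % drop class C observations
diffScore = scores(keep,1) - scores(keep,2);   % s(A) - s(B)
[X, Y] = perfcurve(labels(keep), diffScore, 'A');
plot(X, Y)                                     % ROC for A (positive) vs B (negative)
```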
perfcurve is intended for use with classifiers that return scores, not those that return only predicted classes. As a counterexample, consider a decision tree that returns only hard classification labels, 0 or 1, for data with two classes. In this case, the performance curve reduces to a single point, because the classified instances can be split into positive and negative categories in only one way.
For input, perfcurve takes true class labels for some data and scores assigned to these data by a classifier. By default, this utility computes a Receiver Operating Characteristic (ROC) curve and returns values of 1–specificity, or false positive rate, for X and sensitivity, or true positive rate, for Y. You can choose other criteria for X and Y by selecting one of several provided criteria or by specifying an arbitrary criterion through an anonymous function. You can display the computed performance curve using plot(X,Y).
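For example, the following sketch computes and plots a ROC curve for a trained classifier. The data set and classifier (fisheriris and fitctree) are illustrative choices only; any method that returns per-class scores works the same way.

```matlab
% Sketch: compute and plot a ROC curve for a trained classifier.
load fisheriris
mdl = fitctree(meas, species);              % illustrative classifier
[~, score] = resubPredict(mdl);             % one score column per class
posClass = 'virginica';
col = strcmp(mdl.ClassNames, posClass);     % column for the positive class
[X, Y] = perfcurve(species, score(:,col), posClass);
plot(X, Y)
xlabel('False positive rate')
ylabel('True positive rate')
```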
perfcurve can compute values for various criteria to plot either on the x- or the y-axis. All such criteria are described by a 2-by-2 confusion matrix, a 2-by-2 cost matrix, and a 2-by-1 vector of scales applied to the class counts.
The confusion matrix, C, is defined as

C = [TP FN; FP TN]

where

P stands for "positive".
N stands for "negative".
T stands for "true".
F stands for "false".
For example, the first row of the confusion matrix defines how the classifier identifies instances of the positive class: C(1,1) is the count of correctly identified positive instances, and C(1,2) is the count of positive instances misidentified as negative.
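The following sketch shows how these counts arise at one particular threshold. The variable names (yTrue, score, thr) are hypothetical, not perfcurve inputs:

```matlab
% Sketch: the 2-by-2 confusion counts at one threshold thr.
% yTrue is a logical vector (true = positive class); score is the classifier score.
pred = score >= thr;                 % classify as positive above the threshold
TP = sum( yTrue &  pred);            % true positives
FN = sum( yTrue & ~pred);            % positives misidentified as negative
FP = sum(~yTrue &  pred);            % negatives misidentified as positive
TN = sum(~yTrue & ~pred);            % true negatives
C  = [TP FN; FP TN];                 % matches C(1,1)=TP, C(1,2)=FN, and so on
```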
The cost matrix defines the cost of misclassification for each category:

Cost = [Cost(P|P) Cost(N|P); Cost(P|N) Cost(N|N)]

where Cost(I|J) is the cost of assigning an instance of class J to class I. Usually Cost(I|J) = 0 for I = J. For flexibility, perfcurve allows you to specify nonzero costs for correct classification as well.
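For instance, you can pass a 2-by-2 cost matrix through the 'Cost' name-value argument and plot the expected cost criterion. A sketch, assuming labels, scores, and posClass come from a previously trained classifier:

```matlab
% Sketch: expected-cost curve with a custom misclassification cost matrix.
% labels, scores, and posClass are assumed inputs from a trained classifier.
costMat = [0 2;    % Cost(P|P)=0, Cost(N|P)=2: missing a positive costs 2
           1 0];   % Cost(P|N)=1, Cost(N|N)=0: a false alarm costs 1
[X, Y] = perfcurve(labels, scores, posClass, 'Cost', costMat, 'YCrit', 'ecost');
plot(X, Y)         % expected cost against the false positive rate
```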
The two scales include prior information about the class probabilities. perfcurve computes these scales by taking scale(P) = prior(P)*N and scale(N) = prior(N)*P and normalizing the sum scale(P) + scale(N) to 1. P = TP + FN and N = TN + FP are the total instance counts in the positive and negative class, respectively. The function then applies the scales as multiplicative factors to the counts from the corresponding class: perfcurve multiplies counts from the positive class by scale(P) and counts from the negative class by scale(N). Consider, for example, computation of the positive predictive value, PPV = TP/(TP+FP). TP counts come from the positive class and FP counts come from the negative class. Therefore, you need to scale TP by scale(P) and FP by scale(N), and the modified formula for PPV with prior probabilities taken into account is:

PPV = scale(P)*TP / (scale(P)*TP + scale(N)*FP)
If all scores in the data are above a certain threshold, perfcurve classifies all instances as 'positive'. This means that TP is the total number of instances in the positive class and FP is the total number of instances in the negative class. In this case, PPV is simply given by the prior:

PPV = prior(P) / (prior(P) + prior(N))
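A small worked sketch of these scale formulas, using made-up counts and priors:

```matlab
% Sketch: class scales and prior-adjusted PPV with made-up numbers.
P = 30;  N = 70;                     % total positive and negative counts
prior = [0.5 0.5];                   % prior(P), prior(N)
scaleP = prior(1)*N;  scaleN = prior(2)*P;
s = scaleP + scaleN;                 % normalize so scale(P) + scale(N) = 1
scaleP = scaleP/s;  scaleN = scaleN/s;
TP = 24;  FP = 14;                   % counts at some threshold
PPV = scaleP*TP / (scaleP*TP + scaleN*FP)   % prior-adjusted positive predictive value
```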
The perfcurve function returns two vectors, X and Y, of performance measures. Each measure is some function of the confusion, cost, and scale values. You can request specific measures by name or provide a function handle to compute a custom measure. The function you provide should take confusion, cost, and scale as its three inputs and return a vector of output values.
The criterion for X must be a monotone function of the positive classification count or, equivalently, of the threshold for the supplied scores. If perfcurve cannot perform a one-to-one mapping between values of the X criterion and the score thresholds, it exits with an error message.
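For example, classification accuracy is not monotone in the positive classification count, so it cannot serve as the X criterion, but you can request it as the Y criterion and then pick the threshold that maximizes it. A sketch, assuming labels, scores, and posClass from a trained classifier:

```matlab
% Sketch: find the score threshold that maximizes classification accuracy.
% Accuracy is requested as the Y criterion; X keeps its default (false positive rate).
[~, acc, T] = perfcurve(labels, scores, posClass, 'YCrit', 'accu');
[bestAcc, idx] = max(acc);        % highest accuracy over all thresholds
bestThreshold = T(idx);           % threshold at which it is attained
```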
By default, perfcurve computes values of the X and Y criteria for all possible score thresholds. Alternatively, it can compute a reduced number of specific X values supplied as an input argument. In either case, for M requested values, perfcurve computes M+1 values for X and Y. The first of these M+1 values is special: perfcurve computes it by setting the TP instance count to zero and setting TN to the total count in the negative class. This value corresponds to the 'reject all' threshold. On a standard ROC curve, this translates into an extra point placed at (0,0).
If there are NaN values among the input scores, perfcurve can process them in either of two ways:

It can discard rows with NaN scores.
It can add them to the false classification counts in the respective class.

That is, for any threshold, instances with NaN scores from the positive class are counted as false negative (FN), and instances with NaN scores from the negative class are counted as false positive (FP). In this case, the first value of X or Y is computed by setting TP to zero and setting TN to the total count minus the NaN count in the negative class. For illustration, consider an example with two rows in the positive class and two rows in the negative class, where one row in each class has a NaN score:
Class | Score |
---|---|
Negative | 0.2 |
Negative | NaN |
Positive | 0.7 |
Positive | NaN |
If you discard rows with NaN scores, then as the score cutoff varies, perfcurve computes performance measures as in the following table. For example, a cutoff of 0.5 corresponds to the middle row, where rows 1 and 3 are classified correctly and rows 2 and 4 are omitted.
TP | FN | FP | TN |
---|---|---|---|
0 | 1 | 0 | 1 |
1 | 0 | 0 | 1 |
1 | 0 | 1 | 0 |
If you add rows with NaN scores to the false category in their respective classes, perfcurve computes performance measures as in the following table. For example, a cutoff of 0.5 corresponds to the middle row, where rows 2 and 4 are now counted as incorrectly classified. Notice that only the FN and FP columns differ between these two tables.
TP | FN | FP | TN |
---|---|---|---|
0 | 2 | 1 | 1 |
1 | 1 | 1 | 1 |
1 | 1 | 2 | 0 |
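perfcurve selects between these two behaviors through its 'ProcessNaN' name-value argument. A sketch reproducing the four-row example above:

```matlab
% Sketch: the two NaN-handling behaviors on the small example above.
labels = {'Negative'; 'Negative'; 'Positive'; 'Positive'};
scores = [0.2; NaN; 0.7; NaN];
% Default behavior: discard rows with NaN scores.
[X1, Y1] = perfcurve(labels, scores, 'Positive', 'ProcessNaN', 'ignore');
% Alternative: add NaN rows to the false counts of their true class.
[X2, Y2] = perfcurve(labels, scores, 'Positive', 'ProcessNaN', 'addtofalse');
```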
For data with three or more classes, perfcurve takes one positive class and a list of negative classes as input. The function computes the X and Y values using counts in the positive class to estimate TP and FN, and using counts in all negative classes to estimate TN and FP. perfcurve can optionally compute Y values for each negative class separately and, in addition to Y, return a matrix of size M-by-C, where M is the number of elements in X or Y and C is the number of negative classes. You can use this functionality to monitor the components of the negative class contribution. For example, you can plot TP counts on the X-axis and FP counts on the Y-axis. In this case, the returned matrix shows how the FP component is split across the negative classes.
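A sketch of this usage, following the TP-versus-FP example above. The data set and classifier (fisheriris and fitctree) are again illustrative only:

```matlab
% Sketch: split the FP count across negative classes in a three-class problem.
load fisheriris
mdl = fitctree(meas, species);
[~, score] = resubPredict(mdl);
posClass = 'virginica';
col = strcmp(mdl.ClassNames, posClass);
[X, Y, ~, ~, ~, SUBY, SUBYNAMES] = perfcurve(species, score(:,col), posClass, ...
    'XCrit', 'tp', 'YCrit', 'fp');
plot(X, SUBY)                       % one curve per negative class
legend(SUBYNAMES)
xlabel('TP count'); ylabel('FP count per negative class')
```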
You can also use perfcurve to estimate confidence intervals. perfcurve computes confidence bounds using either cross-validation or bootstrap. If you supply cell arrays for labels and scores, perfcurve uses cross-validation and treats the elements in the cell arrays as cross-validation folds. If you set the input parameter NBoot to a positive integer, perfcurve generates NBoot bootstrap replicas to compute pointwise confidence bounds.
perfcurve estimates the confidence bounds using one of two methods:

Vertical averaging (VA) — estimate confidence bounds on Y and T at fixed values of X. Use the XVals input parameter to select this method for computing confidence bounds.

Threshold averaging (TA) — estimate confidence bounds for X and Y at fixed thresholds for the positive class score. Use the TVals input parameter to select this method for computing confidence bounds.
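For example, a sketch of bootstrap bounds with vertical averaging, assuming labels, scores, and posClass from a trained classifier. When confidence bounds are computed, Y comes back with three columns: the criterion value and its lower and upper bounds.

```matlab
% Sketch: pointwise bootstrap confidence bounds with vertical averaging.
% labels, scores, and posClass are assumed inputs from a trained classifier.
[X, Y] = perfcurve(labels, scores, posClass, 'NBoot', 1000, 'XVals', 0:0.05:1);
plot(X, Y(:,1))                          % vertically averaged ROC curve
hold on
plot(X, Y(:,2), '--', X, Y(:,3), '--')   % pointwise lower and upper bounds
hold off
```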
To use observation weights instead of observation counts, use the 'Weights' parameter in your call to perfcurve. When you use this parameter, perfcurve uses your supplied observation weights instead of observation counts to compute X, Y, and T, and to compute confidence bounds by cross-validation. To compute confidence bounds by bootstrap, perfcurve samples N out of N observations with replacement, using your weights as multinomial sampling probabilities.
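A minimal sketch, assuming w is an N-by-1 vector of nonnegative weights aligned with labels and scores:

```matlab
% Sketch: use observation weights instead of counts.
% labels, scores, posClass, and the weight vector w are assumed inputs.
[X, Y, T] = perfcurve(labels, scores, posClass, 'Weights', w);
```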