Compare accuracies of two classification models by repeated cross-validation

`testckfold` statistically assesses the accuracies of two classification models by repeatedly cross-validating the two models, determining the differences in the classification loss, and then formulating the test statistic by combining the classification loss differences. This type of test is particularly appropriate when the sample size is limited.
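To make the procedure concrete, the following Python sketch generates the loss differences that a 5-by-2 test combines: five replications of 2-fold cross-validation, with the first model trained and evaluated on `X1` and the second on `X2`, both against the same true labels. It is not the `testckfold` implementation. Assumptions: a model template is represented by a hypothetical `fit(X, y)` function that returns a `predict(row)` function, the loss is the misclassification rate, and the folds are plain random halves without stratification. The toy models `fit_majority` and `fit_threshold` are illustrative only.

```python
import random
from typing import Callable, List, Sequence, Tuple

# Hypothetical stand-in for a model template: fit(X, y) returns predict(row).
Row = Sequence[float]
Predict = Callable[[Row], int]
Fit = Callable[[List[Row], List[int]], Predict]


def misclassification_rate(predict: Predict, X: List[Row], y: List[int]) -> float:
    """Fraction of rows whose predicted label differs from the true label."""
    wrong = sum(1 for row, label in zip(X, y) if predict(row) != label)
    return wrong / len(y)


def five_by_two_differences(fit1: Fit, fit2: Fit, X1: List[Row], X2: List[Row],
                            y: List[int], seed: int = 0) -> List[Tuple[float, float]]:
    """Return 5 pairs (d_i1, d_i2) of differences loss(model 1) - loss(model 2)."""
    if not len(X1) == len(X2) == len(y):
        raise ValueError("X1, X2, and y must describe the same observations")
    rng = random.Random(seed)
    indices = list(range(len(y)))
    diffs = []
    for _ in range(5):  # five replications
        rng.shuffle(indices)
        half = len(indices) // 2
        folds = (indices[:half], indices[half:])
        pair = []
        # Each half serves once as the test set and once as the training set.
        for test_idx, train_idx in (folds, folds[::-1]):
            train_y = [y[i] for i in train_idx]
            test_y = [y[i] for i in test_idx]
            # Both models use the same rows; only the predictor columns differ.
            model1 = fit1([X1[i] for i in train_idx], train_y)
            model2 = fit2([X2[i] for i in train_idx], train_y)
            e1 = misclassification_rate(model1, [X1[i] for i in test_idx], test_y)
            e2 = misclassification_rate(model2, [X2[i] for i in test_idx], test_y)
            pair.append(e1 - e2)
        diffs.append((pair[0], pair[1]))
    return diffs


def fit_majority(X: List[Row], y: List[int]) -> Predict:
    """Toy model: always predict the most frequent training label."""
    majority = max(set(y), key=y.count)
    return lambda row: majority


def fit_threshold(X: List[Row], y: List[int]) -> Predict:
    """Toy model: predict 1 when the first feature exceeds its training mean."""
    cut = sum(row[0] for row in X) / len(X)
    return lambda row: int(row[0] > cut)


if __name__ == "__main__":
    X = [[float(i)] for i in range(20)]
    y = [int(i >= 10) for i in range(20)]
    print(five_by_two_differences(fit_majority, fit_threshold, X, X, y))
```

Passing the same model and the same predictors as both arguments yields all-zero differences, a quick check that the two models really are paired on identical training and test rows.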

You can assess whether the accuracies of the classification models are different, or
whether one classification model performs better than another. Available tests include a
5-by-2 paired *t* test, a 5-by-2 paired *F* test, and
a 10-by-10 repeated cross-validation *t* test. For more details, see
Repeated Cross-Validation Tests. To speed up computations, `testckfold` supports parallel computing (requires a Parallel Computing Toolbox™ license).
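The 5-by-2 tests differ in how they combine the ten loss differences. As a sketch of the 5-by-2 paired *t* statistic as defined by Dietterich [4] (not necessarily the exact `testckfold` computation), the Python function below divides the first difference of the first replication by a pooled estimate of the within-replication spread. Under the null hypothesis, the statistic is referred to a *t* distribution with 5 degrees of freedom. The input layout, a list of five `(d_i1, d_i2)` pairs, and the function name are assumptions of this sketch.

```python
import math
from typing import Sequence, Tuple


def five_by_two_t(diffs: Sequence[Tuple[float, float]]) -> float:
    """Dietterich's 5x2 cv paired t statistic from five (d_i1, d_i2) pairs."""
    if len(diffs) != 5 or any(len(pair) != 2 for pair in diffs):
        raise ValueError("expected 5 replications of 2 folds each")
    # s_i^2 for each replication: spread of its two differences around their mean.
    total_spread = 0.0
    for d1, d2 in diffs:
        mean = (d1 + d2) / 2
        total_spread += (d1 - mean) ** 2 + (d2 - mean) ** 2
    if total_spread == 0:
        raise ZeroDivisionError("loss differences have zero within-replication variance")
    # Only the first difference of the first replication enters the numerator.
    return diffs[0][0] / math.sqrt(total_spread / 5)


if __name__ == "__main__":
    example = [(0.02, 0.04), (0.01, 0.03), (0.03, 0.01), (0.00, 0.02), (0.02, 0.02)]
    print(five_by_two_t(example))
```

Unlike a statistic built from squared differences, this value keeps the sign of the difference, which is what makes one-sided alternatives possible.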

`h` = testckfold(`C1`,`C2`,`X1`,`X2`) returns the test decision that results from conducting a 5-by-2 paired *F* cross-validation test. The null hypothesis is that the classification models `C1` and `C2` have equal accuracy in predicting the true class labels using the predictor and response data in the tables `X1` and `X2`. `h` = `1` indicates rejection of the null hypothesis at the 5% significance level.
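For the default 5-by-2 paired *F* test, a minimal Python sketch of the statistic from Alpaydin [1] and of the 5% decision follows. It assumes the ten loss differences arrive as five `(d_i1, d_i2)` pairs, and instead of computing a *p*-value it hard-codes the upper 5% point of the *F* distribution with (10, 5) degrees of freedom, about 4.735, taken from standard tables. The names `five_by_two_f` and `decide_5x2_f` are hypothetical.

```python
from typing import Sequence, Tuple

# Upper 5% point of the F distribution with (10, 5) degrees of freedom,
# from standard F tables.
F_CRIT_10_5 = 4.735


def five_by_two_f(diffs: Sequence[Tuple[float, float]]) -> float:
    """Alpaydin's combined 5x2 cv F statistic from five (d_i1, d_i2) pairs."""
    if len(diffs) != 5 or any(len(pair) != 2 for pair in diffs):
        raise ValueError("expected 5 replications of 2 folds each")
    # Numerator: sum of all ten squared loss differences.
    numerator = sum(d * d for pair in diffs for d in pair)
    # Denominator: twice the summed within-replication variance estimates s_i^2.
    spread = 0.0
    for d1, d2 in diffs:
        mean = (d1 + d2) / 2
        spread += (d1 - mean) ** 2 + (d2 - mean) ** 2
    if spread == 0:
        raise ZeroDivisionError("loss differences have zero within-replication variance")
    return numerator / (2 * spread)


def decide_5x2_f(diffs: Sequence[Tuple[float, float]]) -> int:
    """Return 1 to reject equal accuracy at the 5% level, else 0."""
    return int(five_by_two_f(diffs) > F_CRIT_10_5)


if __name__ == "__main__":
    example = [(0.02, 0.04), (0.01, 0.03), (0.03, 0.01), (0.00, 0.02), (0.02, 0.02)]
    print(five_by_two_f(example), decide_5x2_f(example))
```

For the example pairs, the statistic is about 3.25, below the critical value, so the decision is `0` (do not reject). Because the statistic sums squared differences, it has no sign and suits only the two-sided hypothesis.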

`testckfold` conducts the cross-validation test by applying `C1` and `C2` to all predictor variables in `X1` and `X2`, respectively. The true class labels in `X1` and `X2` must be the same. The response variable names in `X1`, `X2`, `C1.ResponseName`, and `C2.ResponseName` must be the same.

For examples of ways to compare models, see Tips.

`h` = testckfold(___,`Name,Value`) uses any of the input arguments in the previous syntaxes and additional options specified by one or more `Name,Value` pair arguments. For example, you can specify the type of alternative hypothesis, the type of test, or the use of parallel computing.

Examples of ways to compare models include:

Compare the accuracies of a simple classification model and a more complex model by passing the same set of predictor data.

Compare the accuracies of two different models using two different sets of predictors.

Perform various types of feature selection. For example, you can compare the accuracy of a model trained using a set of predictors to the accuracy of one trained on a subset or different set of predictors. You can arbitrarily choose the set of predictors, or use a feature selection technique like PCA or sequential feature selection (see `pca` and `sequentialfs`).

If both of these statements are true, then you can omit supplying `Y`. In that case, `testckfold` uses the common response variable in the tables.

One way to perform cost-insensitive feature selection is:

1. Create a classification model template that characterizes the first classification model (`C1`).
2. Create a classification model template that characterizes the second classification model (`C2`).
3. Specify two predictor data sets. For example, specify `X1` as the full predictor set and `X2` as a reduced set.
4. Enter `testckfold(C1,C2,X1,X2,Y,'Alternative','less')`. If `testckfold` returns `1`, then there is enough evidence to suggest that the classification model that uses fewer predictors performs better than the model that uses the full predictor set.
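Step 4 replaces the default two-sided test with a one-sided one, which moves the rejection threshold. The Python sketch below shows that effect for a signed 5-by-2 paired *t* statistic with 5 degrees of freedom at the 5% level, using critical values from standard *t* tables (2.5706 two-sided, 2.0150 one-sided). The sign convention (the statistic is built from the loss of `C1` minus the loss of `C2`, so positive values mean `C1` is less accurate) and the function name `decide_5x2_t` are assumptions, not documented `testckfold` behavior.

```python
# Critical values at the 5% level for a t distribution with 5 degrees of freedom,
# taken from standard tables.
T_CRIT_TWO_SIDED = 2.5706  # t_{0.975}
T_CRIT_ONE_SIDED = 2.0150  # t_{0.95}


def decide_5x2_t(t: float, alternative: str = "unequal") -> int:
    """Return 1 to reject the null hypothesis of equal accuracy, else 0.

    Assumed convention: t is computed from loss(C1) - loss(C2), so a large
    positive t means C1 has the higher loss, i.e. C1 is less accurate.
    """
    if alternative == "unequal":  # two-sided: any difference in accuracy
        return int(abs(t) > T_CRIT_TWO_SIDED)
    if alternative == "less":     # one-sided: C1 less accurate than C2
        return int(t > T_CRIT_ONE_SIDED)
    if alternative == "greater":  # one-sided: C1 more accurate than C2
        return int(t < -T_CRIT_ONE_SIDED)
    raise ValueError("alternative must be 'unequal', 'less', or 'greater'")


if __name__ == "__main__":
    for alt in ("unequal", "less", "greater"):
        print(alt, decide_5x2_t(2.3, alt))
```

Under these assumptions, a statistic of 2.3 leads to rejection for `'less'` but not for the two-sided default.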

Alternatively, you can assess whether there is a significant difference between the accuracies of the two models. To perform this assessment, remove the `'Alternative','less'` specification in step 4. `testckfold` then conducts a two-sided test, and `h = 0` indicates that there is not enough evidence to suggest a difference in the accuracy of the two models.

The tests are appropriate for the misclassification rate classification loss, but you can specify other loss functions (see `LossFun`). The key assumptions are that the estimated classification loss differences are independent and normally distributed, with mean 0 and finite common variance under the two-sided null hypothesis. Classification losses other than the misclassification rate can violate this assumption.

Highly discrete data, imbalanced classes, and highly imbalanced cost matrices can violate the normality assumption of classification loss differences.

If you conduct the 10-by-10 repeated cross-validation *t* test by specifying `'Test','10x10t'`, then `testckfold` uses 10 degrees of freedom for the *t* distribution to find the critical region and estimate the *p*-value. For more details, see [2] and [3].
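A rough Python sketch of a 10-by-10 statistic: it pools all 100 loss differences, inflates their variance by the test-to-training size ratio n2/n1 (1/9 for 10-fold cross-validation), the corrected variance estimate studied in [3], and compares the result with a *t* distribution with 10 degrees of freedom, consistent with the paragraph above. Treat the exact formula as an assumption; the statistic `testckfold` actually uses is defined in Repeated Cross-Validation Tests. The critical value 2.2281 is the two-sided 5% point of *t* with 10 degrees of freedom from standard tables, and the function names are hypothetical.

```python
import math
from typing import Sequence

T_CRIT_10_TWO_SIDED = 2.2281  # t_{0.975} with 10 degrees of freedom (standard tables)


def ten_by_ten_t(diffs: Sequence[Sequence[float]], n2_over_n1: float = 1 / 9) -> float:
    """Corrected repeated cross-validation t statistic (assumed form).

    diffs[r][k] is loss(model 1) - loss(model 2) on fold k of replication r.
    n2_over_n1 is the ratio of test-fold size to training-set size.
    """
    if len(diffs) != 10 or any(len(row) != 10 for row in diffs):
        raise ValueError("expected 10 replications of 10 folds each")
    values = [d for row in diffs for d in row]
    n = len(values)  # 100 differences
    mean = sum(values) / n
    variance = sum((d - mean) ** 2 for d in values) / (n - 1)  # sample variance
    # The extra n2/n1 term compensates for correlation caused by overlapping training sets.
    scale = math.sqrt((1 / n + n2_over_n1) * variance)
    if scale == 0:
        raise ZeroDivisionError("loss differences have zero variance")
    return mean / scale


def decide_10x10_t(diffs: Sequence[Sequence[float]]) -> int:
    """Two-sided decision at the 5% level: 1 rejects equal accuracy."""
    return int(abs(ten_by_ten_t(diffs)) > T_CRIT_10_TWO_SIDED)
```

Without the n2/n1 term, the 100 overlapping folds would make the variance estimate too small and the test too eager to reject.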

Use `testcholdout`:

For test sets with larger sample sizes

To implement variants of the McNemar test to compare two classification model accuracies

For cost-sensitive testing using a chi-square or likelihood ratio test. The chi-square test uses `quadprog`, which requires an Optimization Toolbox™ license.

[1] Alpaydin, E. “Combined 5 x 2 CV *F* Test for Comparing Supervised Classification Learning Algorithms.” *Neural Computation*, Vol. 11, No. 8, 1999, pp. 1885–1892.

[2] Bouckaert, R. “Choosing Between Two Learning Algorithms Based on Calibrated Tests.” *International Conference on Machine Learning*, 2003, pp. 51–58.

[3] Bouckaert, R., and E. Frank. “Evaluating
the Replicability of Significance Tests for Comparing Learning Algorithms.” *Advances
in Knowledge Discovery and Data Mining, 8th Pacific-Asia Conference*,
2004, pp. 3–12.

[4] Dietterich, T. “Approximate Statistical Tests for Comparing Supervised Classification Learning Algorithms.” *Neural Computation*, Vol. 10, No. 7, 1998, pp. 1895–1923.

[5] Hastie, T., R. Tibshirani, and J. Friedman. *The
Elements of Statistical Learning*, 2nd Ed. New York: Springer,
2008.

`templateDiscriminant` | `templateECOC` | `templateEnsemble` | `templateKNN` | `templateNaiveBayes` | `templateSVM` | `templateTree` | `testcholdout`