Why is SVM performance with small random datasets so high?
To better understand how SVMs work, I am training a binary SVM with the function fitcsvm, using a sample data set of completely random numbers and cross-validating the classifier with 10-fold cross-validation.
Since the dataset consists of random numbers, I would expect the classification accuracy of the trained cross-validated SVM to be around 50%.
However, with small datasets, for example 2 predictors and 12 observations (6 per class), I get very high classification accuracy, up to about 75%. The accuracy gets close to 50% as the dataset grows, for example with 2 predictors and 60 observations, or with 40 predictors and 12 observations. Why is the classification accuracy so high with small datasets?
I suspect that over-fitting is more likely with small datasets. Is that what is happening here?
Still, with cross-validation the SVM is repeatedly trained on nine partitions and tested on the tenth. Even with a small dataset, I would expect an accuracy of around 50%, simply because the held-out partition consists of random numbers. Does the cross-validation perform some optimization of the model parameters?
The code that I am using is something like the following, where I try 100 different combinations of Kernel Scale and Box Constraint and then take the combination that yields the lowest classification error:
SVMModel = fitcsvm(cdata, label, 'KernelFunction','linear', 'Standardize',true, ...
    'KernelScale',KS, 'BoxConstraint',BC, ...
    'CrossVal','on', 'KFold',10);
MisclassRate = kfoldLoss(SVMModel);
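For concreteness, the full search is roughly the following, where the 10-by-10 logarithmic grid stands in for my 100 combinations (the actual values I try differ):
kernelScales   = logspace(-2, 2, 10);    % candidate KernelScale values (illustrative grid)
boxConstraints = logspace(-2, 2, 10);    % candidate BoxConstraint values (illustrative grid)
bestErr = Inf;
for KS = kernelScales
    for BC = boxConstraints
        SVMModel = fitcsvm(cdata, label, 'KernelFunction','linear', 'Standardize',true, ...
            'KernelScale',KS, 'BoxConstraint',BC, 'CrossVal','on', 'KFold',10);
        MisclassRate = kfoldLoss(SVMModel);
        if MisclassRate < bestErr        % keep the combination with the lowest CV error
            bestErr = MisclassRate;
            bestKS  = KS;
            bestBC  = BC;
        end
    end
end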
I would very much appreciate any clarification. Many thanks!
Accepted Answer
Ilya
on 27 Feb 2017
Let me make sure I got your procedure right. You apply M models to a dataset and measure their accuracies by cross-validation. Each model is described by a set of parameter values, such as box constraint and kernel scale. Out of these models, you select the one with the largest cross-validation accuracy a_best and record the parameter values of this model, pars_best. To estimate the significance of this model, you learn the same model (that is, pass pars_best to fitcsvm) on R synthetic datasets. Each synthetic dataset is obtained by randomly permuting the class labels of the original dataset. You estimate the cdf F(a) over these R accuracy values. Then you take 1-F(a_best) to be the p-value for the null hypothesis "the model pars_best has no discriminative power".
If I got this right, you should modify your procedure as follows. In every run (that is, for every noise dataset), instead of recording the accuracy of a model learned using pars_best, search for the best model over the M parameter combinations and record the accuracy of that best model. Estimate the cdf F_noisebest(a) using these R values and take 1-F_noisebest(a_best) to be the p-value.
In your procedure, you apply a single classifier to a noise dataset, and its accuracy is expected to be that of a random coin toss (perhaps an unfair toss if you have imbalanced classes). In my procedure, you choose the best out of M classifiers applied to a noise dataset, and the accuracy of that best classifier is usually better (often much better) than a random coin toss. This can increase your estimate of the p-value quite a bit, making the best model pars_best less significant.
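A minimal sketch of this corrected procedure, where searchBestCVAccuracy stands for whatever code runs your M-combination grid search on a given label vector and returns the best cross-validation accuracy (that helper name and R = 1000 are placeholders, not something from your code):
R = 1000;                                  % number of random label permutations
bestNoiseAcc = zeros(R,1);
for r = 1:R
    permLabel = label(randperm(numel(label)));                  % noise dataset: any real signal destroyed
    bestNoiseAcc(r) = searchBestCVAccuracy(cdata, permLabel);   % best of the M models on noise
end
pValue = mean(bestNoiseAcc >= a_best);     % estimate of 1 - F_noisebest(a_best)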
You could also use simple analytic formulas for the binomial distribution and order statistic to verify your computation.
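For example, under the null hypothesis the number of correctly classified observations for a single model follows a Binomial(n, 0.5) distribution, so if the M cross-validated accuracies were independent (an optimistic simplification), the chance that the best of M models reaches at least k correct classifications would be roughly as below (n, M, and k mirror the 12-observation example in my other answer):
n = 12;  M = 100;  k = 9;                   % observations, models tried, "high" number correct
pSingle  = binocdf(k-1, n, 0.5, 'upper');   % P(a single model gets >= k right by chance)
pBestOfM = 1 - (1 - pSingle)^M;             % P(the best of M models gets >= k right)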
More Answers (1)
Ilya
on 31 Jan 2017
You have 12 observations. For each observation, the probability of correct classification is 0.5. What is the probability of classifying 9 or more observations correctly by chance? It's
>> p = binocdf(8,12,0.5,'upper')
p =
0.0730
And what is the probability of that chance event occurring at least once in 100 experiments? It's
>> binocdf(0,100,p,'upper')
ans =
0.9995
Since you take the most accurate model, you always get a highly optimistic estimate of accuracy; that's all.
5 Comments
Ilya
on 3 Feb 2017
If I understand correctly, you take the cross-validation accuracy of the best model (the same accuracy that was used to identify the best model) and then compare that accuracy with a distribution obtained from noise (randomly permuted labels). If that is what you do, your procedure is incorrect: it always produces an estimate of model performance (accuracy, significance, whatever you call it) that is optimistically biased.
You select the model with the highest accuracy, but you do not know if this value is high by chance or because the model is really good. Then you take that high value and compare it with a distribution of noise. If the accuracy value is high, it naturally gets into the tail of the noise distribution. But that does not prove that the model is really good. This only shows that the accuracy value is high, which is what you established in the first place.
Dealing with small datasets is tough and usually requires domain knowledge. Maybe you can generate synthetic data by adding some noise to the predictors. Maybe, despite what you think, you can set aside a fraction of the dataset for testing.
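For the last option, a minimal sketch using cvpartition (bestKS and bestBC stand for whatever values your search on the training portion selects, and the 30% holdout fraction is just an example):
c = cvpartition(label, 'HoldOut', 0.3);              % stratified train/test split
% ... tune KernelScale and BoxConstraint by cross-validation on the training part only ...
mdl = fitcsvm(cdata(training(c),:), label(training(c)), ...
    'KernelFunction','linear', 'Standardize',true, ...
    'KernelScale',bestKS, 'BoxConstraint',bestBC);
testErr = loss(mdl, cdata(test(c),:), label(test(c)));   % error on data never used for model selection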