To understand more how SVMs work, I am training a binary SVM with the function fitcsvm, using a sample data set of completely random numbers and cross-validating the classifier with a 10-fold cross-validation.
Since the dataset consists of random numbers, I would expect the classification accuracy of the trained cross-validated SVM to be around 50%.
However, with small datasets, for example consisting of 2 predictors and 12 observations (6 per class), I get very high classification accuracy, up to about 75%. Classification accuracy gets close to 50% by increasing the dataset, for example 2 predictors and 60 observations or 40 predictors and 12 observations. Why with small datasets is the classification accuracy so high?
I guess that with small datasets you might more easily go into over-fitting. Is this the case here?
Anyway, with cross-validation, the SVM is recursively trained on nine partitions and tested on the tenth. Even if the dataset is small, I would anyway expect an accuracy of around 50%, simply because the tenth partition is made of random numbers. Does the cross-validation perform some optimization of the model parameters?
The code that I am using is something like the following, where I try 100 different combinations of Kernel Scale and Box Constraint and then take the combination that yields the lowest classification error:
SVMModel = fitcsvm(cdata, label, 'KernelFunction','linear', 'Standardize',true,...
MisclassRate = kfoldLoss(SVMModel);
I would very much appreciate any clarification. Many thanks!