Support vector machine (SVM) for one-class and binary classification

`ClassificationSVM` is a support vector machine (SVM) classifier for one-class and two-class learning. Trained `ClassificationSVM` classifiers store training data, parameter values, prior probabilities, support vectors, and algorithmic implementation information. Use these classifiers to perform tasks such as fitting a score-to-posterior-probability transformation function (see `fitPosterior`) and predicting labels for new data (see `predict`).

Create a `ClassificationSVM` object by using `fitcsvm`.

| Function | Description |
| --- | --- |
| `compact` | Reduce size of support vector machine (SVM) classifier |
| `compareHoldout` | Compare accuracies of two classification models using new data |
| `crossval` | Cross-validate support vector machine (SVM) classifier |
| `discardSupportVectors` | Discard support vectors for linear support vector machine (SVM) classifier |
| `edge` | Find classification edge for support vector machine (SVM) classifier |
| `fitPosterior` | Fit posterior probabilities for support vector machine (SVM) classifier |
| `loss` | Find classification error for support vector machine (SVM) classifier |
| `margin` | Find classification margins for support vector machine (SVM) classifier |
| `predict` | Classify observations using support vector machine (SVM) classifier |
| `resubEdge` | Find classification edge for support vector machine (SVM) classifier by resubstitution |
| `resubLoss` | Find classification loss for support vector machine (SVM) classifier by resubstitution |
| `resubMargin` | Find classification margins for support vector machine (SVM) classifier by resubstitution |
| `resubPredict` | Classify observations in support vector machine (SVM) classifier |
| `resume` | Resume training support vector machine (SVM) classifier |

For the mathematical formulation of the SVM binary classification algorithm, see Support Vector Machines for Binary Classification and Understanding Support Vector Machines.

`NaN`, `<undefined>`, empty character vector (`''`), empty string (`""`), and `<missing>` values indicate missing values. `fitcsvm` removes entire rows of data corresponding to a missing response. When computing total weights (described in the paragraphs that follow), `fitcsvm` ignores any weight corresponding to an observation with at least one missing predictor. This action can lead to unbalanced prior probabilities in balanced-class problems. Consequently, observation box constraints might not equal `BoxConstraint`.

`fitcsvm` removes observations that have zero weight or prior probability.

For two-class learning, if you specify the cost matrix $$\mathcal{C}$$ (see `Cost`), then the software updates the class prior probabilities $$p$$ (see `Prior`) to $$p_{c}$$ by incorporating the penalties described in $$\mathcal{C}$$. Specifically, `fitcsvm` completes these steps:

1. Compute $${p}_{c}^{\ast}={p}^{\prime}\mathcal{C}.$$
2. Normalize $${p}_{c}^{\ast}$$ so that the updated prior probabilities sum to 1: $${p}_{c}=\frac{1}{\sum _{j=1}^{K}{p}_{c,j}^{\ast}}{p}_{c}^{\ast},$$ where *K* is the number of classes.
3. Reset the cost matrix to the default $$\mathcal{C}=\left[\begin{array}{cc}0& 1\\ 1& 0\end{array}\right].$$
4. Remove observations from the training data corresponding to classes with zero prior probability.
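The prior-update steps above can be traced by hand. The following is a minimal plain-Python sketch, not the `fitcsvm` implementation: it reads the product of the prior vector and the cost matrix literally as a row vector times a matrix, and the function name `update_priors`, the priors, and the cost values are all hypothetical.

```python
def update_priors(priors, cost):
    """Fold a cost matrix into class priors and renormalize to sum to 1."""
    k = len(priors)
    # Step 1: raw[j] = sum_i priors[i] * cost[i][j]  (row vector times matrix)
    raw = [sum(priors[i] * cost[i][j] for i in range(k)) for j in range(k)]
    # Step 2: normalize so the updated priors sum to 1
    total = sum(raw)
    return [value / total for value in raw]


priors = [0.6, 0.4]      # hypothetical class priors
cost = [[0.0, 1.0],
        [3.0, 0.0]]      # hypothetical cost matrix

updated = update_priors(priors, cost)

# Step 3: once the costs are folded into the priors, the cost matrix
# is reset to the default 0/1 matrix.
default_cost = [[0.0, 1.0], [1.0, 0.0]]

# Step 4: classes whose updated prior is zero would be dropped from training.
kept_classes = [j for j, p in enumerate(updated) if p > 0]

print(updated, kept_classes)
```

The sketch does not guard against an all-zero cost matrix, which would make the normalization in step 2 divide by zero.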

For two-class learning, `fitcsvm` normalizes all observation weights (see `Weights`) to sum to 1. The function then renormalizes the normalized weights to sum up to the updated prior probability of the class to which the observation belongs. That is, the total weight for observation *j* in class *k* is

$${w}_{j}^{\ast}=\frac{{w}_{j}}{\sum _{\forall j\in \text{Class } k}{w}_{j}}{p}_{c,k}.$$

$${w}_{j}$$ is the normalized weight for observation *j*; $${p}_{c,k}$$ is the updated prior probability of class *k* (described above).

For two-class learning, `fitcsvm` assigns a box constraint to each observation in the training data. The formula for the box constraint of observation *j* is

$${C}_{j}=n{C}_{0}{w}_{j}^{\ast}.$$

*n* is the training sample size, $${C}_{0}$$ is the initial box constraint (see the `'BoxConstraint'` name-value pair argument), and $${w}_{j}^{\ast}$$ is the total weight of observation *j* (defined above).

If you set `'Standardize',true` and the `'Cost'`, `'Prior'`, or `'Weights'` name-value pair argument, then `fitcsvm` standardizes the predictors using their corresponding weighted means and weighted standard deviations. That is, `fitcsvm` standardizes predictor *j* ($${x}_{j}$$) using

$${x}_{j}^{\ast}=\frac{{x}_{j}-{\mu}_{j}^{\ast}}{{\sigma}_{j}^{\ast}}.$$

The weighted mean is

$${\mu}_{j}^{\ast}=\frac{1}{\sum _{k}{w}_{k}^{\ast}}\sum _{k}{w}_{k}^{\ast}{x}_{jk},$$

where $${x}_{jk}$$ is observation *k* (row) of predictor *j* (column). The weighted variance is

$${\left({\sigma}_{j}^{\ast}\right)}^{2}=\frac{{v}_{1}}{{v}_{1}^{2}-{v}_{2}}\sum _{k}{w}_{k}^{\ast}{\left({x}_{jk}-{\mu}_{j}^{\ast}\right)}^{2},$$

with

$${v}_{1}=\sum _{j}{w}_{j}^{\ast}$$

and

$${v}_{2}=\sum _{j}{\left({w}_{j}^{\ast}\right)}^{2}.$$
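The weight, box-constraint, and standardization formulas above can be checked numerically. The following plain-Python sketch evaluates them on a tiny made-up data set; it illustrates the arithmetic only and is not the `fitcsvm` implementation. All function names and data values are hypothetical.

```python
def total_weights(weights, labels, priors):
    """Total weight of each observation: within-class share times class prior."""
    total = sum(weights)
    normalized = [w / total for w in weights]      # weights now sum to 1
    class_sums = {}
    for w, k in zip(normalized, labels):
        class_sums[k] = class_sums.get(k, 0.0) + w
    return [w / class_sums[k] * priors[k] for w, k in zip(normalized, labels)]


def box_constraints(total_w, c0):
    """Per-observation box constraint: n * C0 * total weight."""
    n = len(total_w)
    return [n * c0 * w for w in total_w]


def weighted_standardize(column, total_w):
    """Standardize one predictor with the weighted mean and standard deviation."""
    v1 = sum(total_w)
    v2 = sum(w * w for w in total_w)
    mu = sum(w * x for w, x in zip(total_w, column)) / v1
    var = v1 / (v1 * v1 - v2) * sum(
        w * (x - mu) ** 2 for w, x in zip(total_w, column)
    )
    sigma = var ** 0.5
    return [(x - mu) / sigma for x in column], mu, sigma


# Hypothetical data: four observations, two classes, equal updated priors.
weights = [1.0, 1.0, 2.0, 2.0]
labels = [0, 0, 1, 1]
priors = {0: 0.5, 1: 0.5}

tw = total_weights(weights, labels, priors)
box = box_constraints(tw, c0=1.0)
z, mu, sigma = weighted_standardize([1.0, 2.0, 3.0, 4.0], tw)

print(tw, box, mu, sigma)
```

In this example every total weight comes out equal, so each box constraint equals the initial box constraint and the weighted variance reduces to the ordinary sample variance of the column.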

Assume that `p` is the proportion of outliers that you expect in the training data, and that you set `'OutlierFraction',p`.

For one-class learning, the software trains the bias term such that 100`p`% of the observations in the training data have negative scores.

The software implements *robust learning* for two-class learning. In other words, the software attempts to remove 100`p`% of the observations when the optimization algorithm converges. The removed observations correspond to gradients that are large in magnitude.
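To make the one-class statement concrete, the sketch below picks an offset for a list of already-computed scores so that a fraction `p` of them becomes negative. This only illustrates the stated outcome under the assumption of distinct scores; it is not how the solver trains the bias term, and the function name and score values are hypothetical.

```python
def offset_for_outlier_fraction(scores, p):
    """Return an offset so that about 100*p % of (score - offset) are negative."""
    ordered = sorted(scores)
    n = len(ordered)
    k = round(p * n)                  # number of scores that should be negative
    if k <= 0:
        return ordered[0] - 1.0       # offset below every score: no negatives
    if k >= n:
        return ordered[-1] + 1.0      # offset above every score: all negative
    # Halfway between the k-th and (k+1)-th smallest scores
    return 0.5 * (ordered[k - 1] + ordered[k])


# Hypothetical raw scores for ten training observations
scores = [0.2, 1.5, -0.3, 0.9, 2.1, 0.7, 1.1, 0.4, 1.8, 0.05]
p = 0.2

offset = offset_for_outlier_fraction(scores, p)
shifted = [s - offset for s in scores]
n_negative = sum(1 for s in shifted if s < 0)

print(offset, n_negative)
```

With tied scores at the cut point, fewer than the requested fraction would end up negative, which is why the sketch assumes distinct values.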

If your predictor data contains categorical variables, then the software generally uses full dummy encoding for these variables. The software creates one dummy variable for each level of each categorical variable.

The `PredictorNames` property stores one element for each of the original predictor variable names. For example, assume that there are three predictors, one of which is a categorical variable with three levels. Then `PredictorNames` is a 1-by-3 cell array of character vectors containing the original names of the predictor variables.

The `ExpandedPredictorNames` property stores one element for each of the predictor variables, including the dummy variables. For example, assume that there are three predictors, one of which is a categorical variable with three levels. Then `ExpandedPredictorNames` is a 1-by-5 cell array of character vectors containing the names of the predictor variables and the new dummy variables.

Similarly, the `Beta` property stores one beta coefficient for each predictor, including the dummy variables.

The `SupportVectors` property stores the predictor values for the support vectors, including the dummy variables. For example, assume that there are *m* support vectors and three predictors, one of which is a categorical variable with three levels. Then `SupportVectors` is an *m*-by-5 matrix.

The `X` property stores the training data as originally input and does not include the dummy variables. When the input is a table, `X` contains only the columns used as predictors.
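The three-predictor example above can be reproduced with a small plain-Python sketch of full dummy encoding. It shows only how two numeric predictors plus one three-level categorical predictor expand to five columns. The function name, the data, and the `name==level` naming pattern are hypothetical and are not claimed to match the strings the software stores.

```python
def full_dummy_encode(rows, categorical_levels):
    """Expand each categorical predictor into one 0/1 column per level.

    rows: list of dicts mapping predictor name to value
    categorical_levels: dict mapping categorical predictor name to its levels
    """
    names = list(rows[0].keys())

    # Build the expanded column names (hypothetical naming pattern)
    expanded_names = []
    for name in names:
        if name in categorical_levels:
            expanded_names += [f"{name}=={lvl}" for lvl in categorical_levels[name]]
        else:
            expanded_names.append(name)

    # Build the expanded numeric rows
    expanded_rows = []
    for row in rows:
        out = []
        for name in names:
            if name in categorical_levels:
                out += [1.0 if row[name] == lvl else 0.0
                        for lvl in categorical_levels[name]]
            else:
                out.append(float(row[name]))
        expanded_rows.append(out)
    return names, expanded_names, expanded_rows


rows = [
    {"x1": 0.5, "x2": 1.2, "color": "red"},
    {"x1": 1.5, "x2": 0.3, "color": "blue"},
]
levels = {"color": ["red", "green", "blue"]}

predictor_names, expanded_names, expanded = full_dummy_encode(rows, levels)

print(len(predictor_names), len(expanded_names))
print(expanded)
```

Here `predictor_names` plays the role of the three original names and `expanded_names` the role of the five expanded names; a coefficient vector or a support-vector matrix built on the expanded rows would likewise have five columns.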

For predictors specified in a table, if any of the variables contain ordered (ordinal) categories, the software uses ordinal encoding for these variables.

For a variable with *k* ordered levels, the software creates *k* – 1 dummy variables. The *j*th dummy variable is –1 for levels up to *j*, and +1 for levels *j* + 1 through *k*.

The names of the dummy variables stored in the `ExpandedPredictorNames` property indicate the first level with the value +1. The software stores *k* – 1 additional predictor names for the dummy variables, including the names of levels 2, 3, ..., *k*.
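The ordinal encoding rule above is short enough to write out directly. This is a minimal plain-Python sketch of that rule with a hypothetical function name, using 1-based level indices to match the text.

```python
def ordinal_encode(level, k):
    """Encode a 1-based ordinal level out of k levels as k-1 values of -1 or +1."""
    if not 1 <= level <= k:
        raise ValueError("level must be between 1 and k")
    # Dummy j (1-based) is -1 for levels up to j and +1 for levels j+1 through k
    return [-1 if level <= j else 1 for j in range(1, k)]


# All encodings for a variable with three ordered levels
codes = {level: ordinal_encode(level, 3) for level in (1, 2, 3)}
print(codes)  # {1: [-1, -1], 2: [1, -1], 3: [1, 1]}
```

Each of the *k* – 1 columns first takes the value +1 at a different level (2, 3, ..., *k*), which is consistent with the dummy variables being named after levels 2 through *k*.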

All solvers implement *L*1 soft-margin minimization.

For one-class learning, the software estimates the Lagrange multipliers, $${\alpha}_{1},...,{\alpha}_{n}$$, such that

$$\sum _{j=1}^{n}{\alpha}_{j}=n\nu .$$
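For orientation, the textbook form of the *L*1 soft-margin problem for two-class learning is sketched below. The notation is an assumption brought in from the general SVM literature and is not defined on this page: $$\beta$$ and $$b$$ are the hyperplane coefficients and bias, $$\xi_{j}$$ are slack variables, $$y_{j}\in \{-1,+1\}$$ are the class labels, and $$C$$ is a single box constraint. The per-observation box constraints $${C}_{j}$$ described earlier are not shown in this simplified form.

$$\underset{\beta ,b,\xi }{\min }\ \frac{1}{2}{\beta}^{\prime}\beta +C\sum _{j=1}^{n}{\xi}_{j}\quad \text{subject to}\quad {y}_{j}\left({x}_{j}^{\prime}\beta +b\right)\ge 1-{\xi}_{j},\ {\xi}_{j}\ge 0.$$

The term *L*1 refers to the slack variables entering the objective linearly, as $$\sum _{j}{\xi}_{j}$$, rather than squared.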
