Superclasses: CompactClassificationSVM
Support vector machine for binary classification
ClassificationSVM is a support vector machine classifier for one-class or two-class learning. To train a ClassificationSVM classifier, use fitcsvm.
Trained ClassificationSVM classifiers store the training data, parameter values, prior probabilities, support vectors, and algorithmic implementation information. You can use these classifiers to:
Estimate resubstitution predictions. For details, see resubPredict.
Predict labels or posterior probabilities for new data. For details, see predict.
Mdl = fitcsvm(Tbl,ResponseVarName) returns an SVM classifier (Mdl) trained using the sample data contained in the table Tbl. ResponseVarName is the name of the variable in Tbl that contains the class labels for one-class or two-class classification. For details, see fitcsvm.
Mdl = fitcsvm(Tbl,formula) returns an SVM classifier trained using the predictor data and class labels in the table Tbl. formula is an explanatory model of the response and a subset of predictor variables in Tbl used for training. For details, see fitcsvm.
Mdl = fitcsvm(Tbl,Y) returns an SVM classifier trained using the predictor variables in the table Tbl and class labels in the vector Y. For details, see fitcsvm.
Mdl = fitcsvm(X,Y) returns an SVM classifier trained using the predictors in the matrix X and class labels in the vector Y for one-class or two-class classification. For details, see fitcsvm.
Mdl = fitcsvm(___,Name,Value) returns a trained SVM classifier with additional options specified by one or more Name,Value pair arguments, using any of the previous syntaxes. For example, you can specify the type of cross-validation, the cost for misclassification, or the type of score transformation function. For name-value pair argument details, see fitcsvm.
If you set one of the following five options, then Mdl is a ClassificationPartitionedModel model: 'CrossVal', 'CVPartition', 'Holdout', 'KFold', or 'Leaveout'. Otherwise, Mdl is a ClassificationSVM classifier.
Code Generation support: Yes.
MATLAB Function Block support: Yes.

s-by-1 numeric vector of trained classifier
coefficients from the dual problem, that is, the estimated Lagrange
multipliers. s is the number of support vectors
in the trained classifier, that is, If you specify removing duplicates using  

Numeric vector of linear predictor coefficients. If your predictor data contains categorical variables, then
the software uses full dummy encoding for these variables. The software
creates one dummy variable for each level of each categorical variable. If $$f\left(x\right)=\left(x/s\right)\prime \beta +b.$$
If  

Scalar corresponding to the trained classifier bias term.  

n-by-1 numeric vector of box constraints. n is
the number of observations in the training data (see the If you specify removing duplicates using  

Structure array containing:
 

Indices of categorical predictors, stored as a numeric vector.  

List of elements in  

Structure array containing convergence information.
 

Square matrix, where During training, the software updates the prior probabilities by incorporating the penalties described in the cost matrix. Therefore,
This property is read-only. For more details, see Algorithms.

Expanded predictor names, stored as a cell array of character vectors. If the model uses encoding for categorical variables, then  

Numeric vector of training data gradient values.  

Description of the cross-validation optimization of hyperparameters,
stored as a
 

n-by-1 logical vector indicating whether
a corresponding row in the predictor data matrix is a support vector. n is
the number of observations in the training data (see If you specify removing duplicates using  

Structure array containing the kernel name and parameter values. To display the values of The software accepts  

Object containing parameter values, e.g., the name-value pair
argument values, used to train the SVM classifier. Access fields of  

Numeric vector of predictor means. If you specify If your predictor data contains categorical variables, then
the software uses full dummy encoding for these variables. The software
creates one dummy variable for each level of each categorical variable. If  

Positive integer indicating the number of iterations required by the optimization routine to attain convergence. To set a limit on the number of iterations to, e.g.,  

Positive scalar representing the ν parameter for one-class learning.

Numeric scalar representing the number
of observations in the training data. If the input arguments  

Scalar indicating the expected proportion of outliers in the training data.  

Cell array of character vectors containing the predictor names, in the order that they appear in the training data.  

Numeric vector of prior probabilities for each class. The order
of the elements of For two-class learning, if you specify a cost matrix, then the software updates the prior probabilities by incorporating the penalties described in the cost matrix. This property is read-only. For more details, see Algorithms.

Character vector describing the response variable  

Character vector representing a built-in transformation function, or a function handle for transforming predicted classification scores. To change the score transformation function to, e.g.,
 

Nonnegative integer indicating the shrinkage period, i.e., number of iterations between reductions of the active set. To set the shrinkage period to, e.g.,  

Numeric vector of predictor standard deviations. If you specify If your predictor data contains categorical variables, then
the software uses full dummy encoding for these variables. The software
creates one dummy variable for each level of each categorical variable. If  

Character vector indicating the solving routine that the software used to train the SVM classifier. To set the solver to, e.g.,  

s-by-p numeric matrix
containing rows of If you specify If you specify removing duplicates using  

s-by-1 numeric vector of support vector class
labels. s is the number of support vectors in the
trained classifier, that is, A value of If you specify removing duplicates using  

Numeric vector of observation weights that the software used to train the SVM classifier. The length of
 

Numeric matrix of unstandardized predictor values that the software used to train the SVM classifier. Each row of The software excludes predictor data rows removed due to  

Categorical or character array, logical or numeric vector, or
cell array of character vectors representing the observed class labels
used to train the SVM classifier. Each row of The software excludes elements removed due to 
compact  Compact support vector machine classifier 
crossval  Cross-validated support vector machine classifier 
fitPosterior  Fit posterior probabilities 
resubEdge  Classification edge for support vector machine classifiers by resubstitution 
resubLoss  Classification loss for support vector machine classifiers by resubstitution 
resubMargin  Classification margins for support vector machine classifiers by resubstitution 
resubPredict  Predict support vector machine classifier resubstitution responses 
resume  Resume training support vector machine classifier 
compareHoldout  Compare accuracies of two classification models using new data 
discardSupportVectors  Discard support vectors for linear support vector machine models 
edge  Classification edge for support vector machine classifiers 
fitPosterior  Fit posterior probabilities 
loss  Classification error for support vector machine classifiers 
margin  Classification margins for support vector machine classifiers 
predict  Predict labels using support vector machine classification model 
A parameter that controls the maximum penalty imposed on margin-violating observations, and aids in preventing overfitting (regularization).
If you increase the box constraint, then the SVM classifier assigns fewer support vectors. However, increasing the box constraint can lead to longer training times.
The Gram matrix of a set of n vectors {x_{1},..,x_{n}; x_{j} ∊ R^{p}} is an n-by-n matrix with element (j,k) defined as G(x_{j},x_{k}) = <ϕ(x_{j}),ϕ(x_{k})>, an inner product of the transformed predictors using the kernel function ϕ.
For nonlinear SVM, the algorithm forms a Gram matrix using the rows of the predictor matrix (the observations). The dual formalization replaces the inner product of the predictors with corresponding elements of the resulting Gram matrix (called the "kernel trick"). Subsequently, nonlinear SVM operates in the transformed predictor space to find a separating hyperplane.
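As a rough illustration of this definition, the following sketch builds a Gram matrix for three observations. It is written in plain Python only to make the arithmetic explicit and is not how fitcsvm forms the matrix. The Gaussian kernel, the gamma value, and the function names are assumptions made for the example; the kernel returns the inner product of the transformed predictors without computing ϕ explicitly.

```python
import math

def gaussian_kernel(u, v, gamma=0.5):
    """G(u, v) = exp(-gamma * ||u - v||^2); gamma is an illustrative choice."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-gamma * sq_dist)

def gram_matrix(X, kernel):
    """Return the n-by-n matrix whose (j, k) element is kernel(X[j], X[k])."""
    return [[kernel(xj, xk) for xk in X] for xj in X]

# Three observations (rows) with two predictors each.
X = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]]
G = gram_matrix(X, gaussian_kernel)
```

The result is symmetric, and with this particular kernel every diagonal element equals 1.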
KKT complementarity conditions are optimization constraints required for optimal nonlinear programming solutions.
In SVM, the KKT complementarity conditions are
$$\left\{\begin{array}{l}{\alpha}_{j}\left[{y}_{j}f\left({x}_{j}\right)-1+{\xi}_{j}\right]=0\\ {\xi}_{j}\left(C-{\alpha}_{j}\right)=0\end{array}\right.$$
for all j = 1,...,n, where $$f\left({x}_{j}\right)=\varphi \left({x}_{j}\right)\prime \beta +b,$$ ϕ is a kernel function (see Gram matrix), and ξ_{j} is a slack variable. If the classes are perfectly separable, then ξ_{j} = 0 for all j = 1,...,n.
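To make the two conditions concrete, the sketch below evaluates their left-hand sides, α_{j}[y_{j}f(x_{j}) − 1 + ξ_{j}] and ξ_{j}(C − α_{j}), for three hand-picked observations. It uses plain Python for illustration; the function name and the sample values are hypothetical and do not come from a trained model.

```python
def kkt_residuals(alpha, y, f, xi, C):
    """Return the left-hand sides of both complementarity conditions.

    alpha: Lagrange multipliers, y: labels (+1 or -1), f: scores f(x_j),
    xi: slack variables, C: box constraint.
    """
    first = [a * (yj * fj - 1.0 + s) for a, yj, fj, s in zip(alpha, y, f, xi)]
    second = [s * (C - a) for a, s in zip(alpha, xi)]
    return first, second

C = 1.0
y = [1, -1, 1]
f = [2.0, -1.0, 0.5]      # scores f(x_j)
alpha = [0.0, 0.4, 1.0]   # outside the margin, on the margin, violating it
xi = [0.0, 0.0, 0.5]      # only the third observation needs slack

first, second = kkt_residuals(alpha, y, f, xi, C)
```

Every residual is zero here: each observation either has α_{j} = 0, lies exactly on its margin boundary, or has α_{j} = C together with positive slack.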
One-class learning, or unsupervised SVM, aims at separating data from the origin in the high-dimensional predictor space (not the original predictor space), and is an algorithm used for outlier detection.
The algorithm resembles that of SVM for binary classification. The objective is to minimize the dual expression
$$0.5{\displaystyle \sum _{jk}{\alpha}_{j}{\alpha}_{k}G({x}_{j},{x}_{k})}$$
with respect to $${\alpha}_{1},\mathrm{...},{\alpha}_{n}$$, subject to
$$\sum {\alpha}_{j}=n\nu $$
and $$0\le {\alpha}_{j}\le 1$$ for all j = 1,...,n. G(x_{j},x_{k}) is element (j,k) of the Gram matrix.
A small value of ν leads to fewer support vectors, and, therefore, a smooth, crude decision boundary. A large value of ν leads to more support vectors, and therefore, a curvy, flexible decision boundary. The optimal value of ν should be large enough to capture the data complexity and small enough to avoid overtraining. Also, 0 < ν ≤ 1.
For more details, see [2].
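The one-class constraints and objective can be checked numerically for a candidate set of multipliers: the objective is 0.5 Σ_{j,k} α_{j}α_{k}G(x_{j},x_{k}), with Σ α_{j} = nν and 0 ≤ α_{j} ≤ 1. The sketch below does so in plain Python. The Gram matrix entries, the α values, and the function names are invented for the example, and the candidate is merely feasible, not necessarily optimal.

```python
def one_class_objective(alpha, G):
    """Evaluate 0.5 * sum_j sum_k alpha_j * alpha_k * G[j][k]."""
    n = len(alpha)
    return 0.5 * sum(alpha[j] * alpha[k] * G[j][k]
                     for j in range(n) for k in range(n))

def is_feasible(alpha, nu, tol=1e-12):
    """Check sum(alpha) = n * nu and 0 <= alpha_j <= 1 for every j."""
    n = len(alpha)
    total_ok = abs(sum(alpha) - n * nu) <= tol
    bounds_ok = all(0.0 <= a <= 1.0 for a in alpha)
    return total_ok and bounds_ok

G = [[1.0, 0.5, 0.0, 0.0],
     [0.5, 1.0, 0.0, 0.0],
     [0.0, 0.0, 1.0, 0.5],
     [0.0, 0.0, 0.5, 1.0]]
nu = 0.5
alpha = [1.0, 0.0, 1.0, 0.0]   # sums to n * nu = 2

feasible = is_feasible(alpha, nu)
objective = one_class_objective(alpha, G)
num_support_vectors = sum(1 for a in alpha if a > 0)
```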
Support vectors are observations corresponding to strictly positive estimates of α_{1},...,α_{n}.
SVM classifiers that yield fewer support vectors for a given training set are more desirable.
The SVM binary classification algorithm searches for an optimal hyperplane that separates the data into two classes. For separable classes, the optimal hyperplane maximizes a margin (space that does not contain any observations) surrounding itself, which creates boundaries for the positive and negative classes. For inseparable classes, the objective is the same, but the algorithm imposes a penalty on the length of the margin for every observation that is on the wrong side of its class boundary.
The linear SVM score function is
$$f(x)=x\prime \beta +b,$$
where:
x is an observation (corresponding to a row of X).
The vector β contains the coefficients that define an orthogonal vector to the hyperplane (corresponding to Mdl.Beta). For separable data, the optimal margin length is $$2/\Vert \beta \Vert .$$
b is the bias term (corresponding to Mdl.Bias).
The root of f(x) for particular coefficients defines a hyperplane. For a particular hyperplane, f(z) is proportional to the signed distance from point z to the hyperplane; dividing f(z) by $$\Vert \beta \Vert $$ gives the distance itself.
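As a small numeric illustration of the score function, the sketch below evaluates f at a point and converts the score to a geometric distance by dividing by ‖β‖. It is plain Python with made-up coefficients; in a trained linear model the corresponding values would come from Mdl.Beta and Mdl.Bias.

```python
import math

def linear_score(x, beta, b):
    """f(x) = x'beta + b for one observation x."""
    return sum(xi * bi for xi, bi in zip(x, beta)) + b

beta = [3.0, 4.0]    # illustrative coefficients, so the norm of beta is 5
b = -5.0             # illustrative bias
z = [3.0, 4.0]

score = linear_score(z, beta, b)                  # 9 + 16 - 5 = 20
norm_beta = math.sqrt(sum(c * c for c in beta))   # 5
signed_distance = score / norm_beta               # 4
margin_length = 2.0 / norm_beta                   # 0.4
```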
The algorithm searches for the maximum margin length, while keeping observations in the positive (y = 1) and negative (y = –1) classes separate. Therefore:
For separable classes, the objective is to minimize $$\Vert \beta \Vert $$ with respect to the β and b subject to y_{j}f(x_{j}) ≥ 1, for all j = 1,..,n. This is the primal formalization for separable classes.
For inseparable classes, the algorithm uses slack variables (ξ_{j}) to penalize the objective function for observations that cross the margin boundary for their class. ξ_{j} = 0 for observations that do not cross the margin boundary for their class, otherwise ξ_{j} ≥ 0.
The objective is to minimize $$0.5{\Vert \beta \Vert}^{2}+C{\displaystyle \sum {\xi}_{j}}$$ with respect to the β, b, and ξ_{j} subject to $${y}_{j}f({x}_{j})\ge 1-{\xi}_{j}$$ and $${\xi}_{j}\ge 0$$ for all j = 1,..,n, and for a positive scalar box constraint C. This is the primal formalization for inseparable classes.
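For fixed β and b, the smallest slack that satisfies both constraints is ξ_{j} = max(0, 1 − y_{j}f(x_{j})). The sketch below uses that fact to evaluate the inseparable-class primal objective on four hand-made observations. It is plain Python for illustration only; the data, coefficients, and function name are assumptions, and no optimization is performed.

```python
def soft_margin_objective(beta, b, X, y, C):
    """Return 0.5*||beta||^2 + C*sum(xi) and the slack values xi."""
    scores = [sum(xi * bi for xi, bi in zip(x, beta)) + b for x in X]
    slacks = [max(0.0, 1.0 - yj * fj) for yj, fj in zip(y, scores)]
    value = 0.5 * sum(c * c for c in beta) + C * sum(slacks)
    return value, slacks

X = [[2.0, 0.0], [0.5, 0.0], [-1.0, 0.0], [0.5, 1.0]]
y = [1, 1, -1, -1]
value, slacks = soft_margin_objective([1.0, 0.0], 0.0, X, y, C=1.0)
```

Only the second and fourth observations cross their margin boundary, so only they contribute slack; a larger C would weight those violations more heavily in the objective.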
The algorithm uses the Lagrange multipliers method to optimize the objective. This introduces n coefficients α_{1},...,α_{n} (corresponding to Mdl.Alpha). The dual formalizations for linear SVM are:
For separable classes, minimize
$$0.5{\displaystyle \sum _{j=1}^{n}{\displaystyle \sum _{k=1}^{n}{\alpha}_{j}{\alpha}_{k}{y}_{j}{y}_{k}{x}_{j}\prime {x}_{k}}}-{\displaystyle \sum _{j=1}^{n}{\alpha}_{j}}$$
with respect to α_{1},...,α_{n}, subject to $$\sum {\alpha}_{j}{y}_{j}=0$$, α_{j} ≥ 0 for all j = 1,...,n, and Karush-Kuhn-Tucker (KKT) complementarity conditions.
For inseparable classes, the objective is the same as for separable classes, except for the additional condition $$0\le {\alpha}_{j}\le C$$ for all j = 1,..,n.
The resulting score function is
$$\widehat{f}(x)={\displaystyle \sum _{j=1}^{n}{\widehat{\alpha}}_{j}}{y}_{j}x\prime {x}_{j}+\widehat{b}.$$
$$\widehat{b}$$ is the estimate of the bias and $${\widehat{\alpha}}_{j}$$ is the jth estimate of the vector $$\widehat{\alpha}$$, j = 1,...,n. Written this way, the score function is free of the estimate of β as a result of the primal formalization.
The SVM algorithm classifies a new observation z using $$\text{sign}\left(\widehat{f}\left(z\right)\right).$$
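The dual form of the score can be checked against the primal form on a toy problem. In the plain-Python sketch below, the two training points, the multipliers, and the helper names are chosen by hand for illustration; the multipliers satisfy Σ α_{j}y_{j} = 0, and β is recovered as Σ α_{j}y_{j}x_{j}, the standard relation between the primal and dual solutions.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def dual_score(x, alpha, y, X, b):
    """f(x) = sum_j alpha_j * y_j * x'x_j + b."""
    return sum(a * yj * dot(x, xj) for a, yj, xj in zip(alpha, y, X)) + b

def sign(v):
    """Return 1, -1, or 0 according to the sign of v."""
    return (v > 0) - (v < 0)

X = [[1.0, 1.0], [-1.0, -1.0]]
y = [1, -1]
alpha = [0.25, 0.25]
b = 0.0

# Equivalent primal coefficients: beta = sum_j alpha_j * y_j * x_j.
beta = [sum(a * yj * xj[i] for a, yj, xj in zip(alpha, y, X)) for i in range(2)]

z = [2.0, 0.0]
score = dual_score(z, alpha, y, X, b)
label = sign(score)
```

Both forms give the same score for z, and both training points sit exactly on their margin boundaries (y_{j}f(x_{j}) = 1).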
In some cases, there is a nonlinear boundary separating the classes. Nonlinear SVM works in a transformed predictor space to find an optimal, separating hyperplane.
The dual formalization for nonlinear SVM is
$$0.5{\displaystyle \sum _{j=1}^{n}{\displaystyle \sum _{k=1}^{n}{\alpha}_{j}{\alpha}_{k}{y}_{j}{y}_{k}G({x}_{j},{x}_{k})}}-{\displaystyle \sum _{j=1}^{n}{\alpha}_{j}}$$
with respect to α_{1},...,α_{n}, subject to $$\sum {\alpha}_{j}{y}_{j}=0$$, $$0\le {\alpha}_{j}\le C$$ for all j = 1,..,n, and the KKT complementarity conditions. G(x_{k},x_{j}) are elements of the Gram matrix. The resulting score function is
$$\widehat{f}(x)={\displaystyle \sum _{j=1}^{n}{\widehat{\alpha}}_{j}}{y}_{j}G(x,{x}_{j})+\widehat{b}.$$
For more details, see Understanding Support Vector Machines, [1], and [3].
Value. To learn how value classes affect copy operations, see Copying Objects in the MATLAB documentation.
NaN, <undefined>, and empty character vector ('') values indicate missing values. fitcsvm removes entire rows of data corresponding to a missing response. When computing total weights (see the next bullets), fitcsvm ignores any weight corresponding to an observation with at least one missing predictor. This action can lead to unbalanced prior probabilities in balanced-class problems. Consequently, observation box constraints might not equal BoxConstraint.
fitcsvm removes observations that have zero weight or prior probability.
For two-class learning, if you specify the cost matrix $$\mathcal{C}$$ (see Cost), then the software updates the class prior probabilities p (see Prior) to p_{c} by incorporating the penalties described in $$\mathcal{C}$$. Specifically, fitcsvm:
Computes $${p}_{c}^{\ast}=p\prime \mathcal{C}.$$
Normalizes p_{c}^{*} so that the updated prior probabilities sum to 1:
$${p}_{c}=\frac{1}{{\displaystyle \sum _{j=1}^{K}{p}_{c,j}^{\ast}}}{p}_{c}^{\ast}.$$
K is the number of classes.
Resets the cost matrix to the default:
$$\mathcal{C}=\left[\begin{array}{cc}0& 1\\ 1& 0\end{array}\right].$$
Removes observations from the training data corresponding to classes with zero prior probability.
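The first two steps can be traced by hand. The plain-Python sketch below applies the formula exactly as written, the product p'𝒞 followed by normalization, to a made-up prior vector and cost matrix. The numbers and the function name are illustrative only and are not taken from fitcsvm.

```python
def update_priors(p, cost):
    """Compute the row vector p'C and rescale it so the result sums to 1."""
    K = len(p)
    raw = [sum(p[i] * cost[i][j] for i in range(K)) for j in range(K)]
    total = sum(raw)
    return [v / total for v in raw]

p = [0.5, 0.5]
cost = [[0.0, 1.0],
        [3.0, 0.0]]
p_c = update_priors(p, cost)   # p'C = [1.5, 0.5] before normalization
```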
For two-class learning, fitcsvm normalizes all observation weights (see Weights) to sum to 1. It then renormalizes the normalized weights to sum to the updated prior probability of the class to which the observation belongs. That is, the total weight for observation j in class k is
$${w}_{j}^{\ast}=\frac{{w}_{j}}{{\displaystyle \sum _{\forall j\in \text{Class }k}{w}_{j}}}{p}_{c,k}.$$
w_{j} is the normalized weight for observation j; p_{c,k} is the updated prior probability of class k (see previous bullet).
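The two normalization steps can be sketched as follows, in plain Python. The raw weights, class labels, and updated priors are invented for the example, and the function name is hypothetical.

```python
def total_weights(w, labels, p_c):
    """Return w_j* = w_j / (sum of w over the class of j) * p_c[class of j]."""
    s = sum(w)
    w_norm = [v / s for v in w]                      # weights now sum to 1
    class_sum = {}
    for v, k in zip(w_norm, labels):
        class_sum[k] = class_sum.get(k, 0.0) + v     # per-class weight totals
    return [v / class_sum[k] * p_c[k] for v, k in zip(w_norm, labels)]

w = [1.0, 1.0, 2.0, 4.0]
labels = [0, 0, 1, 1]          # class index of each observation
p_c = [0.75, 0.25]             # updated prior probability of each class
w_star = total_weights(w, labels, p_c)
```

After this step, the total weights within each class add up to that class's updated prior probability, so all total weights together sum to 1.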
For two-class learning, fitcsvm assigns a box constraint to each observation in the training data. The formula for the box constraint of observation j is
$${C}_{j}=n{C}_{0}{w}_{j}^{\ast}.$$
n is the training sample size, C_{0} is the initial box constraint (see BoxConstraint), and $${w}_{j}^{\ast}$$ is the total weight of observation j (see previous bullet).
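As a quick numeric check of the box constraint formula, the plain-Python sketch below converts total weights into per-observation box constraints. The weights and the initial box constraint are illustrative values, not output from fitcsvm.

```python
def box_constraints(w_star, C0):
    """C_j = n * C0 * w_j*, where n is the number of observations."""
    n = len(w_star)
    return [n * C0 * v for v in w_star]

w_star = [0.375, 0.375, 0.125, 0.125]   # total weights, summing to 1
C = box_constraints(w_star, C0=1.0)
uniform = box_constraints([0.25] * 4, C0=1.0)
```

When every observation carries the same total weight 1/n, each C_{j} equals C_{0}; unequal weights or priors move individual box constraints away from that value.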
If you set 'Standardize',true and any of 'Cost', 'Prior', or 'Weights', then fitcsvm standardizes the predictors using their corresponding weighted means and weighted standard deviations. That is, fitcsvm standardizes predictor j (x_{j}) using
$${x}_{j}^{\ast}=\frac{{x}_{j}-{\mu}_{j}^{\ast}}{{\sigma}_{j}^{\ast}}.$$
$${\mu}_{j}^{\ast}=\frac{1}{{\displaystyle \sum _{k}{w}_{k}^{\ast}}}{\displaystyle \sum _{k}{w}_{k}^{\ast}{x}_{jk}}.$$
x_{jk} is observation k (row) of predictor j (column).
$${\left({\sigma}_{j}^{\ast}\right)}^{2}=\frac{{v}_{1}}{{v}_{1}^{2}-{v}_{2}}{\displaystyle \sum _{k}{w}_{k}^{\ast}{\left({x}_{jk}-{\mu}_{j}^{\ast}\right)}^{2}}.$$
$${v}_{1}={\displaystyle \sum _{j}{w}_{j}^{\ast}}.$$
$${v}_{2}={\displaystyle \sum _{j}{\left({w}_{j}^{\ast}\right)}^{2}}.$$
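The weighted mean and standard deviation can be reproduced in a few lines. The plain-Python sketch below follows the formulas above, reading the numerator of the standardization as x_{j} − μ_{j}^{*}, the variance denominator as v_{1}^{2} − v_{2}, and the squared term as (x_{jk} − μ_{j}^{*})^{2}. The data, weights, and function name are made up for the example.

```python
import math

def weighted_standardize(x, w):
    """Standardize one predictor column x using observation weights w."""
    v1 = sum(w)
    v2 = sum(v * v for v in w)
    mu = sum(wk * xk for wk, xk in zip(w, x)) / v1
    var = v1 / (v1 ** 2 - v2) * sum(wk * (xk - mu) ** 2 for wk, xk in zip(w, x))
    sigma = math.sqrt(var)
    return [(xk - mu) / sigma for xk in x], mu, sigma

x = [1.0, 2.0, 3.0, 4.0]
w = [0.25, 0.25, 0.25, 0.25]
x_std, mu, sigma = weighted_standardize(x, w)
```

With equal weights, the weighted variance reduces to the usual sample variance with an n − 1 denominator (here 5/3), which is a convenient sanity check on the formula.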
Let p be the proportion of outliers that you expect in the training data. If you set 'OutlierFraction',p, then:
For one-class learning, the software trains the bias term such that 100p% of the observations in the training data have negative scores.
The software implements robust learning for two-class learning. In other words, the software attempts to remove 100p% of the observations when the optimization algorithm converges. The removed observations correspond to gradients that are large in magnitude.
If your predictor data contains categorical variables, then the software generally uses full dummy encoding for these variables. The software creates one dummy variable for each level of each categorical variable.
The PredictorNames property stores one element for each of the original predictor variable names. For example, assume that there are three predictors, one of which is a categorical variable with three levels. Then PredictorNames is a 1-by-3 cell array of character vectors containing the original names of the predictor variables.
The ExpandedPredictorNames property stores one element for each of the predictor variables, including the dummy variables. For example, assume that there are three predictors, one of which is a categorical variable with three levels. Then ExpandedPredictorNames is a 1-by-5 cell array of character vectors containing the names of the predictor variables and the new dummy variables.
Similarly, the Beta property stores one beta coefficient for each predictor, including the dummy variables.
The SupportVectors property stores the predictor values for the support vectors, including the dummy variables. For example, assume that there are m support vectors and three predictors, one of which is a categorical variable with three levels. Then SupportVectors is an m-by-5 matrix.
The X property stores the training data as originally input. It does not include the dummy variables. When the input is a table, X contains only the columns used as predictors.
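Full dummy encoding can be illustrated with a short sketch. The plain-Python code below expands one categorical predictor with three levels and leaves two numeric predictors untouched, so each row grows from three values to five. The predictor values, level names, and function names are hypothetical, and the code is not the toolbox's encoder.

```python
def full_dummy(value, levels):
    """One indicator per level: 1 for the matching level, 0 otherwise."""
    return [1 if value == level else 0 for level in levels]

def expand_row(row, cat_index, levels):
    """Replace the categorical entry at cat_index with its dummy variables."""
    return row[:cat_index] + full_dummy(row[cat_index], levels) + row[cat_index + 1:]

levels = ["low", "medium", "high"]
rows = [[5.1, "low", 0.2],
        [6.3, "high", 1.8]]
expanded = [expand_row(r, 1, levels) for r in rows]
```

The five columns of each expanded row correspond to the widths described above for ExpandedPredictorNames, Beta, and SupportVectors, while the three-column input corresponds to what X and PredictorNames describe.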
For predictors specified in a table, if any of the variables contain ordered (ordinal) categories, the software uses ordinal encoding for these variables.
For a variable having k ordered levels, the software creates k – 1 dummy variables. The jth dummy variable is –1 for levels up to j, and +1 for levels j + 1 through k.
The names of the dummy variables stored in the ExpandedPredictorNames property indicate the first level with the value +1. The software stores k – 1 additional predictor names for the dummy variables, including the names of levels 2, 3, ..., k.
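The ordinal scheme can be sketched as follows, in plain Python, taking the jth dummy variable to be −1 for levels up to j and +1 for levels j + 1 through k. The function name and the three-level example are illustrative.

```python
def ordinal_dummy(level, k):
    """Encode a 1-based ordered level as k - 1 values of -1 or +1."""
    return [-1 if level <= j else 1 for j in range(1, k)]

# Encodings for a variable with k = 3 ordered levels.
codes = {level: ordinal_dummy(level, 3) for level in (1, 2, 3)}
```

Each step up in level flips exactly one dummy variable from −1 to +1, which is why the jth dummy variable can be named after level j + 1, the first level at which it takes the value +1.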
All solvers implement L1 soft-margin minimization.
fitcsvm and svmtrain use, among other algorithms, SMO for optimization. The software implements SMO differently between the two functions, but numerical studies show that there is sensible agreement in the results.
For one-class learning, the software estimates the Lagrange multipliers, α_{1},...,α_{n}, such that
$$\sum _{j=1}^{n}{\alpha}_{j}=n\nu .$$
Notes and limitations for code generation when training a ClassificationSVM model include:
The class labels input argument value (Y) cannot be a categorical array.
The ClassNames name-value pair argument cannot be a categorical array.
You cannot use the CategoricalPredictors name-value pair argument or supply a table containing at least one categorical predictor. That is, code generation does not support categorical predictors. To dummy-code variables that you want treated as categorical, use dummyvar.
Code generation does not support one-class learning.
You cannot specify a score transformation function by using the ScoreTransform name-value pair argument or by assigning the ScoreTransform object property. Consequently, saveCompactModel cannot accept compact SVM models equipped to estimate class posterior probabilities, that is, models returned by fitPosterior or fitSVMPosterior.
You can use this function in the MATLAB Function Block in Simulink®.
[1] Hastie, T., R. Tibshirani, and J. Friedman. The Elements of Statistical Learning, Second Edition. NY: Springer, 2008.
[2] Scholkopf, B., J. C. Platt, J. C. Shawe-Taylor, A. J. Smola, and R. C. Williamson. "Estimating the Support of a High-Dimensional Distribution." Neural Comput., Vol. 13, Number 7, 2001, pp. 1443–1471.
[3] Cristianini, N., and J. C. Shawe-Taylor. An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods. Cambridge, UK: Cambridge University Press, 2000.
[4] Scholkopf, B. and A. Smola. Learning with Kernels: Support Vector Machines, Regularization, Optimization and Beyond, Adaptive Computation and Machine Learning. Cambridge, MA: The MIT Press, 2002.