Train binary support vector machine classifier
SVMModel = fitcsvm(X,Y) returns a support vector machine classifier SVMModel, trained by predictors X and class labels Y for one- or two-class classification.

SVMModel = fitcsvm(X,Y,Name,Value) returns a support vector machine classifier with additional options specified by one or more Name,Value pair arguments. For example, you can specify the type of cross-validation, the cost for misclassification, or the type of score transformation function.
Load Fisher's iris data set. Remove the sepal lengths and widths, and all observed setosa irises.
load fisheriris
inds = ~strcmp(species,'setosa');
X = meas(inds,3:4);
y = species(inds);
Train an SVM classifier using the processed data set.
SVMModel = fitcsvm(X,y)
SVMModel = 

  ClassificationSVM
     PredictorNames: {'x1'  'x2'}
       ResponseName: 'Y'
         ClassNames: {'versicolor'  'virginica'}
     ScoreTransform: 'none'
    NumObservations: 100
              Alpha: [24x1 double]
               Bias: -14.4149
   KernelParameters: [1x1 struct]
     BoxConstraints: [100x1 double]
    ConvergenceInfo: [1x1 struct]
    IsSupportVector: [100x1 logical]
             Solver: 'SMO'
The Command Window shows that SVMModel is a trained ClassificationSVM classifier and a property list. Display the properties of SVMModel, for example, to determine the class order, by using dot notation.
classOrder = SVMModel.ClassNames
classOrder = 
    'versicolor'
    'virginica'
The first class ('versicolor') is the negative class, and the second ('virginica') is the positive class. You can change the class order during training by using the 'ClassNames' name-value pair argument.
Plot a scatter diagram of the data and circle the support vectors.
sv = SVMModel.SupportVectors;
figure
gscatter(X(:,1),X(:,2),y)
hold on
plot(sv(:,1),sv(:,2),'ko','MarkerSize',10)
legend('versicolor','virginica','Support Vector')
hold off
The support vectors are observations that occur on or beyond their estimated class boundaries.
You can adjust the boundaries (and therefore the number of support vectors) by setting a box constraint during training using the 'BoxConstraint' name-value pair argument.
Load the ionosphere data set.

load ionosphere
rng(1); % For reproducibility
Train an SVM classifier using the radial basis kernel. Let the software find a scale value for the kernel function. It is good practice to standardize the predictors.
SVMModel = fitcsvm(X,Y,'Standardize',true,'KernelFunction','RBF',...
    'KernelScale','auto');
SVMModel is a trained ClassificationSVM classifier.

Cross-validate the SVM classifier. By default, the software uses 10-fold cross-validation.
CVSVMModel = crossval(SVMModel);
CVSVMModel is a ClassificationPartitionedModel cross-validated classifier.

Estimate the out-of-sample misclassification rate.
classLoss = kfoldLoss(CVSVMModel)
classLoss = 0.0484
The generalization rate is approximately 5%.
Load Fisher's iris data set. Remove the petal lengths and widths. Treat all irises as coming from the same class.
load fisheriris
X = meas(:,1:2);
y = ones(size(X,1),1);
Train an SVM classifier using the processed data set. Assume that 5% of the observations are outliers. It is good practice to standardize the predictors.
rng(1);
SVMModel = fitcsvm(X,y,'KernelScale','auto','Standardize',true,...
    'OutlierFraction',0.05);
SVMModel is a trained ClassificationSVM classifier. By default, the software uses the Gaussian kernel for one-class learning.
Plot the observations and the decision boundary. Flag the support vectors and potential outliers.
svInd = SVMModel.IsSupportVector;
h = 0.02; % Mesh grid step size
[X1,X2] = meshgrid(min(X(:,1)):h:max(X(:,1)),...
    min(X(:,2)):h:max(X(:,2)));
[~,score] = predict(SVMModel,[X1(:),X2(:)]);
scoreGrid = reshape(score,size(X1,1),size(X2,2));
figure
plot(X(:,1),X(:,2),'k.')
hold on
plot(X(svInd,1),X(svInd,2),'ro','MarkerSize',10)
contour(X1,X2,scoreGrid)
colorbar;
title('{\bf Iris Outlier Detection via One-Class SVM}')
xlabel('Sepal Length (cm)')
ylabel('Sepal Width (cm)')
legend('Observation','Support Vector')
hold off
The boundary separating the outliers from the rest of the data occurs where the contour value is 0.

Verify that the fraction of observations with negative scores in the cross-validated data is close to 5%.
CVSVMModel = crossval(SVMModel);
[~,scorePred] = kfoldPredict(CVSVMModel);
outlierRate = mean(scorePred<0)
outlierRate = 0.0467
Load Fisher's iris data set. Use the petal lengths and widths.
load fisheriris
X = meas(:,3:4);
Y = species;
Examine a scatter plot of the data.
figure
gscatter(X(:,1),X(:,2),Y);
h = gca;
lims = [h.XLim h.YLim]; % Extract the x and y axis limits
title('{\bf Scatter Diagram of Iris Measurements}');
xlabel('Petal Length (cm)');
ylabel('Petal Width (cm)');
legend('Location','Northwest');
There are three classes, one of which is linearly separable from the others.
For each class:
Create a logical vector (indx) indicating whether an observation is a member of the class.

Train an SVM classifier using the predictor data and indx.
Store the classifier in a cell of a cell array.
% It is good practice to define the class order and standardize the
% predictors.
SVMModels = cell(3,1);
classes = unique(Y);
rng(1); % For reproducibility

for j = 1:numel(classes)
    indx = strcmp(Y,classes(j)); % Create binary classes for each classifier
    SVMModels{j} = fitcsvm(X,indx,'ClassNames',[false true],'Standardize',true,...
        'KernelFunction','rbf','BoxConstraint',1);
end
SVMModels is a 3-by-1 cell array, with each cell containing a ClassificationSVM classifier. For each cell, the positive class is setosa, versicolor, and virginica, respectively.
Define a fine grid within the plot, and treat the coordinates as new observations from the distribution of the training data. Estimate the score of the new observations using each classifier.
d = 0.02;
[x1Grid,x2Grid] = meshgrid(min(X(:,1)):d:max(X(:,1)),...
    min(X(:,2)):d:max(X(:,2)));
xGrid = [x1Grid(:),x2Grid(:)];
N = size(xGrid,1);
Scores = zeros(N,numel(classes));

for j = 1:numel(classes)
    [~,score] = predict(SVMModels{j},xGrid);
    Scores(:,j) = score(:,2); % Second column contains positive-class scores
end
Each row of Scores contains three scores. The index of the element with the largest score is the index of the class to which the new observation most likely belongs.
Associate each new observation with the classifier that gives it the maximum score.
[~,maxScore] = max(Scores,[],2);
Color in the regions of the plot based on the class to which the corresponding new observation belongs.
figure
h(1:3) = gscatter(xGrid(:,1),xGrid(:,2),maxScore,...
    [0.1 0.5 0.5; 0.5 0.1 0.5; 0.5 0.5 0.1]);
hold on
h(4:6) = gscatter(X(:,1),X(:,2),Y);
title('{\bf Iris Classification Regions}');
xlabel('Petal Length (cm)');
ylabel('Petal Width (cm)');
legend(h,{'setosa region','versicolor region','virginica region',...
    'observed setosa','observed versicolor','observed virginica'},...
    'Location','Northwest');
axis tight
hold off
X — Predictor data
matrix of numeric values

Predictor data to which the SVM classifier is trained, specified as a matrix of numeric values.

Each row of X corresponds to one observation (also known as an instance or example), and each column corresponds to one predictor.

The length of Y and the number of rows of X must be equal.

It is good practice to:

Cross-validate using the KFold name-value pair argument. The cross-validation results determine how well the SVM classifier generalizes.

Standardize the predictor variables using the Standardize name-value pair argument.

To specify the names of the predictors in the order of their appearance in X, use the PredictorNames name-value pair argument.

Data Types: double | single
Y — Class labels
categorical array | character array | logical vector | vector of numeric values | cell array of strings

Class labels to which the SVM classifier is trained, specified as a categorical or character array, logical or numeric vector, or cell array of strings.

If Y is a character array, then each element must correspond to one row of the array.

The length of Y and the number of rows of X must be equal.

It is good practice to specify the order of the classes using the ClassNames name-value pair argument.

To specify the response variable name, use the ResponseName name-value pair argument.
Specify optional comma-separated pairs of Name,Value arguments. Name is the argument name and Value is the corresponding value. Name must appear inside single quotes (' '). You can specify several name and value pair arguments in any order as Name1,Value1,...,NameN,ValueN.

Example: 'KFold',10,'Cost',[0 2;1 0],'ScoreTransform','sign' specifies to perform 10-fold cross-validation, apply double the penalty to false positives compared to false negatives, and transform the scores using the sign function.

'Alpha' — Initial estimates of alpha coefficients
vector of nonnegative real values

Initial estimates of alpha coefficients, specified as the comma-separated pair consisting of 'Alpha' and a vector of nonnegative real values. The length of Alpha must be equal to the number of rows of X.

Each element of Alpha corresponds to an observation in X. Alpha cannot contain any NaNs.

If you specify Alpha and any of the cross-validation name-value pair arguments ('CrossVal', 'CVPartition', 'Holdout', 'KFold', or 'Leaveout'), then the software returns an error.

The defaults are:

0.5*ones(size(X,1),1) for one-class learning

zeros(size(X,1),1) for two-class learning

Example: 'Alpha',0.1*ones(size(X,1),1)

Data Types: double | single
'BoxConstraint' — Box constraint
1 (default) | positive scalar

Box constraint, specified as the comma-separated pair consisting of 'BoxConstraint' and a positive scalar.

For one-class learning, the software always sets the box constraint to 1.

Example: 'BoxConstraint',100

Data Types: double | single
'CacheSize' — Cache size
1000 (default) | 'maximal' | positive scalar

Cache size, specified as the comma-separated pair consisting of 'CacheSize' and 'maximal' or a positive scalar.

If CacheSize is 'maximal', then the software reserves enough disk space to hold the entire n-by-n Gram matrix.

If CacheSize is a positive scalar, then the software reserves CacheSize megabytes of disk space for training the classifier.

Example: 'CacheSize','maximal'

Data Types: double | char | single
'ClassNames' — Class names
categorical array | character array | logical vector | vector of numeric values | cell array of strings

Class names, specified as the comma-separated pair consisting of 'ClassNames' and a categorical or character array, logical or numeric vector, or cell array of strings. You must set ClassNames using the data type of Y.

The default is the distinct class names of Y.

If Y is a character array, then each element must correspond to one row of the array.

Use ClassNames to order the classes or to select a subset of classes for training.

Example: 'ClassNames',logical([0,1])
'Cost' — Misclassification cost
square matrix | structure array

Misclassification cost, specified as the comma-separated pair consisting of 'Cost' and a square matrix or structure. If you specify:

The square matrix Cost, then Cost(i,j) is the cost of classifying a point into class j if its true class is i (i.e., the rows correspond to the true class and the columns correspond to the predicted class). To specify the class order for the corresponding rows and columns of Cost, additionally specify the ClassNames name-value pair argument.

The structure S, then it must have two fields:

S.ClassNames, which contains the class names as a variable of the same data type as Y

S.ClassificationCosts, which contains the cost matrix with rows and columns ordered as in S.ClassNames

For two-class learning, if you specify a cost matrix, then the software updates the prior probabilities by incorporating the penalties described in the cost matrix. Subsequently, the cost matrix resets to the default. For more details, see Algorithms.

The defaults are:

For one-class learning, Cost = 0.

For two-class learning, Cost(i,j) = 1 if i ~= j, and Cost(i,j) = 0 if i = j.

Example: 'Cost',[0,1;2,0]

Data Types: double | single | struct
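As an illustrative sketch (the class names and cost values here are hypothetical, not defaults), the structure form described above can be built and passed like this:

```matlab
% Hypothetical costs: misclassifying a true 'virginica' is twice as
% costly as misclassifying a true 'versicolor'.
S.ClassNames = {'versicolor';'virginica'};  % same data type as Y
S.ClassificationCosts = [0 1; 2 0];         % rows = true class, columns = predicted class

load fisheriris
inds = ~strcmp(species,'setosa');
SVMModel = fitcsvm(meas(inds,3:4),species(inds),'Cost',S);
```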
'CrossVal' — Flag to train cross-validated classifier
'off' (default) | 'on'

Flag to train a cross-validated classifier, specified as the comma-separated pair consisting of 'Crossval' and a string.

If you specify 'on', then the software trains a cross-validated classifier with 10 folds.

You can override this cross-validation setting using one of the 'KFold', 'Holdout', 'Leaveout', or 'CVPartition' name-value pair arguments. You can only use one of these four options at a time for creating a cross-validated model.

Alternatively, cross-validate SVMModel later by passing it to crossval.

Example: 'Crossval','on'

Data Types: char
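As noted above, you can also cross-validate after training by passing the trained classifier to crossval; a minimal sketch using the ionosphere data:

```matlab
load ionosphere
SVMModel = fitcsvm(X,Y,'Standardize',true);
CVSVMModel = crossval(SVMModel);    % 10 folds by default
classLoss = kfoldLoss(CVSVMModel);  % estimated out-of-sample misclassification rate
```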
'CVPartition' — Cross-validation partition
[] (default) | cvpartition partition object

Cross-validation partition, specified as the comma-separated pair consisting of 'CVPartition' and a cvpartition partition object as created by cvpartition. The partition object specifies the type of cross-validation, and also the indexing for training and validation sets.

If you specify CVPartition, then you cannot specify any of Holdout, KFold, or Leaveout.
'DeltaGradientTolerance' — Tolerance for gradient difference
nonnegative scalar

Tolerance for the gradient difference between upper and lower violators obtained by Sequential Minimal Optimization (SMO) or Iterative Single Data Algorithm (ISDA), specified as the comma-separated pair consisting of 'DeltaGradientTolerance' and a nonnegative scalar.

If DeltaGradientTolerance is 0, then the software does not use the tolerance for the gradient difference to check for optimization convergence.

The defaults are:

1e-3 if the solver is SMO (for example, you set 'Solver','SMO')

0 if the solver is ISDA (for example, you set 'Solver','ISDA')

Example: 'DeltaGradientTolerance',1e-2

Data Types: double | single
'GapTolerance' — Feasibility gap tolerance
0 (default) | nonnegative scalar

Feasibility gap tolerance obtained by SMO or ISDA, specified as the comma-separated pair consisting of 'GapTolerance' and a nonnegative scalar.

If GapTolerance is 0, then the software does not use the feasibility gap tolerance to check for optimization convergence.

Example: 'GapTolerance',1e-2

Data Types: double | single
'Holdout' — Fraction of data for holdout validation
scalar value in the range (0,1)

Fraction of data used for holdout validation, specified as the comma-separated pair consisting of 'Holdout' and a scalar value in the range (0,1). If you specify 'Holdout',p, then the software:

Randomly reserves p*100% of the data as validation data, and trains the model using the rest of the data

Stores the compact, trained model in CVMdl.Trained

If you specify Holdout, then you cannot specify any of CVPartition, KFold, or Leaveout.

Example: 'Holdout',0.1

Data Types: double | single
'IterationLimit' — Maximal number of numerical optimization iterations
1e6 (default) | positive integer

Maximal number of numerical optimization iterations, specified as the comma-separated pair consisting of 'IterationLimit' and a positive integer.

The software returns a trained classifier regardless of whether the optimization routine successfully converges.

Example: 'IterationLimit',1e8

Data Types: double | single
'KernelFunction' — Kernel function
string

Kernel function used to compute the Gram matrix, specified as the comma-separated pair consisting of 'KernelFunction' and a string.

This table summarizes the available options for setting a kernel function.

Value | Description | Formula
'gaussian' or 'rbf' | Gaussian or Radial Basis Function (RBF) kernel, default for one-class learning | $$G\left({x}_{1},{x}_{2}\right)=\mathrm{exp}\left(-{\Vert {x}_{1}-{x}_{2}\Vert}^{2}\right)$$
'linear' | Linear kernel, default for two-class learning | $$G({x}_{1},{x}_{2})={x}_{1}\prime {x}_{2}$$
'polynomial' | Polynomial kernel. Use 'PolynomialOrder',polyOrder to specify a polynomial kernel of order polyOrder. | $$G({x}_{1},{x}_{2})={(1+{x}_{1}\prime {x}_{2})}^{p}$$

You can set your own kernel function, for example, kernel, by setting 'KernelFunction','kernel'. kernel must have the following form:

function G = kernel(U,V)

U is an m-by-p matrix.

V is an n-by-p matrix.

G is an m-by-n Gram matrix of the rows of U and V.

And kernel.m must be on the MATLAB® path.

It is good practice to avoid using generic names for kernel functions. For example, call a sigmoid kernel function 'mysigmoid' rather than 'sigmoid'.

Example: 'KernelFunction','gaussian'

Data Types: char
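Following the required form above, a custom sigmoid kernel might look like the sketch below; the slope and offset values are illustrative assumptions, not documented defaults. Save it as mysigmoid.m on the MATLAB path.

```matlab
function G = mysigmoid(U,V)
% Sigmoid kernel sketch. U is m-by-p, V is n-by-p; G is the m-by-n
% Gram matrix of the rows of U and V.
gamma = 1;  % slope (assumed value)
c = -1;     % offset (assumed value)
G = tanh(gamma*(U*V') + c);
end
```

Then train with, for example, SVMModel = fitcsvm(X,Y,'KernelFunction','mysigmoid').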
'KernelOffset' — Kernel offset parameter
nonnegative scalar

Kernel offset parameter, specified as the comma-separated pair consisting of 'KernelOffset' and a nonnegative scalar.

The software adds KernelOffset to each element of the Gram matrix.

The defaults are:

0 if the solver is SMO (for example, you set 'Solver','SMO')

0.1 if the solver is ISDA (for example, you set 'Solver','ISDA')

Example: 'KernelOffset',0

Data Types: double | single
'KernelScale' — Kernel scale parameter
1 (default) | 'auto' | positive scalar

Kernel scale parameter, specified as the comma-separated pair consisting of 'KernelScale' and 'auto' or a positive scalar.

If KernelFunction is 'gaussian' ('rbf'), 'linear', or 'polynomial', then the software divides all elements of the predictor matrix X by the value of KernelScale. Then, the software applies the appropriate kernel norm to compute the Gram matrix.

If you specify 'auto', then the software uses a heuristic procedure to select the scale value. The heuristic procedure uses subsampling. Therefore, to reproduce results, set a random number seed using rng before training the classifier.

If you specify KernelScale and your own kernel function, for example, kernel, using 'KernelFunction','kernel', then the software displays an error. You must apply scaling within kernel.

Example: 'KernelScale','auto'

Data Types: double | single | char
'KFold' — Number of folds
10 (default) | positive integer value

Number of folds to use in a cross-validated classifier, specified as the comma-separated pair consisting of 'KFold' and a positive integer value.

You can only use one of these four options at a time to create a cross-validated model: 'KFold', 'Holdout', 'Leaveout', or 'CVPartition'.

Example: 'KFold',8

Data Types: single | double
'KKTTolerance' — Karush-Kuhn-Tucker complementarity conditions violation tolerance
nonnegative scalar

Karush-Kuhn-Tucker (KKT) complementarity conditions violation tolerance, specified as the comma-separated pair consisting of 'KKTTolerance' and a nonnegative scalar.

If KKTTolerance is 0, then the software does not use the KKT complementarity conditions violation tolerance to check for optimization convergence.

The defaults are:

0 if the solver is SMO (for example, you set 'Solver','SMO')

1e-3 if the solver is ISDA (for example, you set 'Solver','ISDA')

Example: 'KKTTolerance',1e-2

Data Types: double | single
'Leaveout' — Leave-one-out cross-validation flag
'off' (default) | 'on'

Leave-one-out cross-validation flag, specified as the comma-separated pair consisting of 'Leaveout' and either 'on' or 'off'. If you specify 'on', then the software implements leave-one-out cross-validation.

If you use 'Leaveout', then you cannot use the 'CVPartition', 'Holdout', or 'KFold' name-value pair arguments.

Example: 'Leaveout','on'

Data Types: char
'Nu' — ν parameter for one-class learning
0.5 (default) | positive scalar

ν parameter for one-class learning, specified as the comma-separated pair consisting of 'Nu' and a positive scalar. Nu must be greater than 0 and at most 1.

Set Nu to control the tradeoff between ensuring most training examples are in the positive class and minimizing the weights in the score function.

Example: 'Nu',0.25

Data Types: double | single
'NumPrint' — Number of iterations between optimization diagnostic message output
1000 (default) | nonnegative integer

Number of iterations between optimization diagnostic message output, specified as the comma-separated pair consisting of 'NumPrint' and a nonnegative integer.

If you use 'Verbose',1 and 'NumPrint',numprint, then the software displays all optimization diagnostic messages from SMO and ISDA every numprint iterations in the Command Window.

Example: 'NumPrint',500

Data Types: double | single
'OutlierFraction' — Expected proportion of outliers in training data
0 (default) | nonnegative scalar

Expected proportion of outliers in the training data, specified as the comma-separated pair consisting of 'OutlierFraction' and a nonnegative scalar. OutlierFraction must be at least 0 and less than 1.

If you set 'OutlierFraction',outlierfraction, where outlierfraction is a value greater than 0, then:

For two-class learning, the software implements robust learning. In other words, the software attempts to remove 100*outlierfraction% of the observations when the optimization algorithm converges. The removed observations correspond to gradients that are large in magnitude.

For one-class learning, the software finds an appropriate bias term such that outlierfraction of the observations in the training set have negative scores.

Example: 'OutlierFraction',0.01

Data Types: double | single
'PolynomialOrder' — Polynomial kernel function order
3 (default) | positive integer

Polynomial kernel function order, specified as the comma-separated pair consisting of 'PolynomialOrder' and a positive integer.

If you set 'PolynomialOrder' and KernelFunction is not 'polynomial', then the software displays an error.

Example: 'PolynomialOrder',2

Data Types: double | single
'PredictorNames' — Predictor variable names
{'x1','x2',...} (default) | cell array of strings

Predictor variable names, specified as the comma-separated pair consisting of 'PredictorNames' and a cell array of strings containing the names for the predictor variables, in the order in which they appear in X.

Example: 'PredictorNames',{'PetalWidth','PetalLength'}

Data Types: cell
'Prior' — Prior probabilities
'empirical' (default) | 'uniform' | numeric vector | structure

Prior probabilities for each class, specified as the comma-separated pair consisting of 'Prior' and a string, numeric vector, or a structure.

This table summarizes the available options for setting prior probabilities.

Value | Description
'empirical' | The class prior probabilities are the class relative frequencies in Y.
'uniform' | All class prior probabilities are equal to 1/K, where K is the number of classes.
numeric vector | Each element is a class prior probability. Order the elements according to SVMModel.ClassNames or specify the order using the ClassNames name-value pair argument. The software normalizes the elements such that they sum to 1.
structure | A structure S with two fields: S.ClassNames, which contains the class names as a variable of the same data type as Y, and S.ClassProbs, which contains a vector of corresponding prior probabilities. The software normalizes the elements such that they sum to 1.

For two-class learning, if you specify a cost matrix, then the software updates the prior probabilities by incorporating the penalties described in the cost matrix. For more details, see Algorithms.

Example: struct('ClassNames',{{'setosa','versicolor'}},'ClassProbs',[1,2])

Data Types: char | double | single | struct
'ResponseName' — Response variable name
'Y' (default) | string

Response variable name, specified as the comma-separated pair consisting of 'ResponseName' and a string containing the name of the response variable Y.

Example: 'ResponseName','IrisType'

Data Types: char
'ScoreTransform' — Score transform function
'none' (default) | 'doublelogit' | 'invlogit' | 'ismax' | 'logit' | 'sign' | 'symmetric' | 'symmetriclogit' | 'symmetricismax' | function handle

Score transform function, specified as the comma-separated pair consisting of 'ScoreTransform' and a string or function handle.

If the value is a string, then it must correspond to a built-in function. This table summarizes the available built-in functions.

String | Formula
'doublelogit' | 1/(1 + e^{–2x})
'invlogit' | log(x / (1 – x))
'ismax' | Set the score for the class with the largest score to 1, and scores for all other classes to 0.
'logit' | 1/(1 + e^{–x})
'none' | x (no transformation)
'sign' | –1 for x < 0, 0 for x = 0, 1 for x > 0
'symmetric' | 2x – 1
'symmetriclogit' | 2/(1 + e^{–x}) – 1
'symmetricismax' | Set the score for the class with the largest score to 1, and scores for all other classes to –1.

For a MATLAB function, or a function that you define, enter its function handle.

SVMModel.ScoreTransform = @function;

function should accept a matrix (the original scores) and return a matrix of the same size (the transformed scores).

Example: 'ScoreTransform','sign'

Data Types: char | function_handle
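For example, a custom transform (the handle name and scaling below are illustrative assumptions) accepts the score matrix and returns a matrix of the same size:

```matlab
% Hypothetical transform: a steeper logistic squash of the scores.
mytransform = @(score) 1./(1 + exp(-2*score));

load fisheriris
inds = ~strcmp(species,'setosa');
SVMModel = fitcsvm(meas(inds,3:4),species(inds),...
    'ScoreTransform',mytransform);
```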
'ShrinkagePeriod' — Number of iterations between movement of observations from active to inactive set
0 (default) | nonnegative integer

Number of iterations between the movement of observations from the active to inactive set, specified as the comma-separated pair consisting of 'ShrinkagePeriod' and a nonnegative integer.

If you set 'ShrinkagePeriod',0, then the software does not shrink the active set.

Example: 'ShrinkagePeriod',1000

Data Types: double | single
'Solver' — Optimization routine
'ISDA' | 'L1QP' | 'SMO'

Optimization routine, specified as the comma-separated pair consisting of 'Solver' and a string.

This table summarizes the available optimization routine options.

Value | Description
'ISDA' | Iterative Single Data Algorithm (see [4])
'L1QP' | Uses quadprog to implement L1 soft-margin minimization by quadratic programming. This option requires an Optimization Toolbox™ license. For more details, see Quadratic Programming Definition.
'SMO' | Sequential Minimal Optimization (see [2])

The defaults are:

'ISDA' if you set 'OutlierFraction' to a positive value and for two-class learning

'SMO' otherwise

Example: 'Solver','ISDA'

Data Types: char
'Standardize' — Flag to standardize predictors
false (default) | true

Flag to standardize the predictors, specified as the comma-separated pair consisting of 'Standardize' and true (1) or false (0).

If you set 'Standardize',true, then the software centers and scales each column of the predictor data (X) by the column mean and standard deviation, respectively. It is good practice to standardize the predictor data.

Example: 'Standardize',true

Data Types: logical
'Verbose' — Verbosity level
0 (default) | 1 | 2

Verbosity level, specified as the comma-separated pair consisting of 'Verbose' and either 0, 1, or 2. Verbose controls the amount of optimization information that the software displays in the Command Window and saves as a structure to SVMModel.ConvergenceInfo.History.

This table summarizes the available verbosity level options.

Value | Description
0 | The software does not display or save convergence information.
1 | The software displays diagnostic messages and saves convergence criteria every numprint iterations, where numprint is the value of the name-value pair argument 'NumPrint'.
2 | The software displays diagnostic messages and saves convergence criteria at every iteration.

Example: 'Verbose',1

Data Types: double | single
'Weights' — Observation weights
ones(size(X,1),1) (default) | numeric vector

Observation weights, specified as the comma-separated pair consisting of 'Weights' and a numeric vector.

The size of Weights must equal the number of rows of X. The software weighs the observations in each row of X with the corresponding weight in Weights.

The software normalizes Weights to sum up to the value of the prior probability in the respective class.

Data Types: double | single
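As an illustrative sketch (the weighting scheme below is an assumption, chosen only to show the mechanics), you can pass a vector with one weight per row of X:

```matlab
load fisheriris
inds = ~strcmp(species,'setosa');
X = meas(inds,3:4);
y = species(inds);

% Hypothetical scheme: double the weight of the second half of the sample.
w = ones(size(X,1),1);
w(end/2+1:end) = 2;

SVMModel = fitcsvm(X,y,'Weights',w);
```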
SVMModel — Trained SVM classifier
ClassificationSVM classifier | ClassificationPartitionedModel cross-validated classifier

Trained SVM classifier, returned as a ClassificationSVM classifier or ClassificationPartitionedModel cross-validated classifier.

If you set any of the name-value pair arguments KFold, Holdout, Leaveout, CrossVal, or CVPartition, then SVMModel is a ClassificationPartitionedModel cross-validated classifier. Otherwise, SVMModel is a ClassificationSVM classifier.

To reference properties of SVMModel, use dot notation. For example, enter SVMModel.Alpha in the Command Window to display the trained Lagrange multipliers.
fitcsvm trains SVM classifiers for one- or two-class learning applications. To train SVM classifiers using data with more than two classes, use fitcecoc.
A parameter that controls the maximum penalty imposed on margin-violating observations, and aids in preventing overfitting (regularization).

If you increase the box constraint, then the SVM classifier assigns fewer support vectors. However, increasing the box constraint can lead to longer training times.
The Gram matrix of a set of n vectors {x_{1},..,x_{n}; x_{j} ∊ R^{p}} is an n-by-n matrix with element (j,k) defined as G(x_{j},x_{k}) = <ϕ(x_{j}),ϕ(x_{k})>, an inner product of the transformed predictors using the kernel function ϕ.
For nonlinear SVM, the algorithm forms a Gram matrix using the predictor matrix columns. The dual formalization replaces the inner product of the predictors with corresponding elements of the resulting Gram matrix (called the "kernel trick"). Subsequently, nonlinear SVM operates in the transformed predictor space to find a separating hyperplane.
KKT complementarity conditions are optimization constraints required for optimal nonlinear programming solutions.
In SVM, the KKT complementarity conditions are
$$\{\begin{array}{l}{\alpha}_{j}\left[{y}_{j}\left(w\prime \varphi \left({x}_{j}\right)+b\right)-1+{\xi}_{j}\right]=0\\ {\xi}_{j}\left(C-{\alpha}_{j}\right)=0\end{array}$$
for all j = 1,...,n, where w_{j} is a weight, ϕ is a kernel function (see Gram matrix), and ξ_{j} is a slack variable. If the classes are perfectly separable, then ξ_{j} = 0 for all j = 1,...,n.
One-class learning, or unsupervised SVM, aims at separating data from the origin in the high-dimensional predictor space (not the original predictor space), and is an algorithm used for outlier detection.
The algorithm resembles that of SVM for binary classification. The objective is to minimize dual expression
$$0.5{\displaystyle \sum _{jk}{\alpha}_{j}}{\alpha}_{k}G({x}_{j},{x}_{k})$$
with respect to $${\alpha}_{1},\mathrm{...},{\alpha}_{n}$$, subject to
$$\sum {\alpha}_{j}=n\nu $$
and $$0\le {\alpha}_{j}\le 1$$ for all j = 1,...,n. G(x_{j},x_{k}) is element (j,k) of the Gram matrix.
A small value of ν leads to fewer support vectors, and, therefore, a smooth, crude decision boundary. A large value of ν leads to more support vectors, and therefore, a curvy, flexible decision boundary. The optimal value of ν should be large enough to capture the data complexity and small enough to avoid overtraining. Also, 0 < ν ≤ 1.
For more details, see [5].
Support vectors are observations corresponding to strictly positive estimates of α_{1},...,α_{n}.
SVM classifiers that yield fewer support vectors for a given training set are more desirable.
The SVM binary classification algorithm searches for an optimal hyperplane that separates the data into two classes. For separable classes, the optimal hyperplane maximizes a margin (space that does not contain any observations) surrounding itself, which creates boundaries for the positive and negative classes. For inseparable classes, the objective is the same, but the algorithm imposes a penalty on the length of the margin for every observation that is on the wrong side of its class boundary.
The linear SVM score function is
$$f(x)=x\prime \beta +{\beta}_{0},$$
where:
x is an observation (corresponding
to a row of X
).
The vector β contains the
coefficients that define an orthogonal vector to the hyperplane (corresponding
to SVMModel.Beta
). For separable data, the optimal
margin length is $$2/\Vert \beta \Vert .$$
β_{0} is
the bias term (corresponding to SVMModel.Bias
).
The root of f(x) for particular coefficients defines a hyperplane. For a particular hyperplane, f(z) is the distance from point z to the hyperplane.
An SVM classifier searches for the maximum margin length, while keeping observations in the positive (y = 1) and negative (y = –1) classes separate. Therefore:
For separable classes, the objective is to minimize $$\Vert \beta \Vert $$ with respect to the β and β_{0} subject to y_{j}f(x_{j}) ≥ 1, for all j = 1,..,n. This is the primal formalization for separable classes.
For inseparable classes, SVM uses slack variables (ξ_{j}) to penalize the objective function for observations that cross the margin boundary for their class. ξ_{j} = 0 for observations that do not cross the margin boundary for their class, otherwise ξ_{j} ≥ 0.
The objective is to minimize $$0.5{\Vert \beta \Vert}^{2}+C{\displaystyle \sum {\xi}_{j}}$$ with respect to β, β_{0}, and ξ_{j}, subject to $${y}_{j}f({x}_{j})\ge 1-{\xi}_{j}$$ and $${\xi}_{j}\ge 0$$ for all j = 1,...,n, and for a positive scalar box constraint C. This is the primal formalization for inseparable classes.
SVM uses the Lagrange multipliers method to optimize the objective. This introduces n coefficients α_{1},...,α_{n} (corresponding to SVMModel.Alpha). The dual formalizations for linear SVM are:
For separable classes, minimize
$$0.5{\displaystyle \sum _{j=1}^{n}{\displaystyle \sum _{k=1}^{n}{\alpha}_{j}{\alpha}_{k}{y}_{j}{y}_{k}{x}_{j}\prime {x}_{k}}}-{\displaystyle \sum _{j=1}^{n}{\alpha}_{j}}$$
with respect to α_{1},...,α_{n}, subject to $$\sum {\alpha}_{j}{y}_{j}=0$$, α_{j} ≥ 0 for all j = 1,...,n, and the Karush-Kuhn-Tucker (KKT) complementarity conditions.
For inseparable classes, the objective is the same as for separable classes, except for the additional condition $$0\le {\alpha}_{j}\le C$$ for all j = 1,...,n.
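For small problems, you can solve the inseparable-class dual directly with quadprog (listed in See Also). The following sketch is illustrative only; fitcsvm uses specialized solvers such as SMO instead, and the box constraint value here is a hypothetical choice.

```matlab
% Sketch: minimize 0.5*alpha'*H*alpha - sum(alpha) subject to y'*alpha = 0
% and 0 <= alpha_j <= C, where H(j,k) = y_j*y_k*x_j'*x_k.
load fisheriris
inds = ~strcmp(species,'setosa');
X = meas(inds,3:4);
yNum = 2*strcmp(species(inds),'virginica') - 1;   % labels as +1/-1
n = numel(yNum);
H = (yNum*yNum').*(X*X');                         % Gram matrix with label signs
Cbox = 1;                                         % hypothetical box constraint
alpha = quadprog(H,-ones(n,1),[],[],yNum',0, ...
    zeros(n,1),Cbox*ones(n,1));
nnz(alpha > 1e-6)                                 % number of support vectors
```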
The resulting score function is
$$f(x)={\displaystyle \sum _{j=1}^{n}{\widehat{\alpha}}_{j}}{y}_{j}x\prime {x}_{j}+\widehat{b}.$$
The score function is free of the estimate of β because the optimality conditions express β in terms of the α_{j} and the training observations.
In some cases, there is a nonlinear boundary separating the classes. Nonlinear SVM works in a transformed predictor space to find an optimal, separating hyperplane.
The dual formalization for nonlinear SVM is
$$0.5{\displaystyle \sum _{j=1}^{n}{\displaystyle \sum _{k=1}^{n}{\alpha}_{j}{\alpha}_{k}{y}_{j}{y}_{k}G({x}_{j},{x}_{k})}}-{\displaystyle \sum _{j=1}^{n}{\alpha}_{j}}$$
with respect to α_{1},...,α_{n}, subject to $$\sum {\alpha}_{j}{y}_{j}=0$$, $$0\le {\alpha}_{j}\le C$$ for all j = 1,...,n, and the KKT complementarity conditions. G(x_{k},x_{j}) are elements of the Gram matrix. The resulting score function is
$$f(x)={\displaystyle \sum _{j=1}^{n}{\alpha}_{j}}{y}_{j}G(x,{x}_{j})+b.$$
For more details, see Understanding Support Vector Machines, [1], and [3].
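As an illustration, for a Gaussian kernel you can evaluate the nonlinear score function by forming the Gram vector between a query point and the support vectors. This sketch assumes default kernel scaling and no standardization.

```matlab
% Sketch: f(z) = sum_j alpha_j*y_j*G(z,x_j) + b for a Gaussian kernel, where
% G(z,x_j) = exp(-||z - x_j||^2/s^2) and s is the kernel scale.
load fisheriris
inds = ~strcmp(species,'setosa');
X = meas(inds,3:4);
y = species(inds);
Mdl = fitcsvm(X,y,'KernelFunction','gaussian');
sv = Mdl.SupportVectors;
coeff = Mdl.Alpha.*Mdl.SupportVectorLabels;    % alpha_j*y_j
s = Mdl.KernelParameters.Scale;
z = X(1,:);                                    % one query point
G = exp(-sum((sv - z).^2,2)/s^2);              % Gram vector G(z,x_j)
f = G'*coeff + Mdl.Bias;                       % manual score
[~,score] = predict(Mdl,z);                    % compare with predict
[f score(2)]
```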
For one-class learning:
The default setting for the name-value pair argument 'Alpha' can lead to long training times. To speed up training, set Alpha to a vector mostly composed of 0s.
Set the name-value pair argument Nu to a value closer to 0 to yield fewer support vectors and, therefore, a smoother, but cruder decision boundary.
Sparsity in support vectors is a desirable property of an SVM classifier. To decrease the number of support vectors, set BoxConstraint to a large value. This also increases the training time.
For large data sets, try optimizing the cache size. This can have a significant impact on the training speed.
If the number of support vectors is much less than the number of observations in the training set, then you might significantly speed up convergence by shrinking the active set using the name-value pair argument 'ShrinkagePeriod'. It is good practice to use 'ShrinkagePeriod',1000.
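Both tips translate into name-value pair arguments at training time. The values below are illustrative.

```matlab
% Usage sketch: increase the kernel cache and enable periodic active-set
% shrinking for a large training set (values are illustrative).
load fisheriris
inds = ~strcmp(species,'setosa');
X = meas(inds,3:4);
y = species(inds);
Mdl = fitcsvm(X,y,'CacheSize','maximal','ShrinkagePeriod',1000);
```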
All solvers implement L1 soft-margin minimization.
fitcsvm and svmtrain use, among other algorithms, SMO for optimization. The software implements SMO differently between the two functions, but numerical studies show that there is sensible agreement in the results.
For one-class learning, the software estimates the Lagrange multipliers, α_{1},...,α_{n}, such that
$${\displaystyle \sum _{j=1}^{n}{\alpha}_{j}}=n\nu .$$
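You can check this constraint on a trained one-class model by summing the Alpha property. This is a sketch; small numerical deviations from the solver are expected.

```matlab
% Sketch: verify that sum(alpha_j) is approximately n*nu for one-class learning.
load fisheriris
X = meas(:,3:4);
n = size(X,1);
nu = 0.2;
Mdl = fitcsvm(X,ones(n,1),'KernelFunction','gaussian','Nu',nu);
[sum(Mdl.Alpha) n*nu]    % the two values should be close
```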
For two-class learning, if you specify a cost matrix C, then the software updates the class prior probabilities (p) to p_{c} by incorporating the penalties described in C. The formula for the updated prior probability vector is
$${p}_{c}=\frac{p\prime C}{\sum p\prime C}.$$
Subsequently, the software resets the cost matrix to the default:
$$C=\left[\begin{array}{cc}0& 1\\ 1& 0\end{array}\right].$$
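The update is simple arithmetic. For example, with equal priors and a hypothetical cost matrix that penalizes one error type twice as heavily:

```matlab
% Illustrative arithmetic for p_c = (p'*C)/sum(p'*C) with hypothetical values.
p = [0.5 0.5];            % original class prior probabilities
C = [0 2; 1 0];           % hypothetical misclassification cost matrix
pc = (p*C)/sum(p*C)       % updated priors: [1/3 2/3]
```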
If you set 'Standardize',true when you train the SVM classifier using fitcsvm, then the software trains the classifier using the standardized predictor matrix, but stores the unstandardized data in the classifier property X. However, if you standardize the data, then the data size in memory doubles until optimization ends.
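The standardization parameters are stored with the model. This sketch shows where to find them; the X property still holds the raw predictor data.

```matlab
% Sketch: with 'Standardize',true, the model keeps the raw data in X and
% records the standardization parameters in Mu and Sigma.
load fisheriris
inds = ~strcmp(species,'setosa');
X = meas(inds,3:4);
y = species(inds);
Mdl = fitcsvm(X,y,'Standardize',true);
isequal(Mdl.X,X)          % true: raw data is stored
[Mdl.Mu; Mdl.Sigma]       % predictor means and standard deviations
```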
If you set 'Standardize',true and any of 'Cost', 'Prior', or 'Weights', then the software standardizes the predictors using their corresponding weighted means and weighted standard deviations.
Let p be the proportion of outliers that you expect in the training data. If you use 'OutlierFraction',p when you train the SVM classifier using fitcsvm, then:
For one-class learning, the software trains the bias term such that 100p% of the observations in the training data have negative scores.
For two-class learning, the software implements robust learning. In other words, the software attempts to remove 100p% of the observations when the optimization algorithm converges. The removed observations correspond to gradients that are large in magnitude.
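For one-class learning, you can check the bias calibration by measuring the fraction of negative resubstitution scores; it should be close to the specified outlier fraction. This sketch assumes a hypothetical 5% contamination.

```matlab
% Sketch: with 'OutlierFraction',0.05, about 5% of training observations
% should receive negative scores in one-class learning.
rng(1)                     % for reproducibility
load fisheriris
X = meas(:,3:4);
Mdl = fitcsvm(X,ones(size(X,1),1),'KernelFunction','gaussian', ...
    'OutlierFraction',0.05);
[~,s] = resubPredict(Mdl);
mean(s < 0)                % approximately 0.05
```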
[1] Cristianini, N., and J. Shawe-Taylor. An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods. Cambridge, UK: Cambridge University Press, 2000.
[2] Fan, R.-E., P.-H. Chen, and C.-J. Lin. "Working set selection using second order information for training support vector machines." Journal of Machine Learning Research, Vol. 6, 2005, pp. 1889–1918.
[3] Hastie, T., R. Tibshirani, and J. Friedman. The Elements of Statistical Learning, Second Edition. NY: Springer, 2008.
[4] Kecman, V., T. M. Huang, and M. Vogt. "Iterative Single Data Algorithm for Training Kernel Machines from Huge Data Sets: Theory and Performance." In Support Vector Machines: Theory and Applications. Edited by Lipo Wang, 255–274. Berlin: Springer-Verlag, 2005.
[5] Schölkopf, B., J. C. Platt, J. Shawe-Taylor, A. J. Smola, and R. C. Williamson. "Estimating the Support of a High-Dimensional Distribution." Neural Computation, Vol. 13, Number 7, 2001, pp. 1443–1471.
[6] Schölkopf, B., and A. Smola. Learning with Kernels: Support Vector Machines, Regularization, Optimization and Beyond, Adaptive Computation and Machine Learning. Cambridge, MA: The MIT Press, 2002.
ClassificationPartitionedModel | ClassificationSVM | CompactClassificationSVM | fitcecoc | fitSVMPosterior | predict | quadprog | rng