fitcsvm

Train binary support vector machine classifier

Syntax

  • SVMModel = fitcsvm(X,Y)
  • SVMModel = fitcsvm(X,Y,Name,Value)

Description

SVMModel = fitcsvm(X,Y) returns a support vector machine classifier SVMModel, trained using the predictors X and the class labels Y for one- or two-class classification.

SVMModel = fitcsvm(X,Y,Name,Value) returns a support vector machine classifier with additional options specified by one or more Name,Value pair arguments.

For example, you can specify the type of cross validation, the cost for misclassification, or the type of score transformation function.
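
For instance, a minimal sketch combining several of these options in one call (using the Fisher iris variables from the examples below):

load fisheriris
inds = ~strcmp(species,'setosa');                  % keep the two-class subset
CVSVMModel = fitcsvm(meas(inds,3:4),species(inds),'KFold',10,...
    'Cost',[0 2;1 0],'ScoreTransform','sign');     % cross-validated, cost-sensitive fit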

Examples

Train a Support Vector Machine Classifier

Load Fisher's iris data set. Remove the sepal lengths and widths, and all observed setosa irises.

load fisheriris
inds = ~strcmp(species,'setosa');
X = meas(inds,3:4);
y = species(inds);

Train an SVM classifier using the processed data set.

SVMModel = fitcsvm(X,y)
SVMModel = 

  ClassificationSVM
      PredictorNames: {'x1'  'x2'}
        ResponseName: 'Y'
          ClassNames: {'versicolor'  'virginica'}
      ScoreTransform: 'none'
     NumObservations: 100
               Alpha: [24x1 double]
                Bias: -14.4149
    KernelParameters: [1x1 struct]
      BoxConstraints: [100x1 double]
     ConvergenceInfo: [1x1 struct]
     IsSupportVector: [100x1 logical]
              Solver: 'SMO'


The Command Window shows that SVMModel is a trained ClassificationSVM classifier, along with a list of its properties. Display individual properties of SVMModel by using dot notation, for example, to determine the class order.

classOrder = SVMModel.ClassNames
classOrder = 

    'versicolor'
    'virginica'

The first class ('versicolor') is the negative class, and the second ('virginica') is the positive class. You can change the class order during training by using the 'ClassNames' name-value pair argument.
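
For example, continuing with the same X and y, a sketch that retrains with the class order reversed so that 'virginica' becomes the negative class:

SVMModel2 = fitcsvm(X,y,'ClassNames',{'virginica','versicolor'});
SVMModel2.ClassNames   % 'virginica' is now listed first (the negative class)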

Plot a scatter diagram of the data and circle the support vectors.

sv = SVMModel.SupportVectors;
figure
gscatter(X(:,1),X(:,2),y)
hold on
plot(sv(:,1),sv(:,2),'ko','MarkerSize',10)
legend('versicolor','virginica','Support Vector')
hold off

The support vectors are observations that occur on or beyond their estimated class boundaries.

You can adjust the boundaries (and therefore the number of support vectors) by setting a box constraint during training using the 'BoxConstraint' name-value pair argument.
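
For example, a sketch (continuing with the same X, y, and SVMModel) that retrains with a larger box constraint and compares the number of support vectors; the exact counts depend on the data:

SVMModel10 = fitcsvm(X,y,'BoxConstraint',10);
numSV1  = sum(SVMModel.IsSupportVector)     % support vectors with the default box constraint
numSV10 = sum(SVMModel10.IsSupportVector)   % typically fewer with a larger box constraint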

Train and Cross Validate an SVM Classifier

Load the ionosphere data set.

load ionosphere
rng(1); % For reproducibility

Train an SVM classifier using the radial basis kernel. Let the software find a scale value for the kernel function. It is good practice to standardize the predictors.

SVMModel = fitcsvm(X,Y,'Standardize',true,'KernelFunction','RBF',...
    'KernelScale','auto');

SVMModel is a trained ClassificationSVM classifier.

Cross validate the SVM classifier. By default, the software uses 10-fold cross validation.

CVSVMModel = crossval(SVMModel);

CVSVMModel is a ClassificationPartitionedModel cross-validated classifier.

Estimate the out-of-sample misclassification rate.

classLoss = kfoldLoss(CVSVMModel)
classLoss =

    0.0484

The generalization rate is approximately 5%.
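
Alternatively, you can request cross validation at training time. For example, a sketch using a 15% holdout set instead of 10 folds (the loss estimate differs from the 10-fold value):

CVSVMModel2 = fitcsvm(X,Y,'Standardize',true,'KernelFunction','RBF',...
    'KernelScale','auto','Holdout',0.15);
holdoutLoss = kfoldLoss(CVSVMModel2)   % misclassification rate on the held-out 15%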

Detect Outliers Using SVM and One-Class Learning

Load Fisher's iris data set. Remove the petal lengths and widths. Treat all irises as coming from the same class.

load fisheriris
X = meas(:,1:2);
y = ones(size(X,1),1);

Train an SVM classifier using the processed data set. Assume that 5% of the observations are outliers. It is good practice to standardize the predictors.

rng(1);
SVMModel = fitcsvm(X,y,'KernelScale','auto','Standardize',true,...
    'OutlierFraction',0.05);

SVMModel is a trained ClassificationSVM classifier. By default, the software uses the Gaussian kernel for one-class learning.

Plot the observations and the decision boundary. Flag the support vectors and potential outliers.

svInd = SVMModel.IsSupportVector;
h = 0.02; % Mesh grid step size
[X1,X2] = meshgrid(min(X(:,1)):h:max(X(:,1)),...
    min(X(:,2)):h:max(X(:,2)));
[~,score] = predict(SVMModel,[X1(:),X2(:)]);
scoreGrid = reshape(score,size(X1,1),size(X2,2));

figure
plot(X(:,1),X(:,2),'k.')
hold on
plot(X(svInd,1),X(svInd,2),'ro','MarkerSize',10)
contour(X1,X2,scoreGrid)
colorbar;
title('{\bf Iris Outlier Detection via One-Class SVM}')
xlabel('Sepal Length (cm)')
ylabel('Sepal Width (cm)')
legend('Observation','Support Vector')
hold off

The boundary separating the outliers from the rest of the data occurs where the contour value is 0.

Verify that the fraction of observations with negative scores in the cross-validated data is close to 5%.

CVSVMModel = crossval(SVMModel);
[~,scorePred] = kfoldPredict(CVSVMModel);
outlierRate = mean(scorePred<0)
outlierRate =

    0.0467

Find Multiple Class Boundaries Using Binary SVM

Load Fisher's iris data set. Use the petal lengths and widths.

load fisheriris
X = meas(:,3:4);
Y = species;

Examine a scatter plot of the data.

figure
gscatter(X(:,1),X(:,2),Y);
title('{\bf Scatter Diagram of Iris Measurements}');
xlabel('Petal Length (cm)');
ylabel('Petal Width (cm)');
legend('Location','Northwest');
lims = get(gca,{'XLim','YLim'}); % Extract the x- and y-axis limits

There are three classes, one of which is linearly separable from the others.

For each class:

  1. Create a logical vector (indx) indicating whether an observation is a member of the class.

  2. Train an SVM classifier using the predictor data and indx.

  3. Store the classifier in a cell of a cell array.

% It is good practice to define the class order and standardize the
% predictors.
SVMModels = cell(3,1);
classes = unique(Y);
rng(1); % For reproducibility

for j = 1:numel(classes)
    indx = strcmp(Y,classes(j)); % Create binary classes for each classifier
    SVMModels{j} = fitcsvm(X,indx,'ClassNames',[false true],'Standardize',true,...
        'KernelFunction','rbf','BoxConstraint',1);
end

SVMModels is a 3-by-1 cell array, with each cell containing a ClassificationSVM classifier. For each cell, the positive class is setosa, versicolor, and virginica, respectively.

Define a fine grid within the plot, and treat the coordinates as new observations from the distribution of the training data. Estimate the score of the new observations using each classifier.

d = 0.02;
[x1Grid,x2Grid] = meshgrid(min(X(:,1)):d:max(X(:,1)),...
    min(X(:,2)):d:max(X(:,2)));
xGrid = [x1Grid(:),x2Grid(:)];
N = size(xGrid,1);
Scores = zeros(N,numel(classes));

for j = 1:numel(classes)
    [~,score] = predict(SVMModels{j},xGrid);
    Scores(:,j) = score(:,2); % Second column contains positive-class scores
end

Each row of Scores contains three scores. The index of the element with the largest score is the index of the class to which the new observation most likely belongs.

Associate each new observation with the classifier that gives it the maximum score.

[~,maxScore] = max(Scores,[],2);

Color in the regions of the plot based on the class to which the corresponding new observation belongs.

figure
h(1:3) = gscatter(xGrid(:,1),xGrid(:,2),maxScore,...
    [0.1 0.5 0.5; 0.5 0.1 0.5; 0.5 0.5 0.1]);
hold on
h(4:6) = gscatter(X(:,1),X(:,2),Y);
title('{\bf Iris Classification Regions}');
xlabel('Petal Length (cm)');
ylabel('Petal Width (cm)');
legend(h,{'setosa region','versicolor region','virginica region',...
    'observed setosa','observed versicolor','observed virginica'},...
    'Location','Northwest');
axis tight
hold off
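
Continuing with the trained binary classifiers, you can classify a single new observation by taking the class whose classifier returns the largest positive-class score. The measurement values below are hypothetical:

xNew = [5 1.45];                        % hypothetical petal length and width (cm)
scoresNew = zeros(1,numel(classes));
for j = 1:numel(classes)
    [~,s] = predict(SVMModels{j},xNew);
    scoresNew(j) = s(2);                % positive-class score from classifier j
end
[~,idx] = max(scoresNew);
predictedClass = classes{idx}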

Input Arguments

X — Predictor data
matrix of numeric values

Predictor data to which the SVM classifier is trained, specified as a matrix of numeric values.

Each row of X corresponds to one observation (also known as an instance or example), and each column corresponds to one predictor.

The length of Y and the number of rows of X must be equal.

It is good practice to:

  • Cross validate using the KFold name-value pair argument. The cross-validation results determine how well the SVM classifier generalizes.

  • Standardize the predictor variables using the Standardize name-value pair argument.

To specify the names of the predictors in the order of their appearance in X, use the PredictorNames name-value pair argument.

Data Types: double | single

Y — Class labels
categorical array | character array | logical vector | vector of numeric values | cell array of strings

Class labels to which the SVM classifier is trained, specified as a categorical or character array, logical or numeric vector, or cell array of strings.

If Y is a character array, then each element must correspond to one row of the array.

The length of Y and the number of rows of X must be equal.

It is good practice to specify the order of the classes using the ClassNames name-value pair argument.

To specify the response variable name, use the ResponseName name-value pair argument.

Name-Value Pair Arguments

Specify optional comma-separated pairs of Name,Value arguments. Name is the argument name and Value is the corresponding value. Name must appear inside single quotes (' '). You can specify several name and value pair arguments in any order as Name1,Value1,...,NameN,ValueN.

Example: 'KFold',10,'Cost',[0 2;1 0],'ScoreTransform','sign' specifies to perform 10-fold cross validation, apply double the penalty to false positives compared to false negatives, and transform the scores using the sign function.

'Alpha' — Initial estimates of alpha coefficients
vector of nonnegative real values

Initial estimates of alpha coefficients, specified as the comma-separated pair consisting of 'Alpha' and a vector of nonnegative real values. The length of Alpha must be equal to the number of rows of X.

  • Each element of Alpha corresponds to an observation in X.

  • Alpha cannot contain any NaNs.

  • If you specify Alpha and any of the cross-validation name-value pair arguments ('CrossVal', 'CVPartition', 'Holdout', 'KFold', or 'Leaveout'), then the software returns an error.

The defaults are:

  • 0.5*ones(size(X,1),1) for one-class learning

  • zeros(size(X,1),1) for two-class learning

Example: 'Alpha',0.1*ones(size(X,1),1)

Data Types: double | single

'BoxConstraint' — Box constraint
1 (default) | positive scalar

Box constraint, specified as the comma-separated pair consisting of 'BoxConstraint' and a positive scalar.

For one-class learning, the software always sets the box constraint to 1.

Example: 'BoxConstraint',100

Data Types: double | single

'CacheSize' — Cache size
1000 (default) | 'maximal' | positive scalar

Cache size, specified as the comma-separated pair consisting of 'CacheSize' and 'maximal' or a positive scalar.

If CacheSize is 'maximal', then the software reserves enough memory to hold the entire n-by-n Gram matrix.

If CacheSize is a positive scalar, then the software reserves CacheSize megabytes of memory for training the classifier.

Example: 'CacheSize','maximal'

Data Types: double | char | single

'ClassNames' — Class names
categorical array | character array | logical vector | vector of numeric values | cell array of strings

Class names, specified as the comma-separated pair consisting of 'ClassNames' and a categorical or character array, logical or numeric vector, or cell array of strings. You must set ClassNames using the data type of Y.

The default is the distinct class names of Y.

If Y is a character array, then each element must correspond to one row of the array.

Use ClassNames to order the classes or to select a subset of classes for training.

Example: 'ClassNames',logical([0,1])

'Cost' — Misclassification cost
square matrix | structure array

Misclassification cost, specified as the comma-separated pair consisting of 'Cost' and a square matrix or structure. If you specify:

  • The square matrix Cost, then Cost(i,j) is the cost of classifying a point into class j if its true class is i

  • The structure S, then it must have two fields:

    • S.ClassNames, which contains the class names as a variable of the same data type as Y

    • S.ClassificationCosts, which contains the cost matrix with rows and columns ordered as in S.ClassNames

For two-class learning, if you specify a cost matrix, then the software updates the prior probabilities by incorporating the penalties described in the cost matrix. Subsequently, the cost matrix resets to the default. For more details, see Algorithms.

The defaults are:

  • For one-class learning, Cost = 0.

  • For two-class learning, Cost(i,j) = 1 if i ~= j, and Cost(i,j) = 0 if i = j.

Example: 'Cost',[0,1;2,0]

Data Types: double | single | struct
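
For example, a sketch of the structure form of 'Cost', using the two-class iris subset from the examples on this page:

load fisheriris
inds = ~strcmp(species,'setosa');
S.ClassNames = {'versicolor','virginica'};
S.ClassificationCosts = [0 2; 1 0];   % rows and columns ordered as in S.ClassNames
SVMModel = fitcsvm(meas(inds,3:4),species(inds),'Cost',S);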

'CrossVal' — Flag to train cross-validated classifier
'off' (default) | 'on'

Flag to train a cross-validated classifier, specified as the comma-separated pair consisting of 'CrossVal' and a string.

If you specify 'on', then the software trains a cross-validated classifier with 10 folds.

You can override this cross-validation setting using one of the 'KFold', 'Holdout', 'Leaveout', or 'CVPartition' name-value pair arguments.

You can only use one of these four options at a time for creating a cross-validated model: 'KFold', 'Holdout', 'Leaveout', or 'CVPartition'.

Alternatively, cross-validate SVMModel later by passing it to crossval.

Example: 'CrossVal','on'

Data Types: char

'CVPartition' — Cross-validation partition
[] (default) | cvpartition partition object

Cross-validation partition, specified as the comma-separated pair consisting of 'CVPartition' and a cvpartition partition object as created by cvpartition. The partition object specifies the type of cross-validation, and also the indexing for training and validation sets.

If you specify CVPartition, then you cannot specify any of Holdout, KFold, or Leaveout.

'DeltaGradientTolerance' — Tolerance for gradient difference
nonnegative scalar

Tolerance for the gradient difference between upper and lower violators obtained by Sequential Minimal Optimization (SMO) or Iterative Single Data Algorithm (ISDA), specified as the comma-separated pair consisting of 'DeltaGradientTolerance' and a nonnegative scalar.

If DeltaGradientTolerance is 0, then the software does not use the tolerance for the gradient difference to check for optimization convergence.

The defaults are:

  • 1e-3 if the solver is SMO (for example, you set 'Solver','SMO')

  • 0 if the solver is ISDA (for example, you set 'Solver','ISDA')

Example: 'DeltaGradientTolerance',1e-2

Data Types: double | single

'GapTolerance' — Feasibility gap tolerance
0 (default) | nonnegative scalar

Feasibility gap tolerance obtained by SMO or ISDA, specified as the comma-separated pair consisting of 'GapTolerance' and a nonnegative scalar.

If GapTolerance is 0, then the software does not use the feasibility gap tolerance to check for optimization convergence.

Example: 'GapTolerance',1e-2

Data Types: double | single

'Holdout' — Fraction of data for holdout validation
scalar value in the range (0,1)

Fraction of data used for holdout validation, specified as the comma-separated pair consisting of 'Holdout' and a scalar value in the range (0,1). If you specify 'Holdout',p, then the software:

  1. Randomly reserves p*100% of the data as validation data, and trains the model using the rest of the data

  2. Stores the compact, trained model in the Trained property of the cross-validated model (SVMModel.Trained)

If you specify Holdout, then you cannot specify any of CVPartition, KFold, or Leaveout.

Example: 'Holdout',0.1

Data Types: double | single

'IterationLimit' — Maximal number of numerical optimization iterations
1e6 (default) | positive integer

Maximal number of numerical optimization iterations, specified as the comma-separated pair consisting of 'IterationLimit' and a positive integer.

The software returns a trained classifier regardless of whether the optimization routine successfully converges.

Example: 'IterationLimit',1e8

Data Types: double | single

'KernelFunction' — Kernel function
string

Kernel function used to compute the Gram matrix, specified as the comma-separated pair consisting of 'KernelFunction' and a string.

This table summarizes the available options for setting a kernel function.

  • 'gaussian' or 'rbf': Gaussian or Radial Basis Function (RBF) kernel, the default for one-class learning. Formula:

    G(x1,x2) = exp(–‖x1 – x2‖²)

  • 'linear': Linear kernel, the default for two-class learning. Formula:

    G(x1,x2) = x1′x2

  • 'polynomial': Polynomial kernel. Use 'PolynomialOrder',polyOrder to specify a polynomial kernel of order polyOrder. Formula:

    G(x1,x2) = (1 + x1′x2)^polyOrder

You can set your own kernel function, for example, kernel, by setting 'KernelFunction','kernel'. kernel must have the following form:

function G = kernel(U,V)

where:

  • U is an m-by-p matrix.

  • V is an n-by-p matrix.

  • G is an m-by-n Gram matrix of the rows of U and V.

And kernel.m must be on the MATLAB® path.

It is good practice to avoid using generic names for kernel functions. For example, call a sigmoid kernel function 'mysigmoid' rather than 'sigmoid'.
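
As a sketch, a sigmoid kernel saved as mysigmoid.m could look like the following; the slope and intercept values are arbitrary choices for illustration:

function G = mysigmoid(U,V)
% Sigmoid kernel with a fixed slope gamma and intercept c
gamma = 1;
c = -1;
G = tanh(gamma*U*V' + c);   % m-by-n Gram matrix of the rows of U and V
end

You could then train with SVMModel = fitcsvm(X,Y,'KernelFunction','mysigmoid','Standardize',true), provided that mysigmoid.m is on the MATLAB path.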

Example: 'KernelFunction','gaussian'

Data Types: char

'KernelOffset' — Kernel offset parameter
nonnegative scalar

Kernel offset parameter, specified as the comma-separated pair consisting of 'KernelOffset' and a nonnegative scalar.

The software adds KernelOffset to each element of the Gram matrix.

The defaults are:

  • 0 if the solver is SMO (for example, you set 'Solver','SMO')

  • 0.1 if the solver is ISDA (for example, you set 'Solver','ISDA')

Example: 'KernelOffset',0

Data Types: double | single

'KernelScale' — Kernel scale parameter
1 (default) | 'auto' | positive scalar

Kernel scale parameter, specified as the comma-separated pair consisting of 'KernelScale' and 'auto' or a positive scalar.

  • If KernelFunction is 'gaussian' ('rbf'), 'linear', or 'polynomial', then the software divides all elements of the predictor matrix X by the value of KernelScale. Then, the software applies the appropriate kernel norm to compute the Gram matrix.

  • If you specify 'auto', then the software uses a heuristic procedure to select the scale value. The heuristic procedure uses subsampling. Therefore, to reproduce results, set a random number seed using rng before training the classifier.

  • If you specify KernelScale and your own kernel function, for example, kernel, using 'KernelFunction','kernel', then the software displays an error. You must apply scaling within kernel.

Example: 'KernelScale','auto'

Data Types: double | single | char

'KFold' — Number of folds
10 (default) | positive integer value

Number of folds to use in a cross-validated classifier, specified as the comma-separated pair consisting of 'KFold' and a positive integer value.

You can only use one of these four options at a time to create a cross-validated model: 'KFold', 'Holdout', 'Leaveout', or 'CVPartition'.

Example: 'KFold',8

Data Types: single | double

'KKTTolerance' — Karush-Kuhn-Tucker complementarity conditions violation tolerance
nonnegative scalar

Karush-Kuhn-Tucker (KKT) complementarity conditions violation tolerance, specified as the comma-separated pair consisting of 'KKTTolerance' and a nonnegative scalar.

If KKTTolerance is 0, then the software does not use the KKT complementarity conditions violation tolerance to check for optimization convergence.

The defaults are:

  • 0 if the solver is SMO (for example, you set 'Solver','SMO')

  • 1e-3 if the solver is ISDA (for example, you set 'Solver','ISDA')

Example: 'KKTTolerance',1e-2

Data Types: double | single

'Leaveout' — Leave-one-out cross-validation flag
'off' (default) | 'on'

Leave-one-out cross-validation flag, specified as the comma-separated pair consisting of 'Leaveout' and either 'on' or 'off'. If you specify 'on', then the software implements leave-one-out cross validation.

If you use 'Leaveout', you cannot use these 'CVPartition', 'Holdout', or 'KFold' name-value pair arguments.

Example: 'Leaveout','on'

Data Types: char

'Nu' — ν parameter for one-class learning
0.5 (default) | positive scalar

ν parameter for one-class learning, specified as the comma-separated pair consisting of 'Nu' and a positive scalar. Nu must be greater than 0 and at most 1.

Set Nu to control the tradeoff between ensuring most training examples are in the positive class and minimizing the weights in the score function.

Example: 'Nu',0.25

Data Types: double | single

'NumPrint' — Number of iterations between optimization diagnostic message output
1000 (default) | nonnegative integer

Number of iterations between optimization diagnostic message output, specified as the comma-separated pair consisting of 'NumPrint' and a nonnegative integer.

If you use 'Verbose',1 and 'NumPrint',numprint, then the software displays all optimization diagnostic messages from SMO and ISDA every numprint iterations in the Command Window.

Example: 'NumPrint',500

Data Types: double | single

'OutlierFraction' — Expected proportion of outliers in training data
0 (default) | nonnegative scalar

Expected proportion of outliers in the training data, specified as the comma-separated pair consisting of 'OutlierFraction' and a nonnegative scalar. OutlierFraction must be at least 0 and less than 1.

If you set 'OutlierFraction',outlierfraction, where outlierfraction is a value greater than 0, then:

  • For two-class learning, the software implements robust learning. In other words, the software attempts to remove 100*outlierfraction% of the observations when the optimization algorithm converges. The removed observations correspond to gradients that are large in magnitude.

  • For one-class learning, the software finds an appropriate bias term such that outlierfraction of the observations in the training set have negative scores.

Example: 'OutlierFraction',0.01

Data Types: double | single

'PolynomialOrder' — Polynomial kernel function order
3 (default) | positive integer

Polynomial kernel function order, specified as the comma-separated pair consisting of 'PolynomialOrder' and a positive integer.

If you set 'PolynomialOrder' and KernelFunction is not 'polynomial', then the software displays an error.

Example: 'PolynomialOrder',2

Data Types: double | single

'PredictorNames' — Predictor variable names
{'x1','x2',...} (default) | cell array of strings

Predictor variable names, specified as the comma-separated pair consisting of 'PredictorNames' and a cell array of strings containing the names for the predictor variables, in the order in which they appear in X.

Example: 'PredictorNames',{'PetalWidth','PetalLength'}

Data Types: cell

'Prior' — Prior probabilities
'empirical' (default) | 'uniform' | numeric vector | structure

Prior probabilities for each class, specified as the comma-separated pair consisting of 'Prior' and a string, numeric vector, or a structure.

This table summarizes the available options for setting prior probabilities.

  • 'empirical': The class prior probabilities are the class relative frequencies in Y.

  • 'uniform': All class prior probabilities are equal to 1/K, where K is the number of classes.

  • numeric vector: Each element is a class prior probability. Order the elements according to SVMModel.ClassNames or specify the order using the ClassNames name-value pair argument. The software normalizes the elements such that they sum to 1.

  • structure: A structure S with two fields:

    • S.ClassNames contains the class names as a variable of the same type as Y.

    • S.ClassProbs contains a vector of corresponding prior probabilities. The software normalizes the elements such that they sum to 1.

For two-class learning, if you specify a cost matrix, then the software updates the prior probabilities by incorporating the penalties described in the cost matrix. For more details, see Algorithms.

Example: struct('ClassNames',{{'setosa','versicolor'}},'ClassProbs',[1,2])

Data Types: char | double | single | struct

'ResponseName' — Response variable name
'Y' (default) | string

Response variable name, specified as the comma-separated pair consisting of 'ResponseName' and a string containing the name of the response variable Y.

Example: 'ResponseName','IrisType'

Data Types: char

'ScoreTransform' — Score transform function
'none' (default) | 'doublelogit' | 'invlogit' | 'ismax' | 'logit' | 'sign' | 'symmetric' | 'symmetriclogit' | 'symmetricismax' | function handle

Score transform function, specified as the comma-separated pair consisting of 'ScoreTransform' and a string or function handle.

  • If the value is a string, then it must correspond to a built-in function. This table summarizes the available, built-in functions.

    • 'doublelogit': 1/(1 + e^(–2x))
    • 'invlogit': log(x / (1 – x))
    • 'ismax': Sets the score for the class with the largest score to 1, and the scores for all other classes to 0.
    • 'logit': 1/(1 + e^(–x))
    • 'none': x (no transformation)
    • 'sign': –1 for x < 0; 0 for x = 0; 1 for x > 0
    • 'symmetric': 2x – 1
    • 'symmetriclogit': 2/(1 + e^(–x)) – 1
    • 'symmetricismax': Sets the score for the class with the largest score to 1, and the scores for all other classes to –1.

  • For a MATLAB function, or a function that you define, enter its function handle.

    SVMModel.ScoreTransform = @function;

    function should accept a matrix (the original scores) and return a matrix of the same size (the transformed scores).
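
For example, a sketch of a custom transform that maps scores to the interval (0,1) with a logistic function, passed as a function handle at training time:

load fisheriris
inds = ~strcmp(species,'setosa');
logisticTransform = @(x) 1./(1 + exp(-x));   % accepts a matrix of scores, returns the same size
SVMModel = fitcsvm(meas(inds,3:4),species(inds),...
    'ScoreTransform',logisticTransform);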

Example: 'ScoreTransform','sign'

Data Types: char | function_handle

'ShrinkagePeriod' — Number of iterations between movement of observations from active to inactive set
0 (default) | nonnegative integer

Number of iterations between the movement of observations from the active to inactive set, specified as the comma-separated pair consisting of 'ShrinkagePeriod' and a nonnegative integer.

If you set 'ShrinkagePeriod',0, then the software does not shrink the active set.

Example: 'ShrinkagePeriod',1000

Data Types: double | single

'Solver' — Optimization routine
'ISDA' | 'L1QP' | 'SMO'

Optimization routine, specified as a string.

This table summarizes the available optimization routine options.

  • 'ISDA': Iterative Single Data Algorithm (see [4])
  • 'L1QP': Uses quadprog to implement L1 soft-margin minimization by quadratic programming. This option requires an Optimization Toolbox™ license. For more details, see Quadratic Programming Definition.
  • 'SMO': Sequential Minimal Optimization (see [2])

The defaults are:

  • 'ISDA' if you set 'OutlierFraction' to a positive value for two-class learning

  • 'SMO' otherwise

Example: 'Solver','ISDA'

Data Types: char

'Standardize' — Flag to standardize predictors
false (default) | true

Flag to standardize the predictors, specified as the comma-separated pair consisting of 'Standardize' and true (1) or false (0).

If you set 'Standardize',true, then the software centers and scales each column of the predictor data (X) by the column mean and standard deviation, respectively. It is good practice to standardize the predictor data.

Example: 'Standardize',true

Data Types: logical

'Verbose' — Verbosity level
0 (default) | 1 | 2

Verbosity level, specified as the comma-separated pair consisting of 'Verbose' and either 0, 1, or 2. Verbose controls the amount of optimization information that the software displays in the Command Window and saves as a structure to SVMModel.ConvergenceInfo.History.

This table summarizes the available verbosity level options.

  • 0: The software does not display or save convergence information.
  • 1: The software displays diagnostic messages and saves convergence criteria every numprint iterations, where numprint is the value of the name-value pair argument 'NumPrint'.
  • 2: The software displays diagnostic messages and saves convergence criteria at every iteration.

Example: 'Verbose',1

Data Types: double | single

'Weights' — Observation weights
ones(size(X,1),1) (default) | numeric vector

Observation weights, specified as the comma-separated pair consisting of 'Weights' and a numeric vector.

The size of Weights must equal the number of rows of X. The software weighs the observations in each row of X with the corresponding weight in Weights.

The software normalizes Weights to sum up to the value of the prior probability in the respective class.

Data Types: double | single

Output Arguments

SVMModel — Trained SVM classifier
ClassificationSVM classifier | ClassificationPartitionedModel cross-validated classifier

Trained SVM classifier, returned as a ClassificationSVM classifier or ClassificationPartitionedModel cross-validated classifier.

If you set any of the name-value pair arguments KFold, Holdout, Leaveout, CrossVal, or CVPartition, then SVMModel is a ClassificationPartitionedModel cross-validated classifier. Otherwise, SVMModel is a ClassificationSVM classifier.

To reference properties of SVMModel, use dot notation. For example, enter SVMModel.Alpha in the Command Window to display the trained Lagrange multipliers.

Limitations

  • fitcsvm trains SVM classifiers for one- or two-class learning applications. To train SVM classifiers using data with more than two classes, use fitcecoc.
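
For example, a minimal sketch of a three-class fit with fitcecoc, which combines multiple binary SVM learners by default:

load fisheriris
Mdl = fitcecoc(meas,species);   % multiclass model built from binary SVM learners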

More About

Box Constraint

A parameter that controls the maximum penalty imposed on margin-violating observations, and aids in preventing overfitting (regularization).

If you increase the box constraint, then the SVM classifier assigns fewer support vectors. However, increasing the box constraint can lead to longer training times.

Gram Matrix

The Gram matrix of a set of n vectors {x1,...,xn; xj ∈ Rp} is an n-by-n matrix with element (j,k) defined as G(xj,xk) = <ϕ(xj),ϕ(xk)>, an inner product of the transformed predictors using the kernel function ϕ.

For nonlinear SVM, the algorithm forms a Gram matrix using the predictor matrix columns. The dual formalization replaces the inner product of the predictors with corresponding elements of the resulting Gram matrix (called the "kernel trick"). Subsequently, nonlinear SVM operates in the transformed predictor space to find a separating hyperplane.

Karush-Kuhn-Tucker Complementarity Conditions

KKT complementarity conditions are optimization constraints required for optimal nonlinear programming solutions.

In SVM, the KKT complementarity conditions are

αj[yj(w′ϕ(xj) + b) – 1 + ξj] = 0
ξj(C – αj) = 0

for all j = 1,...,n, where w is the vector of weights, ϕ is a kernel function (see Gram matrix), and ξj is a slack variable. If the classes are perfectly separable, then ξj = 0 for all j = 1,...,n.

One-Class Learning

One-class learning, or unsupervised SVM, aims at separating data from the origin in the high-dimensional, predictor space (not the original predictor space), and is an algorithm used for outlier detection.

The algorithm resembles that of SVM for binary classification. The objective is to minimize dual expression

0.5 ∑j ∑k αjαkG(xj,xk)

with respect to α1,...,αn, subject to

∑j αj = nν

and 0 ≤ αj ≤ 1 for all j = 1,...,n. G(xj,xk) is element (j,k) of the Gram matrix.

A small value of ν leads to fewer support vectors, and, therefore, a smooth, crude decision boundary. A large value of ν leads to more support vectors, and therefore, a curvy, flexible decision boundary. The optimal value of ν should be large enough to capture the data complexity and small enough to avoid overtraining. Also, 0 < ν ≤ 1.

For more details, see [5].
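
A sketch comparing two values of ν on the one-class iris example from earlier on this page; the support-vector counts depend on the data:

load fisheriris
X = meas(:,1:2);
y = ones(size(X,1),1);        % treat all irises as one class
rng(1);
MdlSmallNu = fitcsvm(X,y,'Standardize',true,'Nu',0.05);
MdlLargeNu = fitcsvm(X,y,'Standardize',true,'Nu',0.5);
numSV = [sum(MdlSmallNu.IsSupportVector) sum(MdlLargeNu.IsSupportVector)]
% A smaller Nu typically yields fewer support vectors and a smoother, cruder boundary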

Support Vector

Support vectors are observations corresponding to strictly positive estimates of α1,...,αn.

SVM classifiers that yield fewer support vectors for a given training set are more desirable.

Support Vector Machines for Binary Classification

The SVM binary classification algorithm searches for an optimal hyperplane that separates the data into two classes. For separable classes, the optimal hyperplane maximizes a margin (space that does not contain any observations) surrounding itself, which creates boundaries for the positive and negative classes. For inseparable classes, the objective is the same, but the algorithm imposes a penalty on the length of the margin for every observation that is on the wrong side of its class boundary.

The linear SVM score function is

f(x) = x′β + β0,

where:

  • x is an observation (corresponding to a row of X).

  • The vector β contains the coefficients that define an orthogonal vector to the hyperplane (corresponding to SVMModel.Beta). For separable data, the optimal margin length is 2/‖β‖.

  • β0 is the bias term (corresponding to SVMModel.Bias).

The root of f(x) for particular coefficients defines a hyperplane. For a particular hyperplane, f(z) is the distance from point z to the hyperplane.
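
As a sketch, you can check this relationship on a linear SVM trained with the default kernel scale and no standardization: the positive-class scores returned by predict should match x′β + β0 up to numerical precision.

load fisheriris
inds = ~strcmp(species,'setosa');
X = meas(inds,3:4);
y = species(inds);
Mdl = fitcsvm(X,y);                   % linear kernel by default for two-class learning
[~,scores] = predict(Mdl,X);
f = X*Mdl.Beta + Mdl.Bias;            % linear score function evaluated directly
maxDiff = max(abs(f - scores(:,2)))   % should be near zero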

An SVM classifier searches for the maximum margin length, while keeping observations in the positive (y = 1) and negative (y = –1) classes separate. Therefore:

  • For separable classes, the objective is to minimize ‖β‖ with respect to β and β0, subject to yjf(xj) ≥ 1 for all j = 1,..,n. This is the primal formalization for separable classes.

  • For inseparable classes, SVM uses slack variables (ξj) to penalize the objective function for observations that cross the margin boundary for their class. ξj = 0 for observations that do not cross the margin boundary for their class, otherwise ξj ≥ 0.

    The objective is to minimize 0.5‖β‖² + C∑jξj with respect to β, β0, and ξj, subject to yjf(xj) ≥ 1 – ξj and ξj ≥ 0 for all j = 1,..,n, and for a positive scalar box constraint C. This is the primal formalization for inseparable classes.

SVM uses the Lagrange multipliers method to optimize the objective. This introduces n coefficients α1,...,αn (corresponding to SVMModel.Alpha). The dual formalizations for linear SVM are:

  • For separable classes, minimize

    0.5 ∑j ∑k αjαkyjykxj′xk – ∑j αj

    with respect to α1,...,αn, subject to ∑j αjyj = 0, αj ≥ 0 for all j = 1,...,n, and Karush-Kuhn-Tucker (KKT) complementarity conditions.

  • For inseparable classes, the objective is the same as for separable classes, except for the additional condition 0 ≤ αj ≤ C for all j = 1,..,n.

The resulting score function is

f(x) = ∑j α̂jyjx′xj + b̂.

The score function is free of the estimate of β as a result of the primal formalization.

In some cases, there is a nonlinear boundary separating the classes. Nonlinear SVM works in a transformed predictor space to find an optimal, separating hyperplane.

The dual formalization for nonlinear SVM is

0.5 ∑j ∑k αjαkyjykG(xj,xk) – ∑j αj

with respect to α1,...,αn, subject to ∑j αjyj = 0, 0 ≤ αj ≤ C for all j = 1,..,n, and the KKT complementarity conditions. G(xj,xk) are elements of the Gram matrix. The resulting score function is

f(x) = ∑j αjyjG(x,xj) + b.

For more details, see Understanding Support Vector Machines, [1], and [3].

Tips

  • For one-class learning:

    • The default setting for the name-value pair argument 'Alpha' can lead to long training times. To speed up training, set Alpha to a vector mostly composed of 0s.

    • Set the name-value pair argument Nu to a value closer to 0 to yield fewer support vectors and, therefore, a smoother but cruder decision boundary.

  • Sparsity in support vectors is a desirable property of an SVM classifier. To decrease the number of support vectors, set BoxConstraint to a large value. This also increases the training time.

  • For large data sets, try optimizing the cache size. This can have a significant impact on the training speed.

  • If the support vector set is much less than the number of observations in the training set, then you might significantly speed up convergence by shrinking the active-set using the name-value pair argument 'ShrinkagePeriod'. It is good practice to use 'ShrinkagePeriod',1000.
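
For example, a sketch of a call that combines these tips; the bundled ionosphere data stands in for a larger data set:

load ionosphere
Mdl = fitcsvm(X,Y,'Standardize',true,'CacheSize','maximal',...
    'ShrinkagePeriod',1000);   % larger cache and periodic shrinking of the active set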

Algorithms

  • All solvers implement L1 soft-margin minimization.

  • fitcsvm and svmtrain use, among other algorithms, SMO for optimization. The software implements SMO differently between the two functions, but numerical studies show that there is sensible agreement in the results.

  • For one-class learning, the software estimates the Lagrange multipliers, α1,...,αn, such that

    ∑j αj = nν.

  • For two-class learning, if you specify a cost matrix C, then the software updates the class prior probabilities (p) to pc by incorporating the penalties described in C. The formula for the updated prior probability vector is

    pc = p′C / ∑(p′C).

    Subsequently, the software resets the cost matrix to the default:

    C = [0 1; 1 0].

  • If you set 'Standardize',true when you train the SVM classifier using fitcsvm, then the software trains the classifier using the standardized predictor matrix, but stores the unstandardized data in the classifier property X. However, if you standardize the data, then the data size in memory doubles until optimization ends.

  • If you set 'Standardize',true and any of 'Cost', 'Prior', or 'Weights', then the software standardizes the predictors using their corresponding weighted means and weighted standard deviations.

  • Let p be the proportion of outliers you expect in the training data. If you use 'OutlierFraction',p when you train the SVM classifier using fitcsvm, then:

    • For one-class learning, the software trains the bias term such that 100p% of the observations in the training data have negative scores.

    • The software implements robust learning for two-class learning. In other words, the software attempts to remove 100p% of the observations when the optimization algorithm converges. The removed observations correspond to gradients that are large in magnitude.

References

[1] Christianini, N., and J. C. Shawe-Taylor. An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods. Cambridge, UK: Cambridge University Press, 2000.

[2] Fan, R.-E., P.-H. Chen, and C.-J. Lin. "Working set selection using second order information for training support vector machines." Journal of Machine Learning Research, Vol 6, 2005, pp. 1889–1918.

[3] Hastie, T., R. Tibshirani, and J. Friedman. The Elements of Statistical Learning, Second Edition. NY: Springer, 2008.

[4] Kecman V., T. -M. Huang, and M. Vogt. "Iterative Single Data Algorithm for Training Kernel Machines from Huge Data Sets: Theory and Performance." In Support Vector Machines: Theory and Applications. Edited by Lipo Wang, 255–274. Berlin: Springer-Verlag, 2005.

[5] Scholkopf, B., J. C. Platt, J. C. Shawe-Taylor, A. J. Smola, and R. C. Williamson. "Estimating the Support of a High-Dimensional Distribution." Neural Comput., Vol. 13, Number 7, 2001, pp. 1443–1471.

[6] Scholkopf, B., and A. Smola. Learning with Kernels: Support Vector Machines, Regularization, Optimization and Beyond, Adaptive Computation and Machine Learning. Cambridge, MA: The MIT Press, 2002.
