Support vector machine (SVM) analysis is a popular machine learning tool for classification and regression, first identified by Vladimir Vapnik and his colleagues in 1992[5]. SVM regression is considered a nonparametric technique because it relies on kernel functions.

Statistics and Machine Learning Toolbox™ implements linear epsilon-insensitive
SVM (ε-SVM) regression, which is also known as *L*1
loss. In *ε*-SVM regression, the set of training
data includes predictor variables and observed response values. The
goal is to find a function *f*(*x*) that
deviates from *y _{n}* by a value
no greater than ε for each training point

Suppose we have a set of training data where *x _{n}* is
a multivariate set of

To find the linear function

$$f(x)=x\prime \beta +b,$$

and ensure that it is as flat as possible, find *f*(*x*) with
the minimal norm value (*β*′*β*).
This is formulated as a convex optimization problem to minimize

$$J\left(\beta \right)=\frac{1}{2}\beta \prime \beta $$

subject to all residuals having a value less than ε; or, in equation form:

$$\forall n:\left|{y}_{n}-\left({x}_{n}\prime \beta +b\right)\right|\le \epsilon \text{\hspace{0.17em}}.$$

It is possible that no such function *f*(*x*) exists
to satisfy these constraints for all points. To deal with otherwise
infeasible constraints, introduce slack variables *ξ _{n}* and

Including slack variables leads to the objective function, also known as the primal formula[5]:

$$J\left(\beta \right)=\frac{1}{2}\beta \prime \beta +C{\displaystyle \sum _{n=1}^{N}\left({\xi}_{n}+{\xi}_{n}^{*}\right)}\text{\hspace{0.17em}},$$

subject to:

$$\begin{array}{l}\forall n:{y}_{n}-\left({x}_{n}\prime \beta +b\right)\le \epsilon +{\xi}_{n}\\ \forall n:\left({x}_{n}\prime \beta +b\right)-{y}_{n}\le \epsilon +{\xi}_{n}^{*}\\ \forall n:{\xi}_{n}^{*}\ge 0\\ \forall n:{\xi}_{n}\ge 0\text{\hspace{0.17em}}.\end{array}$$

The constant *C* is the box constraint, a positive
numeric value that controls the penalty imposed on observations that
lie outside the epsilon margin (*ε*) and helps
to prevent overfitting (regularization). This value determines the
trade-off between the flatness of *f*(*x*) and
the amount up to which deviations larger than *ε* are
tolerated.

The linear ε-insensitive loss function ignores errors
that are within *ε* distance of the observed
value by treating them as equal to zero. The loss is measured based
on the distance between observed value *y* and the *ε* boundary.
This is formally described by

$${L}_{\epsilon}=\{\begin{array}{ll}0\hfill & \text{if}\left|y-f\left(x\right)\right|\le \epsilon \hfill \\ \left|y-f\left(x\right)\right|-\epsilon \hfill & \text{otherwise}\hfill \end{array}$$

The optimization problem previously described is computationally simpler to solve in its Lagrange dual formulation. The solution to the dual problem provides a lower bound to the solution of the primal (minimization) problem. The optimal values of the primal and dual problems need not be equal, and the difference is called the “duality gap.” But when the problem is convex and satisfies a constraint qualification condition, the value of the optimal solution to the primal problem is given by the solution of the dual problem.

To obtain the dual formula, construct a Lagrangian function
from the primal function by introducing nonnegative multipliers *α _{n}* and

$$L\left(\alpha \right)=\frac{1}{2}{\displaystyle \sum _{i=1}^{N}{\displaystyle \sum _{j=1}^{N}\left({\alpha}_{i}-{\alpha}_{i}^{*}\right)\left({\alpha}_{j}-{\alpha}_{j}^{*}\right){x}_{i}\prime {x}_{j}+\epsilon {\displaystyle \sum _{i=1}^{N}\left({\alpha}_{i}+{\alpha}_{i}^{*}\right)}}}+{\displaystyle \sum _{i=1}^{N}{y}_{i}\left({\alpha}_{i}^{*}-{\alpha}_{i}\right)}$$

subject to the constraints

$$\begin{array}{l}{\displaystyle \sum _{n=1}^{N}\left({\alpha}_{n}-{\alpha}_{n}^{*}\right)=0}\\ \forall n:0\le {\alpha}_{n}\le C\\ \forall n:0\le {\alpha}_{n}^{*}\le C\text{\hspace{0.17em}}.\end{array}$$

The *β* parameter can be completely described
as a linear combination of the training observations using the equation

$$\beta ={\displaystyle \sum _{n=1}^{N}\left({\alpha}_{n}-{\alpha}_{n}^{*}\right){x}_{n}\text{\hspace{0.17em}}.}$$

The function used to predict new values depends only on the support vectors:

$$f\left(x\right)={\displaystyle \sum _{n=1}^{N}\left({\alpha}_{n}-{\alpha}_{n}^{*}\right)\left({x}_{n}\prime x\right)+b\text{\hspace{0.17em}}.}$$ | (1) |

The Karush-Kuhn-Tucker (KKT) complementarity conditions are optimization constraints required to obtain optimal solutions. For linear SVM regression, these conditions are

$$\begin{array}{l}\forall n:{\alpha}_{n}\left(\epsilon +{\xi}_{n}-{y}_{n}+{x}_{n}\prime \beta +b\right)=0\\ \forall n:{\alpha}_{n}^{*}\left(\epsilon +{\xi}_{n}^{*}+{y}_{n}-{x}_{n}\prime \beta -b\right)=0\\ \forall n:{\xi}_{n}\left(C-{\alpha}_{n}\right)=0\\ \forall n:{\xi}_{n}^{*}\left(C-{\alpha}_{n}^{*}\right)=0\text{\hspace{0.17em}}.\end{array}$$

These conditions indicate that all observations strictly inside the epsilon
tube have Lagrange multipliers *α _{n}* = 0 and

The property Alpha of a trained SVM
model stores the difference between two Lagrange multipliers of support vectors, *α _{n}* –

Some regression problems cannot adequately be described using a linear model. In such a case, the Lagrange dual formulation allows the previously-described technique to be extended to nonlinear functions.

Obtain a nonlinear SVM regression model by replacing the dot
product *x*_{1}′*x*_{2} with
a nonlinear kernel function *G*(*x*_{1},*x*_{2})
= <*φ*(*x*_{1}),*φ*(*x*_{2})>,
where *φ*(*x*) is a transformation
that maps *x* to a high-dimensional space. Statistics and Machine Learning Toolbox provides
the following built-in semidefinite kernel functions.

Kernel Name | Kernel Function |
---|---|

Linear (dot product) | $$G({x}_{j},{x}_{k})={x}_{j}\prime {x}_{k}$$ |

Gaussian | $$G\left({x}_{j},{x}_{k}\right)=\mathrm{exp}\left(-{\Vert {x}_{j}-{x}_{k}\Vert}^{2}\right)$$ |

Polynomial | $$G({x}_{j},{x}_{k})={(1+{x}_{j}\prime {x}_{k})}^{q}$$, where q is in the set
{2,3,...}. |

The *Gram matrix* is an *n*-by-*n* matrix
that contains elements *g _{i},_{j}* =
G(

The dual formula for nonlinear SVM regression replaces the inner
product of the predictors (*x _{i}*′

Nonlinear SVM regression finds the coefficients that minimize

$$L\left(\alpha \right)=\frac{1}{2}{\displaystyle \sum _{i=1}^{N}{\displaystyle \sum _{j=1}^{N}\left({\alpha}_{i}-{\alpha}_{i}^{*}\right)\left({\alpha}_{j}-{\alpha}_{j}^{*}\right)G\left({x}_{i},{x}_{j}\right)+\epsilon {\displaystyle \sum _{i=1}^{N}\left({\alpha}_{i}+{\alpha}_{i}^{*}\right)-{\displaystyle \sum _{i=1}^{N}{y}_{i}\left({\alpha}_{i}-{\alpha}_{i}^{*}\right)}}}}$$

subject to

$$\begin{array}{l}{\displaystyle \sum _{n=1}^{N}\left({\alpha}_{n}-{\alpha}_{n}^{*}\right)}=0\\ \forall n:0\le {\alpha}_{n}\le C\\ \forall n:0\le {\alpha}_{n}^{*}\le C\text{\hspace{0.17em}}.\end{array}$$

The function used to predict new values is equal to

$$f\left(x\right)={\displaystyle \sum _{n=1}^{N}\left({\alpha}_{n}-{\alpha}_{n}^{*}\right)G\left({x}_{n},x\right)+b}\text{\hspace{0.17em}}.$$ | (2) |

The KKT complementarity conditions are

$$\begin{array}{l}\forall n:{\alpha}_{n}\left(\epsilon +{\xi}_{n}-{y}_{n}+f\left({x}_{n}\right)\right)=0\\ \forall n:{\alpha}_{n}^{*}\left(\epsilon +{\xi}_{n}^{*}+{y}_{n}-f\left({x}_{n}\right)\right)=0\\ \forall n:{\xi}_{n}\left(C-{\alpha}_{n}\right)=0\\ \forall n:{\xi}_{n}^{*}\left(C-{\alpha}_{n}^{*}\right)=0\text{\hspace{0.17em}}.\end{array}$$

The minimization problem can be expressed in standard quadratic programming form and solved using common quadratic programming techniques. However, it can be computationally expensive to use quadratic programming algorithms, especially since the Gram matrix may be too large to be stored in memory. Using a decomposition method instead can speed up the computation and avoid running out of memory.

*Decomposition methods* (also called *chunking
and working set methods*) separate all observations into
two disjoint sets: the working set and the remaining set. A decomposition
method modifies only the elements in the working set in each iteration.
Therefore, only some columns of the Gram matrix are needed in each
iteration, which reduces the amount of storage needed for each iteration.

*Sequential minimal optimization* (SMO) is
the most popular approach for solving SVM problems[4]. SMO performs a series of two-point
optimizations. In each iteration, a working set of two points are
chosen based on a selection rule that uses second-order information.
Then the Lagrange multipliers for this working set are solved analytically
using the approach described in [2] and [1].

In SVM regression, the gradient vector $$\nabla L$$ for the active set is updated after each iteration. The decomposed equation for the gradient vector is

$${\left(\nabla L\right)}_{n}=\{\begin{array}{c}{\displaystyle \sum _{i=1}^{N}\left({\alpha}_{i}-{\alpha}_{i}^{*}\right)G\left({x}_{i},{x}_{n}\right)+\epsilon -{y}_{n}\text{\hspace{0.17em}},\text{\hspace{0.17em}}n\le N}\\ -{\displaystyle \sum _{i=1}^{N}\left({\alpha}_{i}-{\alpha}_{i}^{*}\right)G\left({x}_{i},{x}_{n}\right)+\epsilon +{y}_{n}\text{\hspace{0.17em}},\text{\hspace{0.17em}}n>N}\end{array}\text{\hspace{0.17em}}.$$

*Iterative single data algorithm* (ISDA)
updates one Lagrange multiplier with each iteration[3]. ISDA is often
conducted without the bias term *b* by adding a small
positive constant *a* to the kernel function. Dropping *b* drops
the sum constraint

$$\sum _{n=1}^{N}\left({\alpha}_{i}-{\alpha}^{*}\right)=0$$

in the dual equation. This allows us to update one Lagrange multiplier in each iteration,
which makes it easier than SMO to remove outliers. ISDA selects the worst KKT
violator among all the *α*_{n} and *α*_{n}^{*} values as the working set to be updated.

Each of these solver algorithms iteratively computes until the specified convergence criterion is met. There are several options for convergence criteria:

*Feasibility gap*— The feasibility gap is expressed as$$\Delta =\frac{J\left(\beta \right)+L\left(\alpha \right)}{J\left(\beta \right)+1}\text{\hspace{0.17em}},$$

where

*J*(*β*) is the primal objective and*L*(*α*) is the dual objective. After each iteration, the software evaluates the feasibility gap. If the feasibility gap is less than the value specified by`GapTolerance`

, then the algorithm met the convergence criterion and the software returns a solution.*Gradient difference*— After each iteration, the software evaluates the gradient vector, $$\nabla L$$. If the difference in gradient vector values for the current iteration and the previous iteration is less than the value specified by`DeltaGradientTolerance`

, then the algorithm met the convergence criterion and the software returns a solution.*Largest KKT violation*— After each iteration, the software evaluates the KKT violation for all the*α*_{n}and*α*_{n}^{*}values. If the largest violation is less than the value specified by`KKTTolerance`

, then the algorithm met the convergence criterion and the software returns a solution.

[1] Fan, R.E. , P.H. Chen, and C.J. Lin. *A
Study on SMO-Type Decomposition Methods for Support Vector Machines.* IEEE
Transactions on Neural Networks, 17:893–908, 2006.

[2] Fan, R.E. , P.H. Chen, and C.J. Lin. *Working
Set Selection Using Second Order Information for Training Support
Vector Machines.* The Journal of machine Learning Research,
6:1871–1918, 2005.

[3] Huang, T.M., V. Kecman, and I. Kopriva. *Kernel
Based Algorithms for Mining Huge Data Sets: Supervised, Semi-Supervised,
and Unsupervised Learning.* Springer, New York, 2006.

[4] Platt, J. *Sequential Minimal
Optimization: A Fast Algorithm for Training Support Vector Machines.* Technical
Report MSR-TR-98–14, 1999.

[5] Vapnik, V. *The Nature of Statistical
Learning Theory.* Springer, New York, 1995.

`fitrsvm`

| `predict`

| `RegressionSVM`

| `resubPredict`