`templateLinear`

Linear classification learner template

`templateLinear` creates a template suitable for fitting a linear classification model to high-dimensional data for multiclass problems.

The template specifies the binary learner model, regularization type and strength, and solver, among other things. After creating the template, train the model by passing the template and data to `fitcecoc`.

`t = templateLinear()` returns a linear classification learner template.

If you specify a default template, then the software uses default values for all input arguments during training.

`t = templateLinear(Name,Value)` returns a template with additional options specified by one or more name-value pair arguments. For example, you can specify to implement logistic regression, specify the regularization type or strength, or specify the solver to use for objective-function minimization.

If you display `t` in the Command Window, then all options appear empty (`[]`) except options that you specify using name-value pair arguments. During training, the software uses default values for empty options.
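As a short sketch of both syntaxes (the variables `X` and `Y` stand in for your own predictor matrix and class labels):

```matlab
% Default template: all options display as empty ([]), so training
% uses default values for every one of them.
t = templateLinear()

% Template with options: only the specified options appear filled in.
t = templateLinear('Learner','logistic','Regularization','ridge')

% Train a multiclass ECOC model using the template.
% Mdl = fitcecoc(X,Y,'Learners',t);
```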

It is a best practice to orient your predictor matrix so that observations correspond to columns and to specify `'ObservationsIn','columns'`. As a result, you can experience a significant reduction in optimization execution time.

For better optimization accuracy if the predictor data is high-dimensional and `Regularization` is `'ridge'`, set `Solver` to any of these combinations:

- `'sgd'`
- `'asgd'`
- `'dual'` if `Learner` is `'svm'`
- `{'sgd','lbfgs'}`
- `{'asgd','lbfgs'}`
- `{'dual','lbfgs'}` if `Learner` is `'svm'`

Other combinations can result in poor optimization accuracy.
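For example, one of the recommended combinations for a ridge-penalized SVM learner, sketched here with a hypothetical p-by-n predictor matrix `X` (observations as columns) and labels `Y`:

```matlab
% {'dual','lbfgs'} runs the dual SGD solver first, then refines the
% solution with LBFGS; valid because Learner is 'svm'.
t = templateLinear('Learner','svm','Regularization','ridge', ...
    'Solver',{'dual','lbfgs'});

% Orient observations as columns for faster optimization.
% Mdl = fitcecoc(X,Y,'Learners',t,'ObservationsIn','columns');
```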

For better optimization accuracy if the predictor data is moderate- through low-dimensional and `Regularization` is `'ridge'`, set `Solver` to `'bfgs'`.

If `Regularization` is `'lasso'`, set `Solver` to any of these combinations:

- `'sgd'`
- `'asgd'`
- `'sparsa'`
- `{'sgd','sparsa'}`
- `{'asgd','sparsa'}`
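For instance, a lasso-penalized template using one of the listed combinations:

```matlab
% {'sgd','sparsa'} runs SGD first for a quick initial solution,
% then refines it with the SpaRSA solver.
t = templateLinear('Regularization','lasso','Solver',{'sgd','sparsa'});
```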

When choosing between SGD and ASGD, consider that:

- SGD takes less time per iteration, but requires more iterations to converge.
- ASGD requires fewer iterations to converge, but takes more time per iteration.

If the predictor data has few observations but many predictor variables, then:

- Specify `'PostFitBias',true`.
- For SGD or ASGD solvers, set `PassLimit` to a positive integer that is greater than 1, for example, 5 or 10. This setting often results in better accuracy.
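A template following both recommendations might look like this (the solver choice is illustrative):

```matlab
% Few observations, many predictors: refit the bias term after
% optimization and allow several passes through the data.
t = templateLinear('Solver','sgd','PostFitBias',true,'PassLimit',5);
```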

For SGD and ASGD solvers, `BatchSize` affects the rate of convergence.

- If `BatchSize` is too small, then the software achieves the minimum in many iterations, but computes the gradient per iteration quickly.
- If `BatchSize` is too large, then the software achieves the minimum in fewer iterations, but computes the gradient per iteration slowly.
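To experiment with this trade-off, set `BatchSize` explicitly (the value 20 here is only a starting point, not a recommendation):

```matlab
% Each SGD iteration computes the gradient on a mini-batch of
% 20 observations.
t = templateLinear('Solver','sgd','BatchSize',20);
```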

A large learning rate (see `LearnRate`) speeds up convergence to the minimum, but can lead to divergence (that is, overstepping the minimum). A small learning rate ensures convergence to the minimum, but can lead to slow termination.

If `Regularization` is `'lasso'`, then experiment with various values of `TruncationPeriod`. For example, set `TruncationPeriod` to `1`, `10`, and then `100`.
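One way to run this experiment is to build one template per candidate value and compare the resulting models, for example by cross-validation loss (the sweep below is a sketch):

```matlab
% Create one lasso template per truncation period. TruncationPeriod
% applies to the SGD and ASGD solvers.
periods = [1 10 100];
templates = cell(size(periods));
for k = 1:numel(periods)
    templates{k} = templateLinear('Regularization','lasso', ...
        'Solver','sgd','TruncationPeriod',periods(k));
end
```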

For efficiency, the software does not standardize predictor data. To standardize the predictor data `X`, enter

```matlab
X = bsxfun(@rdivide,bsxfun(@minus,X,mean(X,2)),std(X,0,2));
```

The code requires that you orient the predictors and observations as the rows and columns of `X`, respectively. Also, for memory-usage economy, the code replaces the original predictor data with the standardized data.
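In MATLAB R2016b and later, implicit expansion makes the `bsxfun` calls unnecessary; the following line is an equivalent standardization under the same orientation assumption:

```matlab
% Center and scale each predictor (row) across observations (dim 2).
X = (X - mean(X,2)) ./ std(X,0,2);
```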