# trainingOptions

Options for training a neural network

## Syntax

```
options = trainingOptions(solverName)
options = trainingOptions(solverName,Name,Value)
```

## Description

`options = trainingOptions(solverName)` returns a set of training options for the solver specified by `solverName`.


`options = trainingOptions(solverName,Name,Value)` returns a set of training options with additional options specified by one or more `Name,Value` pair arguments.

## Examples


### Specify Training Options

Create a set of options for training a network using stochastic gradient descent with momentum. Reduce the learning rate by a factor of 0.2 every 5 epochs. Set the maximum number of epochs for training to 20, and use a mini-batch with 300 observations at each iteration. Specify a path for saving checkpoint networks after every epoch.

```matlab
options = trainingOptions('sgdm',...
    'LearnRateSchedule','piecewise',...
    'LearnRateDropFactor',0.2,...
    'LearnRateDropPeriod',5,...
    'MaxEpochs',20,...
    'MiniBatchSize',300,...
    'CheckpointPath','C:\TEMP\checkpoint');
```

### Plot Training Accuracy During Network Training

Plot the training accuracy at each iteration of the training process. First, load the training data.

```matlab
[XTrain,YTrain] = digitTrain4DArrayData;
```

Construct a simple network to classify the digit image data.

```matlab
layers = [ ...
    imageInputLayer([28 28 1],'Normalization','none')
    convolution2dLayer(6,20)
    reluLayer
    maxPooling2dLayer(2,'Stride',2)
    fullyConnectedLayer(10)
    softmaxLayer
    classificationLayer];
```

Save the function `plotTrainingAccuracy` on the MATLAB® path that plots training accuracy against the current iteration. `plotTrainingAccuracy` is defined at the end of this example.

Specify the training options. Set `'OutputFcn'` to be the `plotTrainingAccuracy` function. For quick training, set `'MaxEpochs'` to 5 and `'InitialLearnRate'` to 0.1. Train the network using `trainNetwork`.

```matlab
options = trainingOptions('sgdm','Verbose',false, ...
    'MaxEpochs',5, ...
    'InitialLearnRate',0.1, ...
    'OutputFcn',@plotTrainingAccuracy);
net = trainNetwork(XTrain,YTrain,layers,options);
```

Use the custom function `plotTrainingAccuracy` to plot `info.TrainingAccuracy` against `info.Iteration` at each function call.

```matlab
function plotTrainingAccuracy(info)

persistent plotObj

if info.State == "start"
    plotObj = animatedline;
    xlabel("Iteration")
    ylabel("Training Accuracy")
elseif info.State == "iteration"
    addpoints(plotObj,info.Iteration,info.TrainingAccuracy)
    drawnow limitrate nocallbacks
end

end
```

### Plot Progress and Stop Training at Specified Accuracy

Plot the training accuracy at each iteration, and stop training early if the mean accuracy of the previous 50 iterations reaches 95%. First, load the training data.

```matlab
[XTrain,YTrain] = digitTrain4DArrayData;
```

Construct a simple network to classify the digit image data.

```matlab
layers = [ ...
    imageInputLayer([28 28 1],'Normalization','none')
    convolution2dLayer(6,20)
    reluLayer
    maxPooling2dLayer(2,'Stride',2)
    fullyConnectedLayer(10)
    softmaxLayer
    classificationLayer];
```

Save the custom output functions `plotTrainingAccuracy` and `stopTrainingAtThreshold` on the MATLAB® path. `plotTrainingAccuracy` plots training progress, and if the mean accuracy of the previous 50 iterations reaches 95%, then `stopTrainingAtThreshold` stops training early. These functions are defined at the end of this example.

Specify custom output functions as a cell array of function handles. Set the output functions to be `plotTrainingAccuracy` and `stopTrainingAtThreshold` with a threshold of 95%.

```matlab
functions = { ...
    @plotTrainingAccuracy, ...
    @(info) stopTrainingAtThreshold(info,95)};
```

Specify the training options. Set `'OutputFcn'` to be the cell array of function handles `functions`. Train the network using `trainNetwork`.

```matlab
options = trainingOptions('sgdm','Verbose',false, ...
    'InitialLearnRate',0.1, ...
    'OutputFcn',functions);
net = trainNetwork(XTrain,YTrain,layers,options);
```

Update the plot at each iteration using `plotTrainingAccuracy` and `stopTrainingAtThreshold`. Use the custom function `plotTrainingAccuracy` to plot `info.TrainingAccuracy` against `info.Iteration`. Use `stopTrainingAtThreshold(info,thr)` to stop training if the mean accuracy of the previous 50 iterations is greater than `thr`.

```matlab
function plotTrainingAccuracy(info)

persistent plotObj

if info.State == "start"
    plotObj = animatedline;
    xlabel("Iteration")
    ylabel("Training Accuracy")
elseif info.State == "iteration"
    addpoints(plotObj,info.Iteration,info.TrainingAccuracy)
    drawnow limitrate nocallbacks
end

end

function stop = stopTrainingAtThreshold(info,thr)

stop = false;
if info.State ~= "iteration"
    return
end

persistent iterationAccuracy

% Append accuracy for this iteration
iterationAccuracy = [iterationAccuracy info.TrainingAccuracy];

% Evaluate mean of iteration accuracy and remove oldest entry
if numel(iterationAccuracy) == 50
    stop = mean(iterationAccuracy) > thr;
    iterationAccuracy(1) = [];
end

end
```

## Input Arguments


Solver to use for training the network, specified as `'sgdm'` (stochastic gradient descent with momentum). This is the only supported solver.

### Name-Value Pair Arguments

Specify optional comma-separated pairs of `Name,Value` arguments. `Name` is the argument name and `Value` is the corresponding value. `Name` must appear inside single quotes (`' '`). You can specify several name and value pair arguments in any order as `Name1,Value1,...,NameN,ValueN`.

Example: `'InitialLearnRate',0.03,'L2Regularization',0.0005,'LearnRateSchedule','piecewise'` specifies the initial learning rate as 0.03 and the L2 regularization factor as 0.0005, and instructs the software to drop the learning rate by a set factor every given number of epochs.


Path for saving the checkpoint networks, specified as the comma-separated pair consisting of `'CheckpointPath'` and a character vector.

• If you do not specify a path (i.e., `''`), then the software does not save any checkpoint networks.

• If you specify a path, then `trainNetwork` saves checkpoint networks to this path after every epoch. It automatically and uniquely names each network. You can then load any of these networks and resume training from that network.

The directory must already exist; create it before specifying it as the checkpoint path. If the path you specify does not exist, then `trainingOptions` returns an error.

Example: `'CheckpointPath','C:\Temp\checkpoint'`

Data Types: `char`
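The existence requirement can be handled in the training script itself. A minimal sketch; the folder location under `tempdir` is an arbitrary choice for this example:

```matlab
% Create the checkpoint folder before training if it does not exist.
% The folder location is an arbitrary choice for this sketch.
checkpointDir = fullfile(tempdir,'checkpoint');
if ~exist(checkpointDir,'dir')
    mkdir(checkpointDir)
end
options = trainingOptions('sgdm','CheckpointPath',checkpointDir);
```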

Hardware resource for `trainNetwork` to train the network, specified as the comma-separated pair consisting of `'ExecutionEnvironment'` and one of the following:

• `'auto'` — Use a GPU if one is available; otherwise, use the CPU.

• `'cpu'` — Use the CPU.

• `'gpu'` — Use the GPU.

• `'multi-gpu'` — Use multiple GPUs on one machine, using a local parallel pool. If no pool is already open, `trainNetwork` opens one with one worker per supported GPU device.

• `'parallel'` — Use a local parallel pool or compute cluster. If no pool is already open, `trainNetwork` opens one using the default cluster profile. If the pool has access to GPUs, then `trainNetwork` uses them and excess workers are left idle. If the pool does not have GPUs, then the training takes place on all cluster CPUs.

The `'gpu'`, `'multi-gpu'`, and `'parallel'` options require Parallel Computing Toolbox™. Additionally, to use a GPU, you must have a CUDA®-enabled NVIDIA® GPU with compute capability 3.0 or higher. If you choose one of these options and Parallel Computing Toolbox or a suitable GPU is not available, then `trainNetwork` returns an error.

To see an improvement in performance when training in parallel, you might need to increase `MiniBatchSize` to offset the communication overhead.

Example: `'ExecutionEnvironment','cpu'`

Data Types: `char`

Initial learning rate used for training, specified as the comma-separated pair consisting of `'InitialLearnRate'` and a positive scalar value. If the learning rate is too low, training takes a long time. If it is too high, training might reach a suboptimal result.

Example: `'InitialLearnRate',0.03`

Data Types: `single` | `double`

Option for dropping the learning rate during training, specified as the comma-separated pair consisting of `'LearnRateSchedule'` and one of the following:

• `'none'` — The learning rate remains constant throughout training.

• `'piecewise'` — The software updates the learning rate every certain number of epochs by multiplying it by a factor. Use the `LearnRateDropFactor` name-value pair argument to specify the value of this factor. Use the `LearnRateDropPeriod` name-value pair argument to specify the number of epochs between multiplications.

Example: `'LearnRateSchedule','piecewise'`

Factor for dropping the learning rate, specified as the comma-separated pair consisting of `'LearnRateDropFactor'` and a scalar value. This option is valid only when the value of `LearnRateSchedule` is `'piecewise'`.

`LearnRateDropFactor` is a multiplicative factor to apply to the learning rate every time a certain number of epochs has passed. You can specify the number of epochs using the `LearnRateDropPeriod` name-value pair argument.

Example: `'LearnRateDropFactor',0.02`

Data Types: `single` | `double`

Number of epochs for dropping the learning rate, specified as the comma-separated pair consisting of `'LearnRateDropPeriod'` and an integer value. This option is valid only when the value of `LearnRateSchedule` is `'piecewise'`.

The software multiplies the global learning rate with the drop factor every time this number of epochs passes. The drop factor is specified by the `LearnRateDropFactor` name-value pair argument.

Example: `'LearnRateDropPeriod',3`

Data Types: `single` | `double`
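To see how the piecewise schedule behaves, the effective learning rate in each epoch can be computed directly. A minimal sketch with illustrative values, assuming the rate is multiplied by the drop factor after every `LearnRateDropPeriod` full epochs:

```matlab
% Effective learning rate per epoch under a piecewise schedule.
% Illustrative values; assumes the rate is multiplied by the drop
% factor after every dropPeriod full epochs.
initialLearnRate = 0.1;
dropFactor = 0.2;
dropPeriod = 5;
epochs = 1:20;
learnRate = initialLearnRate * dropFactor.^floor((epochs-1)/dropPeriod);
% Epochs 1-5: 0.1, epochs 6-10: 0.02, epochs 11-15: 0.004, ...
```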

Factor for L2 regularizer (weight decay), specified as the comma-separated pair consisting of `'L2Regularization'` and a positive scalar value.

You can specify a multiplier for this L2 regularizer when creating the convolutional layer and fully connected layer.

Example: `'L2Regularization',0.0005`

Data Types: `single` | `double`

Maximum number of epochs to use for training, specified as the comma-separated pair consisting of `'MaxEpochs'` and an integer value.

An iteration is one step taken in the gradient descent algorithm toward minimizing the loss function using a mini-batch. An epoch is a full pass of the training algorithm over the entire training set.

Example: `'MaxEpochs',20`

Data Types: `single` | `double`

Size of the mini-batch to use for each training iteration, specified as the comma-separated pair consisting of `'MiniBatchSize'` and an integer value. A mini-batch is a subset of the training set that is used to evaluate the gradient of the loss function and update the weights. See Stochastic Gradient Descent with Momentum.

Example: `'MiniBatchSize',256`

Data Types: `single` | `double`

Contribution of the gradient step from the previous iteration to the current iteration of the training, specified as the comma-separated pair consisting of `'Momentum'` and a scalar value from 0 to 1. A value of 0 means no contribution from the previous step, whereas a value of 1 means maximal contribution from the previous step.

Example: `'Momentum',0.8`

Data Types: `single` | `double`

Option for data shuffling, specified as the comma-separated pair consisting of `'Shuffle'` and one of the following:

• `'once'` — The software shuffles the data once before training.

• `'never'` — The software does not shuffle the data.

Example: `'Shuffle','never'`

Indicator to display training progress information in the command window, specified as the comma-separated pair consisting of `'Verbose'` and either `1` (`true`) or `0` (`false`).

The displayed information includes the number of epochs, number of iterations, time elapsed, mini-batch accuracy, and base learning rate. When training a regression network, RMSE is shown instead of accuracy.

Example: `'Verbose',0`

Data Types: `logical`

Number of iterations between printouts to the command window. This option has an effect only when `'Verbose'` is set to `true`.

Data Types: `single` | `double`

Relative division of the load between parallel workers on GPUs or CPUs for the `'ExecutionEnvironment','multi-gpu'` or `'ExecutionEnvironment','parallel'` options, specified as a numeric vector. This vector must contain one value per worker in the parallel pool. For a vector $w$, each worker gets ${w}_{i}/\sum _{i}{w}_{i}$ of the work. Use this option to balance the workload between unevenly performing hardware.

Data Types: `double`

Custom output functions to call during training, specified as a function handle or cell array of function handles. After each iteration, `trainNetwork` calls the specified functions and passes a struct containing information from the current iteration via the following fields.

| Field | Description |
| --- | --- |
| `Epoch` | Current epoch number |
| `Iteration` | Current iteration number |
| `TimeSinceStart` | Time in seconds since the start of training |
| `TrainingLoss` | Current mini-batch loss |
| `BaseLearnRate` | Current base learning rate |
| `TrainingAccuracy` | Accuracy of the current mini-batch (classification networks) |
| `TrainingRMSE` | RMSE of the current mini-batch (regression networks) |
| `State` | Current training state: `"start"`, `"iteration"`, or `"done"` |

You can use custom output functions to display or plot progress information, or to stop training early. For an example showing how to plot training accuracy during training, see Plot Training Accuracy During Network Training. To stop training early, the function must return `true`. For an example showing how to stop training early, see Plot Progress and Stop Training at Specified Accuracy.

Data Types: `function_handle` | `cell`

## Output Arguments


Training options, returned as an object.

For the `sgdm` training solver, `options` is a `TrainingOptionsSGDM` object.

## Algorithms


### Initial Weights and Biases

The default for the initial weights is a Gaussian distribution with a mean of 0 and a standard deviation of 0.01. The default for the initial bias value is 0. You can manually change the initialization for the weights and biases. See Specify Initial Weight and Biases in Convolutional Layer and Specify Initial Weight and Biases in Fully Connected Layer.

### Stochastic Gradient Descent with Momentum

The gradient descent algorithm updates the parameters (weights and biases) so as to minimize the error function by taking small steps in the direction of the negative gradient of the loss function [1]:

`${\theta }_{\ell +1}={\theta }_{\ell }-\alpha \nabla E\left({\theta }_{\ell }\right),$`

where $\ell$ stands for the iteration number, $\alpha >0$ is the learning rate, $\theta$ is the parameter vector, and $E\left(\theta \right)$ is the loss function. The gradient of the loss function, $\nabla E\left(\theta \right)$, is evaluated using the entire training set, and the standard gradient descent algorithm uses the entire data set at once. The stochastic gradient descent algorithm evaluates the gradient, hence updates the parameters, using a subset of the training set. This subset is called a mini-batch.

Each evaluation of the gradient using the mini-batch is an iteration. At each iteration, the algorithm takes one step towards minimizing the loss function. The full pass of the training algorithm over the entire training set using mini-batches is an epoch. You can specify the mini-batch size and the maximum number of epochs using the `MiniBatchSize` and `MaxEpochs` name-value pair arguments, respectively.
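The relationship between observations, mini-batches, iterations, and epochs can be sketched with simple arithmetic. The training-set size below is illustrative, and the sketch assumes any partial final mini-batch is discarded:

```matlab
% Illustrative counts relating mini-batches, iterations, and epochs.
numObservations = 5000;   % illustrative training-set size
miniBatchSize = 300;
maxEpochs = 20;
iterationsPerEpoch = floor(numObservations/miniBatchSize);  % 16
totalIterations = iterationsPerEpoch*maxEpochs;             % 320
```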

The gradient descent algorithm might oscillate along the steepest descent path to the optimum. Adding a momentum term to the parameter update is one way to prevent this oscillation [2]. The SGD update with momentum is

`${\theta }_{\ell +1}={\theta }_{\ell }-\alpha \nabla E\left({\theta }_{\ell }\right)+\gamma \left({\theta }_{\ell }-{\theta }_{\ell -1}\right),$`

where $\gamma$ determines the contribution of the previous gradient step to the current iteration. You can specify this value using the `Momentum` name-value pair argument.
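A minimal numeric sketch of one SGDM update step, using illustrative stand-ins for the parameter vector, its previous value, and the mini-batch gradient:

```matlab
% One SGDM update, following the formula above.
alpha = 0.01;              % learning rate
gamma = 0.9;               % momentum
theta = [0.5; -0.2];       % illustrative current parameters
thetaPrev = [0.6; -0.1];   % illustrative previous parameters
gradE = [0.3; 0.4];        % illustrative mini-batch gradient
thetaNext = theta - alpha*gradE + gamma*(theta - thetaPrev);
```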

By default, the software shuffles the data once before training. You can change this setting using the `Shuffle` name-value pair argument.

### L2 Regularization

Adding a regularization term for the weights to the loss function $E\left(\theta \right)$ is one way to reduce the complexity of a neural network, and hence overfitting [1], [2]. The regularization term is also called weight decay. The loss function with the regularization term takes the form

`${E}_{R}\left(\theta \right)=E\left(\theta \right)+\lambda \Omega \left(w\right),$`

where $w$ is the weight vector, $\lambda$ is the regularization factor (coefficient), and the regularization function $\Omega \left(w\right)$ is

`$\Omega \left(w\right)=\frac{1}{2}{w}^{T}w.$`

Note that the biases are not regularized [2]. You can specify the regularization factor, $\lambda$, using the `L2Regularization` name-value pair argument.
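The regularized loss above can be sketched numerically for an illustrative weight vector and unregularized loss value:

```matlab
% Regularized loss E_R = E + lambda*Omega(w), with Omega(w) = w'*w/2.
lambda = 0.0005;           % regularization factor
w = [0.5; -1.2; 0.3];      % illustrative weight vector
E = 0.8;                   % illustrative unregularized loss
Omega = 0.5*(w'*w);        % = 0.89
ER = E + lambda*Omega;
```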

### Save Checkpoint Networks and Resume Training

`trainNetwork` enables you to save checkpoint networks as .mat files during training. You can then resume training from any of these checkpoint networks. If you want `trainNetwork` to save checkpoint networks, then you must specify a path using the `CheckpointPath` name-value pair argument in the call to `trainingOptions`. If the path you specify does not exist, then `trainingOptions` returns an error.

`trainNetwork` automatically assigns unique names to these checkpoint network files. For example, in `convnet_checkpoint__351__2016_11_09__12_04_23.mat`, 351 is the iteration number, 2016_11_09 is the date, and 12_04_23 is the time at which `trainNetwork` saved the network. You can load any of these files by double-clicking them or by typing, for example,

```matlab
load convnet_checkpoint__351__2016_11_09__12_04_23.mat
```

at the command line. You can then resume training by using the layers of this network in the call to `trainNetwork`, for example,

```matlab
trainNetwork(XTrain,YTrain,net.Layers,options)
```

You must manually specify the training options and the input data, because the checkpoint network does not contain this information.

## References

[1] Bishop, C. M. Pattern Recognition and Machine Learning. Springer, New York, NY, 2006.

[2] Murphy, K. P. Machine Learning: A Probabilistic Perspective. The MIT Press, Cambridge, Massachusetts, 2012.