Training options for stochastic gradient descent with momentum
Class comprising training options for stochastic gradient descent with momentum, such as learning rate information, L2 regularization factor, and mini-batch size.
options = trainingOptions(solverName) returns a set of training options for the solver specified by solverName.

options = trainingOptions(solverName,Name,Value) returns a set of training options, with additional options specified by one or more Name,Value pair arguments.

For more information on the name-value pair arguments, see trainingOptions.
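For example, a minimal call that accepts the defaults for the 'sgdm' solver (the only solver described on this page) looks like the following sketch:

% Create a default set of training options for stochastic
% gradient descent with momentum.
options = trainingOptions('sgdm');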
Momentum— Contribution of the previous gradient step
Contribution of the gradient step from the previous iteration to the current iteration of the training. A value of 0 means no contribution from the previous step, whereas a value of 1 means maximal contribution.
InitialLearnRate— Initial learning rate
Initial learning rate used for training, stored as a scalar value. If the learning rate is too low, the training takes a long time, but if it is too high the training might reach a suboptimal result.
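As a sketch of how these two options might be set together in one call (the values 0.01 and 0.95 are illustrative choices, not recommendations):

% Specify an illustrative initial learning rate and momentum contribution.
options = trainingOptions('sgdm', ...
    'InitialLearnRate',0.01, ...
    'Momentum',0.95);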
LearnRateScheduleSettings— Settings for learning rate schedule, specified by the user
Settings for the learning rate schedule, specified by the user, stored as a structure. LearnRateScheduleSettings always has the following field:
Method — Name of the method
for adjusting the learning rate. Possible names are:
'fixed' — the software does
not alter the learning rate during training.
'piecewise' — the learning
rate drops periodically during training.
If Method is 'piecewise', then LearnRateScheduleSettings contains two more fields, shown in the sketch after this list:
DropRateFactor — The multiplicative
factor by which to drop the learning rate during training.
DropPeriod — The number of epochs that should pass between adjustments to the learning rate.
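As an illustration, the following sketch requests a piecewise schedule and inspects the resulting structure (the drop factor and period values are arbitrary):

% Request a piecewise learning rate schedule and inspect the
% resulting LearnRateScheduleSettings structure.
options = trainingOptions('sgdm', ...
    'LearnRateSchedule','piecewise', ...
    'LearnRateDropFactor',0.5, ...
    'LearnRateDropPeriod',10);
options.LearnRateScheduleSettings  % structure with Method, DropRateFactor, and DropPeriod fields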
L2Regularization— Factor for L2 regularizer
Factor for L2 regularizer, stored as a scalar value. Each set of parameters in a layer can specify a multiplier for the L2 regularizer.
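A sketch of setting the global factor follows; the value 0.0005 is illustrative, and the per-layer line assumes a convolutional layer created with convolution2dLayer and its WeightL2Factor option, neither of which is documented on this page:

% Set the global L2 regularization factor (illustrative value).
options = trainingOptions('sgdm', ...
    'L2Regularization',0.0005);

% Assumed per-layer multiplier: this layer's weights would use
% twice the global L2 regularization factor.
layer = convolution2dLayer(5, 20, 'WeightL2Factor', 2);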
MaxEpochs— Maximum number of epochs
Maximum number of epochs to use for training, stored as an integer value.
MiniBatchSize— Size of the mini-batch
Size of the mini-batch to use for each training iteration, stored as an integer value.
Verbose— Indicator to display the information on the training progress
Indicator to display the information on the training progress in the command window, stored as either 1 (true) or 0 (false). The displayed information includes the number of epochs, number of iterations, time elapsed, mini-batch accuracy, and base learning rate.
CheckpointPath— Path where checkpoint networks are saved
Path where checkpoint networks are saved, stored as a character vector.
ExecutionEnvironment— Hardware to use for training the network
Hardware to use for training the network, stored as a character vector.
WorkerLoad— Relative division of load between parallel workers on different hardware
Relative division of load between parallel workers on different hardware, stored as a numeric vector.
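A sketch combining these options follows; the checkpoint folder is a placeholder path, and whether ExecutionEnvironment, Verbose, and CheckpointPath can all be set as name-value pair arguments in this way may depend on the release:

% Train on a single GPU, print progress to the command window,
% and save checkpoint networks to a placeholder folder.
options = trainingOptions('sgdm', ...
    'ExecutionEnvironment','gpu', ...
    'Verbose',true, ...
    'CheckpointPath','C:\checkpoints');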
Value. To learn how value classes affect copy operations, see Copying Objects (MATLAB) in the MATLAB® documentation.
Create a set of options for training with stochastic gradient descent with momentum. The learning rate will be reduced by a factor of 0.2 every 5 epochs. The training will last for 20 epochs, and each iteration will use a mini-batch with 300 observations.
options = trainingOptions('sgdm', ...
    'LearnRateSchedule','piecewise', ...
    'LearnRateDropFactor',0.2, ...
    'LearnRateDropPeriod',5, ...
    'MaxEpochs',20, ...
    'MiniBatchSize',300);
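The resulting options object is typically passed to a training function. The following line is a sketch only: XTrain, YTrain, and layers are placeholder names assumed to hold the training data, labels, and layer array.

% Train a network using the options defined above.
net = trainNetwork(XTrain, YTrain, layers, options);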
The gradient descent algorithm updates the parameters (weights and biases) so as to minimize the error function by taking small steps in the direction of the negative gradient of the loss function, $\nabla E(\theta)$:

$$\theta_{\ell+1} = \theta_{\ell} - \alpha \nabla E(\theta_{\ell}),$$

where $\ell$ stands for the iteration number, $\alpha > 0$ is the learning rate, $\theta$ is the parameter vector, and $E(\theta)$ is the loss function. The gradient of the loss function, $\nabla E(\theta)$, is evaluated using the entire training set, and the standard gradient descent algorithm uses the entire data set at once. The stochastic gradient descent algorithm evaluates the gradient, and hence updates the parameters, using a subset of the training set at each iteration. This subset is called a mini-batch.
Each evaluation of the gradient using the mini-batch is an iteration. At each iteration, the algorithm takes one step towards minimizing the loss function. A full pass of the training algorithm over the entire training set using mini-batches is an epoch. You can specify the mini-batch size and the maximum number of epochs using the MiniBatchSize and MaxEpochs name-value pair arguments, respectively.
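For example, if the training set contains 1500 observations and MiniBatchSize is 300, then each epoch consists of 5 iterations, and setting MaxEpochs to 20 corresponds to 100 training iterations.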
The gradient descent algorithm might oscillate along the steepest descent path to the optimum. Adding a momentum term to the parameter update is one way to prevent this oscillation. The SGD update with momentum is

$$\theta_{\ell+1} = \theta_{\ell} - \alpha \nabla E(\theta_{\ell}) + \gamma (\theta_{\ell} - \theta_{\ell-1}),$$

where $\gamma$ determines the contribution of the previous gradient step to the current iteration. You can specify this value using the Momentum name-value pair argument.
 Bishop, C. M. Pattern Recognition and Machine Learning. Springer, New York, NY, 2006.
 Murphy, K. P. Machine Learning: A Probabilistic Perspective. The MIT Press, Cambridge, Massachusetts, 2012.