
Tune Experiment Hyperparameters by Using Bayesian Optimization

This example shows how to use Bayesian optimization in Experiment Manager to find optimal network hyperparameters and training options for convolutional neural networks. Bayesian optimization provides an alternative strategy to sweeping hyperparameters in an experiment. You specify a range of values for each hyperparameter and select a metric to optimize, and Experiment Manager searches for a combination of hyperparameters that optimizes your selected metric. Bayesian optimization requires Statistics and Machine Learning Toolbox™.

In this example, you train a network to classify images from the CIFAR-10 data set. The experiment uses Bayesian optimization to find the combination of hyperparameters that minimizes a custom metric function. The hyperparameters include options of the training algorithm, as well as parameters of the network architecture itself. The custom metric function determines the classification error on a randomly chosen test set. For more information on defining custom metrics in Experiment Manager, see Evaluate Deep Learning Experiments by Using Metric Functions.

Alternatively, you can find optimal hyperparameter values programmatically by calling the bayesopt function. For more information, see Deep Learning Using Bayesian Optimization.

Open Experiment

First, open the example. Experiment Manager loads a project with a preconfigured experiment that you can inspect and run. To open the experiment, in the Experiment Browser pane, double-click the name of the experiment (BayesOptExperiment).

An experiment definition consists of a description, a table of hyperparameters, a setup function, and a collection of metric functions to evaluate the results of the experiment. Experiments that use Bayesian optimization include additional options to limit the duration of the experiment. For more information, see Configure Deep Learning Experiment.

The Description box contains a textual description of the experiment. For this example, the description is:

Find optimal hyperparameters and training options for convolutional neural network.
Hyperparameters determine the network section depth, initial learning rate,
stochastic gradient descent momentum, and L2 regularization strength.

The Hyperparameters section specifies the strategy (Bayesian Optimization) and hyperparameter options to use for the experiment. For each hyperparameter, specify these options:

  • Range — Enter a two-element vector that gives the lower bound and upper bound of a real- or integer-valued hyperparameter, or a string array or cell array that lists the possible values of a categorical hyperparameter.

  • Type — Select real (real-valued hyperparameter), integer (integer-valued hyperparameter), or categorical (categorical hyperparameter).

  • Transform — Select none (no transform) or log (logarithmic transform). For log, the hyperparameter must be real or integer and positive. The hyperparameter is searched and modeled on a logarithmic scale.

When you run the experiment, Experiment Manager searches for the best combination of hyperparameters. Each trial in the experiment uses a new combination of hyperparameter values based on the results of the previous trials. This example uses these hyperparameters:

  • SectionDepth — This parameter controls the depth of the network. The total number of layers in the network is 9*SectionDepth+7. In the experiment setup function, the number of convolutional filters in each layer is proportional to 1/sqrt(SectionDepth), so the number of parameters and the required amount of computation for each iteration are roughly the same for different section depths.

  • InitialLearnRate — The best learning rate can depend on your data as well as the network you are training.

  • Momentum — Stochastic gradient descent momentum adds inertia to the parameter updates by having the current update contain a contribution proportional to the update in the previous iteration. The inertial effect results in smoother parameter updates and a reduction of the noise inherent to stochastic gradient descent.

  • L2Regularization — Use L2 regularization to prevent overfitting. Search the space of regularization strength to find a good value. Data augmentation and batch normalization also help regularize the network.
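The scaling argument for SectionDepth can be checked with a few lines of arithmetic. This is only a sketch: the base filter count of 16, the 3-by-3 filter size, and the range of section depths are assumptions made for illustration, because the text gives only the proportionality 1/sqrt(SectionDepth) and the layer formula.

```matlab
% Assumed values (not stated in the text): base filter count, filter size,
% and the section depths to compare.
baseFilters = 16;
filterSize = 3;
sectionDepth = 1:3;

% Total number of layers, using the formula given for SectionDepth.
numLayers = 9*sectionDepth + 7;

% Filters per convolutional layer, proportional to 1/sqrt(SectionDepth).
numFilters = round(baseFilters ./ sqrt(sectionDepth));

% Approximate convolution weights in the first section, ignoring biases and
% the smaller first layer: depth * filterSize^2 * numFilters^2.
approxWeights = sectionDepth .* filterSize^2 .* numFilters.^2;

disp(table(sectionDepth', numLayers', numFilters', approxWeights', ...
    'VariableNames', {'SectionDepth','NumLayers','NumFilters','ApproxWeights'}))
```

Under these assumptions, the layer count grows from 16 to 34 while the approximate weight count stays within about 6% (2304, 2178, and 2187), because the depth factor cancels the squared 1/sqrt(SectionDepth) factor.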

Under Bayesian Optimization Options, you can limit the duration of the experiment by entering the maximum time (in seconds) and the maximum number of trials to run. To make full use of Bayesian optimization, perform at least 30 objective function evaluations.
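For comparison, the same kind of search can be sketched programmatically with the bayesopt function from Statistics and Machine Learning Toolbox. The ranges below are illustrative assumptions, not values read from the experiment definition, and the objective is a hypothetical placeholder that stands in for training a network and returning its error rate.

```matlab
% Search space: each variable has a range, a type, and an optional transform.
% All ranges here are assumed for illustration.
optimVars = [
    optimizableVariable('SectionDepth',[1 3],'Type','integer')
    optimizableVariable('InitialLearnRate',[1e-2 1],'Transform','log')
    optimizableVariable('Momentum',[0.8 0.98])
    optimizableVariable('L2Regularization',[1e-10 1e-2],'Transform','log')];

% Hypothetical placeholder objective. bayesopt passes a one-row table x.
% In practice, this function would train the network with the values in x
% and return the classification error.
objFcn = @(x) (log10(x.InitialLearnRate) + 1).^2 + ...
    (x.Momentum - 0.9).^2 + 0.01*x.SectionDepth;

% Limit the search by number of evaluations and by time in seconds,
% mirroring the two limits under Bayesian Optimization Options.
results = bayesopt(objFcn,optimVars, ...
    'MaxObjectiveEvaluations',30, ...
    'MaxTime',600, ...
    'IsObjectiveDeterministic',true, ...
    'Verbose',0, ...
    'PlotFcn',[]);

% Best hyperparameter combination found by the search.
best = bestPoint(results)
```

When the objective trains a network, the result varies from run to run, so setting 'IsObjectiveDeterministic' to false is the more appropriate choice.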

The Setup Function configures the training data, network architecture, and training options for the experiment. To inspect the setup function, under Setup Function, click Edit. The setup function opens in MATLAB® Editor.

In this example, the setup function has three sections.

  • Load Image Data downloads and extracts images and labels from the CIFAR-10 data set. The data set is about 175 MB. Depending on your internet connection, the download process can take some time. For the training data, this example creates an augmentedImageDatastore by applying random translations and horizontal reflections. Data augmentation helps prevent the network from overfitting and memorizing the exact details of the training images. To enable network validation, the example uses 5000 images with no augmentation. For more information on this data set, see Image Data Sets.

  • Define Network Architecture defines the architecture for a convolutional neural network for deep learning classification. In this example, the network to train has three sections, each with SectionDepth identical convolutional layers. Each convolutional layer is followed by a batch normalization layer and a ReLU layer. The convolutional layers have added padding so that their spatial output size is always the same as the input size. Between sections, max pooling layers downsample the spatial dimensions by a factor of two. To ensure that the amount of computation required in each convolutional layer is roughly the same, the number of filters increases by a factor of two from one section to the next. The number of filters in each convolutional layer is proportional to 1/sqrt(SectionDepth), so that networks of different depths have roughly the same number of parameters and require about the same amount of computation per iteration.

  • Specify Training Options defines a trainingOptions object for the experiment using the values for the training options 'InitialLearnRate', 'Momentum', and 'L2Regularization' generated by the Bayesian optimization algorithm. The example trains the network for a fixed number of epochs, validating once per epoch and lowering the learning rate by a factor of 10 during the last epochs to reduce the noise of the parameter updates and allow the network parameters to settle down closer to a minimum of the loss function.
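The architecture and training options described above can be sketched as follows. This is not the setup function shipped with the example: the hyperparameter values, the base filter count of 16, the pooling sizes, the 60-epoch schedule, and the helper name convBlock are all assumptions chosen to match the description. Validation data is omitted so that the sketch runs without downloading CIFAR-10.

```matlab
% Hypothetical hyperparameter values for a single trial.
params.SectionDepth = 2;
params.InitialLearnRate = 0.05;
params.Momentum = 0.9;
params.L2Regularization = 1e-4;

imageSize = [32 32 3];   % CIFAR-10 images are 32-by-32 RGB
numClasses = 10;         % CIFAR-10 has 10 classes
baseFilters = 16;        % assumed; the text gives only the proportionality
numF = round(baseFilters/sqrt(params.SectionDepth));

% Three sections of SectionDepth convolutional units. The number of filters
% doubles from one section to the next, and max pooling halves the spatial
% size between sections (32 -> 16 -> 8).
layers = [
    imageInputLayer(imageSize)
    convBlock(3,numF,params.SectionDepth)
    maxPooling2dLayer(3,'Stride',2,'Padding','same')
    convBlock(3,2*numF,params.SectionDepth)
    maxPooling2dLayer(3,'Stride',2,'Padding','same')
    convBlock(3,4*numF,params.SectionDepth)
    averagePooling2dLayer(8)
    fullyConnectedLayer(numClasses)
    softmaxLayer
    classificationLayer];

% Fixed number of epochs, with the learning rate lowered by a factor of 10
% for the last epochs (the epoch counts are assumed).
options = trainingOptions('sgdm', ...
    'InitialLearnRate',params.InitialLearnRate, ...
    'Momentum',params.Momentum, ...
    'L2Regularization',params.L2Regularization, ...
    'MaxEpochs',60, ...
    'LearnRateSchedule','piecewise', ...
    'LearnRateDropPeriod',40, ...
    'LearnRateDropFactor',0.1, ...
    'Shuffle','every-epoch', ...
    'Verbose',false);

function layers = convBlock(filterSize,numFilters,numConvLayers)
% One unit is convolution + batch normalization + ReLU. 'same' padding keeps
% the spatial output size equal to the input size.
unit = [
    convolution2dLayer(filterSize,numFilters,'Padding','same')
    batchNormalizationLayer
    reluLayer];
layers = repmat(unit,numConvLayers,1);
end
```

With this layout, numel(layers) equals 9*SectionDepth+7: nine layers per unit of section depth, plus the input layer, two max pooling layers, and the four closing layers.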

The Metrics section specifies optional functions that evaluate the results of the experiment. Experiment Manager evaluates these functions each time it finishes training the network. To inspect a metric function, select the name of the metric function and click Edit. The metric function opens in MATLAB Editor.

This example includes the custom metric function ErrorRate. This function selects 5000 test images and labels at random, evaluates the trained network on these images, and calculates the proportion of images that the network misclassifies.

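function metricOutput = ErrorRate(trialInfo)
% Return the classification error of the trained network on a random test subset.
datadir = tempdir;
[~,~,XTest,YTest] = loadCIFARData(datadir);

% Select 5000 test images and labels at random.
idx = randperm(numel(YTest),5000);
XTest = XTest(:,:,:,idx);
YTest = YTest(idx);

% Classify the images and compute the proportion that is misclassified.
YPredicted = classify(trialInfo.trainedNetwork,XTest);
metricOutput = 1 - mean(YPredicted == YTest);
end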

The Optimize and Direction fields indicate the metric that the Bayesian optimization algorithm uses as an objective function. For this experiment, Experiment Manager seeks to minimize the value of the ErrorRate metric.

Run Experiment

When you run the experiment, Experiment Manager searches for the best combination of hyperparameters with respect to the chosen metric. Each trial in the experiment uses a new combination of hyperparameter values based on the results of the previous trials. By default, Experiment Manager runs one trial at a time. If you have Parallel Computing Toolbox™, you can run multiple trials at the same time. For best results, before you run your experiment, start a parallel pool with as many workers as GPUs. For more information, see Use Experiment Manager to Train Networks in Parallel.

  • To run one trial of the experiment at a time, in the Experiment Manager toolstrip, click Run.

  • To run multiple trials at the same time, click Use Parallel and then Run. If there is no current parallel pool, Experiment Manager starts one using the default cluster profile. Experiment Manager then executes multiple simultaneous trials, depending on the number of parallel workers available.
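One way to start a pool with as many workers as GPUs before running the experiment is sketched below. It assumes Parallel Computing Toolbox and the default cluster profile, and it leaves any pool that is already open untouched.

```matlab
numGPUs = gpuDeviceCount;    % number of GPUs visible to MATLAB
pool = gcp('nocreate');      % current pool, or empty if none is open

% Start a pool only when at least one GPU is present and no pool exists.
if numGPUs > 0 && isempty(pool)
    pool = parpool(numGPUs);
end
```

If no GPU is available, the sketch does nothing, and Experiment Manager starts a pool with the default settings when you click Use Parallel and then Run.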

A table of results displays the metric function values for each trial. Experiment Manager indicates the trial with the optimal value for the selected metric. For example, in this experiment, the third trial produces the smallest error rate.

Evaluate Results

To test the best trial in your experiment, first select the row in the results table with the lowest error rate.

To display the confusion matrix for the selected trial, click Confusion Matrix.

To perform additional computations, export the trained network to the workspace. On the Experiment Manager toolstrip, click Export. In the dialog window, enter the name of a workspace variable for the exported network. The default name is trainedNetwork.

Use the exported network as the input to the helper function testSummary. This function evaluates the network in several ways:

  • It predicts the labels of the entire test set and calculates the test error. Because Experiment Manager determines the best network without exposing the network to the entire test set, the test error can be higher than the value of the custom metric ErrorRate.

  • It calculates the standard error (testErrorSE) and an approximate 95% confidence interval (testError95CI) of the generalization error rate by treating the classification of each image in the test set as an independent event with a certain probability of success. Using this assumption, the number of incorrectly classified images follows a binomial distribution. This method is often called the Wald method.

  • It displays some test images together with their predicted classes and the probabilities of those classes.

The function displays a summary of these statistics in the MATLAB Command Window.

******************************************
Test error rate: 0.1801
Standard error: 0.0038
95% confidence interval: [0.1726, 0.1876]
******************************************
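The Wald interval in this summary can be reproduced from the error rate alone. The following sketch assumes that the test set is the full CIFAR-10 test set of 10,000 images and uses the normal quantile 1.96. The variable names follow the ones mentioned above, but the code is an illustration, not the source of testSummary.

```matlab
testError = 0.1801;   % test error rate from the summary above
numTest = 10000;      % assumed size of the test set (CIFAR-10 test images)

% Standard error of a binomial proportion.
testErrorSE = sqrt(testError*(1 - testError)/numTest);

% Approximate 95% confidence interval (Wald method).
testError95CI = [testError - 1.96*testErrorSE, testError + 1.96*testErrorSE];

fprintf('Standard error: %.4f\n',testErrorSE);
fprintf('95%% confidence interval: [%.4f, %.4f]\n',testError95CI);
```

Under these assumptions, the computed standard error and interval round to the values shown in the summary.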

Close Experiment

In the Experiment Browser pane, right-click the name of the project and select Close Project. Experiment Manager closes all of the experiments and results contained in the project.
