Deep Learning Using Bayesian Optimization

This example shows how to apply Bayesian optimization to deep learning and find optimal network parameters and training options for convolutional neural networks.

To train a deep neural network, you must specify the neural network architecture, as well as options of the training algorithm. Selecting and tuning these parameters can be difficult and take time. Bayesian optimization is an algorithm well suited to optimizing internal parameters of classification and regression models. You can use Bayesian optimization to optimize functions that are nondifferentiable, discontinuous, and time-consuming to evaluate. The algorithm internally maintains a Gaussian process model of the objective function, and uses objective function evaluations to train this model.

This example shows how to:

  • Download and prepare the CIFAR-10 data set for network training. This data set is one of the most widely used data sets for testing image classification models.

  • Specify variables to optimize using Bayesian optimization. These variables are options of the training algorithm, as well as parameters of the network architecture itself.

  • Define the objective function, which takes the values of the optimization variables as inputs, specifies the network architecture and training options, trains and validates the network, and saves the trained network to disk. The objective function is defined at the end of this script.

  • Perform Bayesian optimization by minimizing the classification error on the validation set.

  • Load the best network from disk and evaluate it on the test set.

Prepare Data

Download the CIFAR-10 data set [1]. This data set contains 60,000 images, each of size 32-by-32 pixels with three color channels (RGB). The size of the whole data set is 175 MB. Depending on your internet connection, the download process can take some time.

cifar10DataDir = pwd;
url = 'https://www.cs.toronto.edu/~kriz/cifar-10-matlab.tar.gz';
helperCIFAR10Data.download(url,cifar10DataDir);
Downloading CIFAR-10 dataset (175 MB). This can take a while...done.

Load the CIFAR-10 images as training images and labels, and test images and labels. To enable network validation, use 5000 of the training images for validation.

[XTrain,YTrain,XTest,YTest] = helperCIFAR10Data.load(cifar10DataDir);

idx = randperm(numel(YTrain),5000);
XValidation = XTrain(:,:,:,idx);
XTrain(:,:,:,idx) = [];
YValidation = YTrain(idx);
YTrain(idx) = [];
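
The hold-out split above removes the validation images from the training arrays by assigning []. As a quick sanity check, the following minimal sketch repeats the same steps on small synthetic arrays and confirms that the two sets are disjoint and together cover the original data. The sizes and variable names here are illustrative and are not part of the example.

```matlab
% Hypothetical stand-in for the image array: 8 "images" of size 2-by-2-by-3.
numImages = 8;
numVal = 3;
X = rand(2,2,3,numImages);
Y = (1:numImages)';                 % use the image index as a fake label

idx = randperm(numImages,numVal);   % indices held out for validation
XVal = X(:,:,:,idx);
YVal = Y(idx);
X(:,:,:,idx) = [];                  % remove held-out images from the training set
Y(idx) = [];

% The two label sets should not overlap and should cover 1..numImages.
isDisjoint = isempty(intersect(Y,YVal));
coversAll = isequal(sort([Y(:); YVal(:)]),(1:numImages)');
fprintf('train: %d, validation: %d, disjoint: %d, complete: %d\n', ...
    numel(Y),numel(YVal),isDisjoint,coversAll);
```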

Display a sample of the training images.

figure;
idx = randperm(numel(YTrain),20);
for i = 1:numel(idx)
    subplot(4,5,i);
    imshow(XTrain(:,:,:,idx(i)));
end

Choose Variables to Optimize

Choose which variables to optimize using Bayesian optimization, and specify the ranges to search in. Also, specify whether the variables are integers and whether to search the interval in logarithmic space. Optimize the following variables:

  • Network depth. This parameter controls the depth of the network. The network has three sections, each with NetworkDepth identical convolutional layers, so the total number of convolutional layers is 3*NetworkDepth. The objective function later in the script sets the number of convolutional filters in each layer proportional to 1/sqrt(NetworkDepth). As a result, the number of parameters and the required amount of computation for each iteration are roughly the same for different network depths.

  • Initial learning rate. The best learning rate can depend on your data as well as the network you are training.

  • Stochastic gradient descent momentum. Momentum adds inertia to the parameter updates by having the current update contain a contribution proportional to the update in the previous iteration. This results in smoother parameter updates and reduces the noise inherent to stochastic gradient descent.

  • L2 regularization strength. Use regularization to prevent overfitting. Search the space of regularization strength to find a good value. Data augmentation and batch normalization also help regularize the network.

optimVars = [
    optimizableVariable('NetworkDepth',[1 3],'Type','integer')
    optimizableVariable('InitialLearnRate',[1e-3 1e-1],'Transform','log')
    optimizableVariable('Momentum',[0.8 0.95])
    optimizableVariable('L2Regularization',[1e-10 1e-2],'Transform','log')];
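
The 'Transform','log' setting asks for the interval to be searched in logarithmic space, so that each decade of the range receives comparable attention. As a rough illustration only, the sketch below compares plain uniform sampling in linear and in logarithmic space over the L2Regularization range. This is not the model-driven search that bayesopt performs, and the variable names are illustrative.

```matlab
lo = 1e-10;                      % lower bound of the L2Regularization range
hi = 1e-2;                       % upper bound of the L2Regularization range
n = 1e4;                         % number of random samples

% Uniform in linear space: almost every sample lands in the top decade.
linearSamples = lo + (hi - lo)*rand(n,1);

% Uniform in log10 space: every decade is equally likely.
logSamples = 10.^(log10(lo) + (log10(hi) - log10(lo))*rand(n,1));

% Fraction of samples below 1e-6, the midpoint of the range on a log scale.
fracLinear = mean(linearSamples < 1e-6);
fracLog = mean(logSamples < 1e-6);
fprintf('below 1e-6: linear %.4f, log %.4f\n',fracLinear,fracLog);
```

With logarithmic sampling, roughly half of the samples fall below 1e-6, whereas linear sampling almost never explores that part of the range.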

Perform Bayesian Optimization

Create the objective function for the Bayesian optimizer, using the training and validation data as inputs. The objective function trains a convolutional neural network and returns the classification error on the validation set. This function is defined at the end of this script. Because bayesopt uses the error rate on the validation set to choose the best model, it is possible that the final network overfits on the validation set. The final chosen model is then tested on the independent test set to estimate the generalization error.

ObjFcn = makeObjFcn(XTrain,YTrain,XValidation,YValidation);

Perform Bayesian optimization by minimizing the classification error on the validation set. Specify the maximum number of objective function evaluations, and set the maximum total optimization time to eight hours. To better utilize the power of Bayesian optimization, you should perform at least 30 objective function evaluations. To train networks in parallel on multiple GPUs, set the 'UseParallel' value to true. If you have a single GPU and set the 'UseParallel' value to true, then all workers share that GPU, and you obtain no training speed-up and increase the chances of the GPU running out of memory.

After each network finishes training, bayesopt prints the results to the command window. The objective function saves the trained networks to disk and returns the file names to bayesopt, which then returns them in BayesObject.UserDataTrace.

BayesObject = bayesopt(ObjFcn,optimVars,...
    'MaxObjectiveEvaluations',30,...
    'MaxTime',8*60*60,...
    'IsObjectiveDeterministic',false,...
    'UseParallel',false);
|===================================================================================================================================|
| Iter | Eval   | Objective   | Objective   | BestSoFar   | BestSoFar   | NetworkDepth | InitialLearn-|     Momentum | L2Regulariza-|
|      | result |             | runtime     | (observed)  | (estim.)    |              | Rate         |              | tion         |
|===================================================================================================================================|
|    1 | Best   |      0.2238 |      1219.5 |      0.2238 |      0.2238 |            3 |    0.0011905 |      0.84405 |   1.7848e-09 |
|    2 | Best   |      0.2168 |      854.05 |      0.2168 |     0.21732 |            2 |    0.0032103 |      0.93633 |   2.8194e-07 |
|    3 | Accept |      0.2598 |      592.47 |      0.2168 |      0.2168 |            1 |      0.06711 |      0.85864 |   0.00023755 |
|    4 | Best   |      0.2126 |      1014.2 |      0.2126 |     0.21435 |            2 |     0.011333 |      0.83823 |     0.002273 |
|    5 | Accept |      0.2854 |      1082.4 |      0.2126 |     0.21725 |            2 |     0.093251 |      0.80842 |    0.0078033 |
|    6 | Accept |      0.3036 |       749.5 |      0.2126 |     0.21261 |            1 |      0.00634 |      0.86205 |   4.9925e-10 |
|    7 | Best   |      0.2004 |      1471.8 |      0.2004 |     0.20041 |            3 |    0.0089328 |      0.84358 |   0.00060863 |
|    8 | Accept |       0.212 |      1468.3 |      0.2004 |     0.20041 |            3 |    0.0031215 |      0.88325 |    0.0001681 |
|    9 | Accept |      0.3102 |      742.12 |      0.2004 |     0.20041 |            1 |    0.0075539 |      0.80846 |   3.0432e-05 |
|   10 | Best   |      0.1948 |      1471.4 |      0.1948 |     0.19481 |            3 |     0.032527 |      0.85274 |   1.0716e-10 |
|   11 | Accept |      0.2026 |      1483.4 |      0.1948 |      0.1948 |            3 |     0.096873 |      0.80011 |    0.0007821 |
|   12 | Best   |      0.1856 |      1470.3 |      0.1856 |     0.18562 |            3 |     0.024357 |       0.9491 |   1.0253e-10 |
|   13 | Accept |      0.2004 |      1478.2 |      0.1856 |     0.19038 |            3 |     0.063631 |      0.94986 |   0.00046993 |
|   14 | Accept |      0.1906 |      1471.7 |      0.1856 |     0.19027 |            3 |      0.02099 |      0.92156 |   1.0042e-06 |
|   15 | Accept |      0.1964 |        1475 |      0.1856 |     0.19196 |            3 |     0.021181 |      0.88296 |   6.2176e-10 |
|   16 | Accept |      0.3366 |      735.77 |      0.1856 |     0.19226 |            1 |    0.0010001 |      0.94413 |   1.3067e-06 |
|   17 | Accept |      0.2574 |      1073.8 |      0.1856 |     0.19212 |            2 |    0.0010001 |      0.84967 |   2.2004e-05 |
|   18 | Accept |       0.244 |      747.75 |      0.1856 |     0.19189 |            1 |      0.09998 |      0.90032 |    1.241e-09 |
|   19 | Accept |      0.2454 |      1468.9 |      0.1856 |     0.19233 |            3 |     0.024573 |      0.92696 |    0.0073605 |
|   20 | Accept |      0.1958 |      1471.2 |      0.1856 |     0.19275 |            3 |     0.095612 |      0.92948 |   1.0602e-10 |
|===================================================================================================================================|
| Iter | Eval   | Objective   | Objective   | BestSoFar   | BestSoFar   | NetworkDepth | InitialLearn-|     Momentum | L2Regulariza-|
|      | result |             | runtime     | (observed)  | (estim.)    |              | Rate         |              | tion         |
|===================================================================================================================================|
|   21 | Accept |      0.2014 |      1467.3 |      0.1856 |     0.19415 |            3 |    0.0093495 |      0.87271 |   1.0269e-10 |
|   22 | Accept |      0.1886 |      1468.5 |      0.1856 |      0.1928 |            3 |     0.099819 |      0.86268 |   1.1226e-10 |
|   23 | Accept |      0.1906 |      1477.5 |      0.1856 |     0.19127 |            3 |     0.064742 |       0.8007 |   1.0202e-10 |
|   24 | Accept |      0.2016 |      1088.6 |      0.1856 |     0.19092 |            2 |     0.022023 |      0.94483 |   1.0303e-10 |

__________________________________________________________
Optimization completed.
MaxTime of 28800 seconds reached.
Total function evaluations: 24
Total elapsed time: 29061.1178 seconds.
Total objective function evaluation time: 29043.7828

Best observed feasible point:
    NetworkDepth    InitialLearnRate    Momentum    L2Regularization
    ____________    ________________    ________    ________________

         3              0.024357         0.9491        1.0253e-10   

Observed objective function value = 0.1856
Estimated objective function value = 0.19092
Function evaluation time = 1470.3366

Best estimated feasible point (according to models):
    NetworkDepth    InitialLearnRate    Momentum    L2Regularization
    ____________    ________________    ________    ________________

         3              0.064742         0.8007        1.0202e-10   

Estimated objective function value = 0.19092
Estimated function evaluation time = 1478.0471
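
The BestSoFar (observed) column in the display is the running minimum of the Objective column. The following minimal sketch reproduces it from the first 12 objective values copied from the table above and marks the iterations that the display labels as Best. The variable names are illustrative.

```matlab
% Objective values of iterations 1 to 12, copied from the display above.
objective = [0.2238 0.2168 0.2598 0.2126 0.2854 0.3036 ...
             0.2004 0.2120 0.3102 0.1948 0.2026 0.1856];

bestSoFar = cummin(objective);              % running minimum
[bestValue,bestIter] = min(objective);      % overall best and its iteration
isBest = [true, diff(bestSoFar) < 0];       % iterations that improved the minimum

disp(find(isBest))                          % 1 2 4 7 10 12
fprintf('best value %.4f at iteration %d\n',bestValue,bestIter);
```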

Evaluate Final Network

Load the best network found in the optimization and its validation error.

bestIdx = BayesObject.IndexOfMinimumTrace(end);
fileName = BayesObject.UserDataTrace{bestIdx};
savedStruct = load(fileName);
valError = savedStruct.valError
valError = 0.1856

Predict the labels of the test set and calculate the test error. Treat the classification of each image in the test set as an independent event with a certain probability of success, so that the number of incorrectly classified images follows a binomial distribution. Use this to calculate the standard error (testErrorSE) and an approximate 95% confidence interval (testError95CI) of the generalization error rate. This method is often called the Wald method. bayesopt determines the best network using the validation set without exposing the network to the test set, so the test error can be higher than the validation error.

[YPredicted,probs] = classify(savedStruct.trainedNet,XTest);
testError = 1 - mean(YPredicted == YTest)
testError = 0.1938
NTest = numel(YTest);
testErrorSE = sqrt(testError*(1-testError)/NTest);
testError95CI = [testError - 1.96*testErrorSE, testError + 1.96*testErrorSE]
testError95CI = 1×2

    0.1861    0.2015
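
The same calculation can be wrapped in a small helper to see how the interval width depends on the number of test images. This is a sketch: the function handle name waldCI is hypothetical, and 10,000 is the size of the standard CIFAR-10 test set.

```matlab
% Wald (normal-approximation) interval for an error rate p estimated from
% n independent trials, with z the normal quantile (1.96 for about 95%).
waldCI = @(p,n,z) [p - z*sqrt(p*(1-p)/n), p + z*sqrt(p*(1-p)/n)];

ci = waldCI(0.1938,10000,1.96);      % same inputs as the result above
ciSmall = waldCI(0.1938,1000,1.96);  % same error rate, ten times fewer images

fprintf('n = 10000: [%.4f, %.4f]\n',ci);
fprintf('n =  1000: [%.4f, %.4f]\n',ciSmall);
```

Because the standard error scales with 1/sqrt(n), the interval for 1000 images is about sqrt(10) times wider than the one for 10,000 images.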

Calculate the confusion matrix for the test data and display it as a heatmap. The highest confusion is between cats and dogs.

figure
[cmat,classNames] = confusionmat(YTest,YPredicted);
h = heatmap(classNames,classNames,cmat);
xlabel('Predicted Class');
ylabel('True Class');
title('Confusion Matrix');
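
To find the most frequent error programmatically rather than by reading the heatmap, you can zero the diagonal of the confusion matrix and take the maximum. The sketch below uses a small hypothetical 3-class matrix and a cell array of class names; the numbers are made up for illustration and are not results from this example.

```matlab
% Hypothetical confusion matrix (rows: true class, columns: predicted class).
classNames = {'cat','dog','ship'};
cmat = [80 15  5;
        20 75  5;
         3  2 95];

offDiag = cmat - diag(diag(cmat));           % keep only the misclassifications
[count,linearIdx] = max(offDiag(:));
[trueIdx,predIdx] = ind2sub(size(cmat),linearIdx);

fprintf('Most frequent error: true %s predicted as %s (%d images)\n', ...
    classNames{trueIdx},classNames{predIdx},count);
```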

Display some test images together with their predicted classes and the probabilities of those classes.

figure
idx = randperm(numel(YTest),9);
for i = 1:numel(idx)
    subplot(3,3,i)
    imshow(XTest(:,:,:,idx(i)));
    prob = num2str(100*max(probs(idx(i),:)),3);
    predClass = char(YPredicted(idx(i)));
    label = [predClass,', ',prob,'%'];
    title(label)
end

Objective Function for Optimization

Define the objective function for optimization. This function performs the following steps:

  1. Takes the values of the optimization variables as inputs. bayesopt calls the objective function with the current values of the optimization variables in a table with each column name equal to the variable name. For example, the current value of the network depth is optVars.NetworkDepth.

  2. Defines the network architecture and training options.

  3. Trains and validates the network.

  4. Saves the trained network, the validation error, and the training options to disk.

  5. Returns the validation error and the file name of the saved network.

function ObjFcn = makeObjFcn(XTrain,YTrain,XValidation,YValidation)
ObjFcn = @valErrorFun;
    function [valError,cons,fileName] = valErrorFun(optVars)

Define the convolutional neural network architecture.

  • Add padding to the convolutional layers so that the spatial output size is always the same as the input size.

  • Each time you down-sample the spatial dimensions by a factor of two using max pooling layers, increase the number of filters by a factor of two. Doing so ensures that the amount of computation required in each convolutional layer is roughly the same.

  • Choose the number of filters proportional to 1/sqrt(NetworkDepth), so that networks of different depths have roughly the same number of parameters and require about the same amount of computation per iteration. To increase the number of network parameters and the overall network flexibility, increase initialNumFilters. To train even deeper networks, change the range of the NetworkDepth variable.

  • Use convBlock(filterSize,numFilters,numConvLayers) to create a block of numConvLayers convolutional layers, each with a specified filterSize and numFilters filters, and each followed by a batch normalization layer and a ReLU layer. The convBlock function is defined at the end of this example.

        imageSize = [32 32 3];
        numClasses = numel(unique(YTrain));
        initialNumFilters = round(16/sqrt(optVars.NetworkDepth));
        layers = [
            imageInputLayer(imageSize)
            
            % The spatial input and output sizes of these convolutional
            % layers are 32-by-32, and the following max pooling layer
            % reduces this to 16-by-16.
            convBlock(3,initialNumFilters,optVars.NetworkDepth)
            maxPooling2dLayer(2,'Stride',2)
            
            % The spatial input and output sizes of these convolutional
            % layers are 16-by-16, and the following max pooling layer
            % reduces this to 8-by-8.
            convBlock(3,2*initialNumFilters,optVars.NetworkDepth)
            maxPooling2dLayer(2,'Stride',2)
            
            % The spatial input and output sizes of these convolutional
            % layers are 8-by-8. The global average pooling layer averages
            % over the 8-by-8 inputs, giving an output of size
            % 1-by-1-by-4*initialNumFilters. With a global average
            % pooling layer, the final classification output is only
            % sensitive to the total amount of each feature present in the
            % input image, but insensitive to the spatial positions of the
            % features.
            convBlock(3,4*initialNumFilters,optVars.NetworkDepth)
            averagePooling2dLayer(8)
            
            % Add the fully connected layer and the final softmax and
            % classification layers.
            fullyConnectedLayer(numClasses)
            softmaxLayer
            classificationLayer];

Specify options for network training. Optimize the initial learning rate, SGD momentum, and L2 regularization strength.

Specify validation data and choose the 'ValidationFrequency' value such that trainNetwork validates the network once per epoch. Train for a fixed number of epochs and lower the learning rate by a factor of 10 for the last five epochs. This reduces the noise of the parameter updates and lets the network parameters settle down closer to a minimum of the loss function.

        miniBatchSize = 128;
        validationFrequency = floor(numel(YTrain)/miniBatchSize);
        options = trainingOptions('sgdm',...
            'InitialLearnRate',optVars.InitialLearnRate,...
            'Momentum',optVars.Momentum,...
            'MaxEpochs',40, ...
            'LearnRateSchedule','piecewise',...
            'LearnRateDropPeriod',35,...
            'LearnRateDropFactor',0.1,...
            'MiniBatchSize',miniBatchSize,...
            'L2Regularization',optVars.L2Regularization,...
            'Shuffle','every-epoch',...
            'Verbose',false,...
            'Plots','training-progress',...
            'ValidationData',{XValidation,YValidation},...
            'ValidationPatience',Inf,...
            'ValidationFrequency',validationFrequency);
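
As a rough check of these settings, the following standalone sketch computes the number of iterations per epoch and the learning rate in each epoch. It assumes 45,000 training images (the 50,000 CIFAR-10 training images minus the 5000 held out for validation), a hypothetical initial learning rate of 0.01, and that the piecewise schedule multiplies the rate by the drop factor after each completed drop period.

```matlab
numTrain = 45000;                  % assumption: 50,000 images minus 5000 for validation
miniBatchSize = 128;
iterationsPerEpoch = floor(numTrain/miniBatchSize);   % equals the validation frequency

initialLearnRate = 0.01;           % hypothetical value inside the search range
dropPeriod = 35;
dropFactor = 0.1;
maxEpochs = 40;

% Learning rate used in each epoch under the assumed piecewise schedule.
epochs = 1:maxEpochs;
learnRate = initialLearnRate*dropFactor.^floor((epochs - 1)/dropPeriod);

fprintf('iterations per epoch: %d\n',iterationsPerEpoch);
fprintf('epochs 1-35: %g, epochs 36-40: %g\n',learnRate(35),learnRate(36));
```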

Use data augmentation to randomly flip the training images along the vertical axis, and randomly translate them up to four pixels horizontally and vertically. Data augmentation helps prevent the network from overfitting and memorizing the exact details of the training images.

        pixelRange = [-4 4];
        imageAugmenter = imageDataAugmenter(...
            'RandXReflection',true,...
            'RandXTranslation',pixelRange,...
            'RandYTranslation',pixelRange);
        datasource = augmentedImageDatastore(imageSize,XTrain,YTrain,...
            'DataAugmentation',imageAugmenter,...
            'OutputSizeMode','randcrop');
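
To make the effect of these augmentations concrete, the standalone sketch below applies a random left-right flip and a random integer shift of up to four pixels to a single synthetic image using only base functions. It is an illustration under simplifying assumptions (integer shifts, zero fill outside the image) and is not how imageDataAugmenter is implemented; all variable names are illustrative.

```matlab
imageSize = [32 32 3];
img = reshape(1:prod(imageSize),imageSize);   % hypothetical image, all pixels nonzero

% Random left-right reflection with probability 0.5.
if rand < 0.5
    img = flip(img,2);
end

% Random integer shifts in [-4, 4]; positive values move content right or down.
dx = randi([-4 4]);
dy = randi([-4 4]);

% Copy the part of the image that stays in view into a zero-filled canvas.
rowsSrc = max(1,1-dy):min(imageSize(1),imageSize(1)-dy);
colsSrc = max(1,1-dx):min(imageSize(2),imageSize(2)-dx);
augmented = zeros(imageSize);
augmented(rowsSrc+dy,colsSrc+dx,:) = img(rowsSrc,colsSrc,:);

fprintf('shift dx = %d, dy = %d, pixels kept: %d of %d\n', ...
    dx,dy,nnz(augmented),numel(img));
```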

Train the network and plot the training progress during training. Close all training plots after training finishes.

        trainedNet = trainNetwork(datasource,layers,options);
        close(findall(groot,'Tag','NNET_CNN_TRAININGPLOT_FIGURE'))

Evaluate the trained network on the validation set, calculate the predicted image labels, and calculate the error rate on the validation data.

        YPredicted = classify(trainedNet,XValidation);
        valError = 1 - mean(YPredicted == YValidation);

Create a file name containing the validation error, and save the network, validation error, and training options to disk. The objective function returns fileName as an output argument, and bayesopt returns all the file names in BayesObject.UserDataTrace. The additional required output argument cons specifies constraints among the variables. There are no variable constraints.

        fileName = num2str(valError) + ".mat";
        save(fileName,'trainedNet','valError','options')
        cons = [];
        
    end
end

The convBlock function creates a block of numConvLayers convolutional layers, each with a specified filterSize and numFilters filters, and each followed by a batch normalization layer and a ReLU layer.

function layers = convBlock(filterSize,numFilters,numConvLayers)
layers = [
    convolution2dLayer(filterSize,numFilters,'Padding','same')
    batchNormalizationLayer
    reluLayer];
layers = repmat(layers,numConvLayers,1);
end
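
To see how the 1/sqrt(NetworkDepth) rule affects network size, the following standalone sketch counts the 3-by-3 convolution weights for each depth in the search range. It mirrors the three-section layout of the objective function but ignores biases, batch normalization parameters, and the fully connected layer, so treat the numbers as a rough estimate. The variable names are illustrative.

```matlab
inputChannels = 3;                 % RGB input
weights = zeros(1,3);              % conv weight count for depths 1, 2, and 3

for depth = 1:3
    numFilters = round(16/sqrt(depth));     % same rule as initialNumFilters
    sectionFilters = numFilters*[1 2 4];    % filters double after each pooling
    channelsIn = inputChannels;
    for s = 1:3                             % three sections
        for k = 1:depth                     % depth conv layers per section
            weights(depth) = weights(depth) + 3*3*channelsIn*sectionFilters(s);
            channelsIn = sectionFilters(s);
        end
    end
    fprintf('depth %d: %2d initial filters, %d conv weights\n', ...
        depth,numFilters,weights(depth));
end
```

Under these assumptions the counts are 23472, 34056, and 38151 for depths 1, 2, and 3, so tripling the depth increases the weight count by well under a factor of two.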

References

[1] Krizhevsky, Alex. "Learning multiple layers of features from tiny images." (2009). https://www.cs.toronto.edu/~kriz/learning-features-2009-TR.pdf
