The first step of creating and training a new convolutional neural network (ConvNet) is to define the network architecture. This topic explains the details of ConvNet layers, and the order they appear in a ConvNet. For a complete list of deep learning layers and how to create them, see List of Deep Learning Layers. To learn about LSTM networks for sequence classification and regression, see Long Short-Term Memory Networks. To learn how to create your own custom layers, see Define Custom Deep Learning Layers.
The network architecture can vary depending on the types and numbers of layers included. The types and number of layers included depend on the particular application or data. For example, classification networks typically have a softmax layer and a classification layer, whereas regression networks must have a regression layer at the end of the network. A smaller network with only one or two convolutional layers might be sufficient to learn from a small number of grayscale images. On the other hand, for more complex data with millions of color images, you might need a more complicated network with multiple convolutional and fully connected layers.
To specify the architecture of a deep network with all layers connected sequentially, create an array of layers directly. For example, to create a deep network which classifies 28-by-28 grayscale images into 10 classes, specify the layer array
layers = [
    imageInputLayer([28 28 1])
    convolution2dLayer(3,16,'Padding',1)
    batchNormalizationLayer
    reluLayer
    maxPooling2dLayer(2,'Stride',2)
    convolution2dLayer(3,32,'Padding',1)
    batchNormalizationLayer
    reluLayer
    fullyConnectedLayer(10)
    softmaxLayer
    classificationLayer];
layers is an array of Layer objects. You can then use layers as an input to the training function trainNetwork.
To specify the architecture of a network where layers can have multiple inputs or outputs, use a LayerGraph object.
Create an image input layer using imageInputLayer.
An image input layer inputs images to a network and applies data normalization.
Specify the image size using the inputSize argument. The size of an image corresponds to the height, width, and the number of color channels of that image. For example, for a grayscale image, the number of channels is 1, and for a color image it is 3.
A 2-D convolutional layer applies sliding convolutional filters to 2-D input. Create a 2-D convolutional layer using convolution2dLayer.
The convolutional layer consists of various components.
A convolutional layer consists of neurons that connect to subregions of the input images or the outputs of the previous layer. The layer learns the features localized by these regions while scanning through an image. When creating a layer using the convolution2dLayer function, you can specify the size of these regions using the filterSize input argument.
For each region, the
trainNetwork function computes a dot product of the
weights and the input, and then adds a bias term. A set of weights that is applied to a
region in the image is called a filter. The filter moves along the
input image vertically and horizontally, repeating the same computation for each region. In
other words, the filter convolves the input.
This image shows a 3-by-3 filter scanning through the input. The lower map represents the input and the upper map represents the output.
The step size with which the filter moves is called a stride. You can specify the step size with the Stride name-value pair argument. The local regions that the neurons connect to can overlap depending on the filter size and the stride.
This image shows a 3-by-3 filter scanning through the input with a stride of 2. The lower map represents the input and the upper map represents the output.
The number of weights in a filter is h * w * c, where h is the height and w is the width of the filter, and c is the number of channels in the input. For example, if the input is a color image, the number of color channels is 3.
The number of filters determines the number of channels in the output of a convolutional layer. Specify the number of filters using the numFilters argument with the convolution2dLayer function.
A dilated convolution is a convolution in which the filters are expanded by spaces inserted between the elements of the filter. Specify the dilation factor using the 'DilationFactor' name-value pair argument.
Use dilated convolutions to increase the receptive field (the area of the input which the layer can see) of the layer without increasing the number of parameters or computation.
The layer expands the filters by inserting zeros between each filter element. The dilation factor determines the step size for sampling the input or, equivalently, the upsampling factor of the filter. It corresponds to an effective filter size of (Filter Size – 1) .* Dilation Factor + 1. For example, a 3-by-3 filter with the dilation factor [2 2] is equivalent to a 5-by-5 filter with zeros between the elements.
This image shows a 3-by-3 filter dilated by a factor of two scanning through the input. The lower map represents the input and the upper map represents the output.
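The effective filter size formula can be checked with a few lines of base MATLAB. This is only an illustrative sketch: the weights come from magic(3) as a stand-in, and the variable names are not part of any toolbox API.

```matlab
% Effective size of a 3-by-3 filter with a dilation factor of [2 2].
filterSize     = [3 3];
dilationFactor = [2 2];
effectiveSize  = (filterSize - 1).*dilationFactor + 1;   % [5 5]

% Build the equivalent 5-by-5 filter by placing the original weights
% on every second row and column and leaving zeros in between.
w = magic(3);                       % stand-in weights (hypothetical values)
dilated = zeros(effectiveSize);
dilated(1:dilationFactor(1):end, 1:dilationFactor(2):end) = w;
disp(dilated)
```

The dilated filter still has only nine nonzero weights, which is why the receptive field grows without adding parameters.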
As a filter moves along the input, it uses the same set of weights and the same bias for the convolution, forming a feature map. Each feature map is the result of a convolution using a different set of weights and a different bias. Hence, the number of feature maps is equal to the number of filters. The total number of parameters in a convolutional layer is ((h*w*c + 1)*Number of Filters), where 1 is the bias.
You can also apply padding to input image borders vertically and horizontally using the 'Padding' name-value pair argument. Padding is values appended to the borders of the input to increase its size. By adjusting the padding, you can control the output size of the layer.
This image shows a 3-by-3 filter scanning through the input with padding of size 1. The lower map represents the input and the upper map represents the output.
The output height and width of a convolutional layer is (Input Size – ((Filter Size – 1)*Dilation Factor + 1) + 2*Padding)/Stride + 1. This value must be an integer for the whole image to be fully covered. If the combination of these options does not lead the image to be fully covered, the software by default ignores the remaining part of the image along the right and bottom edges in the convolution.
The product of the output height and width gives the total number of neurons in a feature map, say Map Size. The total number of neurons (output size) in a convolutional layer is Map Size*Number of Filters.
For example, suppose that the input image is a 32-by-32-by-3 color image. For a convolutional layer with eight filters and a filter size of 5-by-5, the number of weights per filter is 5 * 5 * 3 = 75, and the total number of parameters in the layer is (75 + 1) * 8 = 608. If the stride is 2 in each direction and padding of size 2 is specified, then each feature map is 16-by-16. This is because (32 – 5 + 2 * 2)/2 + 1 = 16.5, and some of the outermost padding to the right and bottom of the image is discarded. Finally, the total number of neurons in the layer is 16 * 16 * 8 = 2048.
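The arithmetic in this example can be reproduced in base MATLAB. The sketch below uses illustrative variable names and assumes a dilation factor of 1; it uses floor to model the discarded remainder described above.

```matlab
% Worked example: 32-by-32-by-3 input, eight 5-by-5 filters,
% stride 2, padding 2, no dilation.
inputSize   = 32;
numChannels = 3;
filterSize  = 5;
numFilters  = 8;
stride      = 2;
padding     = 2;
dilation    = 1;   % assumed: no dilation

weightsPerFilter = filterSize^2 * numChannels;            % 75
numParameters    = (weightsPerFilter + 1) * numFilters;   % 608, the +1 is the bias

effectiveFilter = (filterSize - 1)*dilation + 1;
rawSize = (inputSize - effectiveFilter + 2*padding)/stride + 1;   % 16.5
mapSize = floor(rawSize);                                 % 16, remainder is ignored

numNeurons = mapSize^2 * numFilters;                      % 2048
fprintf('%d parameters, %d-by-%d feature maps, %d neurons\n', ...
    numParameters, mapSize, mapSize, numNeurons);
```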
Usually, the results from these neurons pass through some form of nonlinearity, such as rectified linear units (ReLU).
You can adjust the learning rates and regularization options
for the layer using name-value pair arguments while defining the convolutional layer. If you
choose not to specify these options, then
trainNetwork uses the global
training options defined with the
trainingOptions function. For details on
global and layer training options, see Set Up Parameters and Train Convolutional Neural Network.
A convolutional neural network can consist of one or multiple convolutional layers. The number of convolutional layers depends on the amount and complexity of the data.
Create a batch normalization layer using batchNormalizationLayer.
A batch normalization layer normalizes a mini-batch of data across all observations for each channel independently. To speed up training of the convolutional neural network and reduce the sensitivity to network initialization, use batch normalization layers between convolutional layers and nonlinearities, such as ReLU layers.
The layer first normalizes the activations of each channel by subtracting the mini-batch mean and dividing by the mini-batch standard deviation. Then, the layer shifts the input by a learnable offset β and scales it by a learnable scale factor γ. β and γ are themselves learnable parameters that are updated during network training.
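As a rough sketch of this computation (not the toolbox implementation), the per-channel normalization can be written in base MATLAB. The data is made up, the small constant epsilon is an assumption added for numerical stability, and the vector-dimension syntax of mean and var requires R2018b or later.

```matlab
rng(0);
X = 5 + 2*randn(4,4,3,8);      % H-by-W-by-C-by-N mini-batch (made-up data)
epsilon = 1e-5;                % assumed stabilizing constant
scale   = ones(1,1,3);         % gamma, one value per channel
offset  = zeros(1,1,3);        % beta, one value per channel

% Statistics are taken over height, width, and observations,
% so each channel gets its own mean and variance.
mu     = mean(X, [1 2 4]);
sigma2 = var(X, 1, [1 2 4]);

Xhat = (X - mu)./sqrt(sigma2 + epsilon);   % normalize each channel
Y    = scale.*Xhat + offset;               % apply learnable scale and offset
```

With scale equal to one and offset equal to zero, each channel of Y has approximately zero mean and unit variance over the mini-batch.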
Batch normalization layers normalize the activations and gradients propagating through a neural network, making network training an easier optimization problem. To take full advantage of this fact, you can try increasing the learning rate. Since the optimization problem is easier, the parameter updates can be larger and the network can learn faster. You can also try reducing the L2 and dropout regularization. With batch normalization layers, the activations of a specific image during training depend on which images happen to appear in the same mini-batch. To take full advantage of this regularizing effect, try shuffling the training data before every training epoch. To specify how often to shuffle the data during training, use the 'Shuffle' name-value pair argument of trainingOptions.
Create a ReLU layer using reluLayer.
Convolutional and batch normalization layers are usually followed by a nonlinear activation function such as a rectified linear unit (ReLU), specified by a ReLU layer. A ReLU layer performs a threshold operation on each element of the input, where any value less than zero is set to zero, that is,
$$f(x) = \begin{cases} x, & x \ge 0 \\ 0, & x < 0. \end{cases}$$
The ReLU layer does not change the size of its input.
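The threshold operation is a one-liner in base MATLAB. The input values below are arbitrary and chosen only for illustration.

```matlab
X = [-2 0.5; 3 -0.1];   % arbitrary example input
Y = max(X, 0);          % values below zero become zero, others pass through
disp(Y)
```

The output has the same size as the input, and only the negative entries change.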
There are other nonlinear activation layers that perform different operations and can improve the network accuracy for some applications. For a list of activation layers, see Activation Layers.
Create a cross-channel normalization layer using crossChannelNormalizationLayer.
A channel-wise local response (cross-channel) normalization layer carries out channel-wise normalization.
This layer usually follows the ReLU activation layer. It replaces each element with a normalized value it obtains using the elements from a certain number of neighboring channels (elements in the normalization window). That is, for each element x in the input, trainNetwork computes a normalized value x' using
$$x' = \frac{x}{\left(K + \dfrac{\alpha \cdot ss}{\mathrm{windowChannelSize}}\right)^{\beta}},$$
where K, α, and β are the hyperparameters in the normalization, and ss is the sum of squares of the elements in the normalization window (Krizhevsky et al., 2012). You must specify the size of the normalization window using the windowChannelSize argument of the crossChannelNormalizationLayer function. You can also specify the hyperparameters using the 'Alpha', 'Beta', and 'K' name-value pair arguments.
The previous normalization formula is slightly different than what is presented in Krizhevsky et al. (2012). You can obtain the equivalent formula by multiplying the alpha value by the windowChannelSize.
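The normalization can be sketched in base MATLAB as follows. This is a simplified illustration, not the toolbox implementation: it assumes an odd window centered on each channel, truncates the window at the first and last channels, and uses arbitrary hyperparameter values.

```matlab
X = reshape(1:8, 1, 1, 8);    % 1-by-1 spatial size, 8 channels (made-up data)
windowChannelSize = 3;        % assumed odd so the window is symmetric
alphaVal = 1e-4;              % illustrative hyperparameter values
betaVal  = 0.75;
kVal     = 2;

numChannels = size(X, 3);
halfWindow  = floor((windowChannelSize - 1)/2);
Y = zeros(size(X));
for c = 1:numChannels
    % Neighboring channels in the window, clipped at the borders (assumption).
    idx = max(1, c - halfWindow):min(numChannels, c + halfWindow);
    ss  = sum(X(:,:,idx).^2, 3);    % sum of squares over the window
    Y(:,:,c) = X(:,:,c)./(kVal + alphaVal*ss/windowChannelSize).^betaVal;
end
```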
A 2-D max pooling layer performs downsampling by dividing the input into rectangular pooling regions, then computing the maximum of each region. Create a max pooling layer using maxPooling2dLayer.
A 2-D average pooling layer performs downsampling by dividing the input into rectangular pooling regions, then computing the average values of each region. Create an average pooling layer using averagePooling2dLayer.
Pooling layers follow the convolutional layers for down-sampling, hence, reducing the number of connections to the following layers. They do not perform any learning themselves, but reduce the number of parameters to be learned in the following layers. They also help reduce overfitting.
A max pooling layer returns the maximum values of rectangular regions of its input. The size of the rectangular regions is determined by the poolSize argument of maxPooling2dLayer. For example, if poolSize is [2,3], then the layer returns the maximum value in regions of height 2 and width 3.
An average pooling layer outputs the average values of rectangular regions of its input. The size of the rectangular regions is determined by the poolSize argument of averagePooling2dLayer. For example, if poolSize is [2,3], then the layer returns the average value of regions of height 2 and width 3.
Pooling layers scan through the input horizontally and vertically in step sizes you can specify using the
'Stride' name-value pair argument. If the pool size is smaller than or equal to the stride, then the pooling regions do not overlap.
For nonoverlapping regions (Pool Size and Stride are equal), if the input to the pooling layer is n-by-n, and the pooling region size is h-by-h, then the pooling layer down-samples the regions by h. That is, the output of a max or average pooling layer for one channel of a convolutional layer is n/h-by-n/h. For overlapping regions, the output size of a pooling layer is (Input Size – Pool Size + 2*Padding)/Stride + 1.
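Both cases can be illustrated in base MATLAB. The sketch below uses magic(4) as made-up input and a reshape trick for nonoverlapping max pooling; it is an illustration of the arithmetic, not how the toolbox implements pooling.

```matlab
% Nonoverlapping 2-by-2 max pooling of a 4-by-4 input (n = 4, h = 2).
X = magic(4);
h = 2;
n = size(X, 1);
blocks  = reshape(X, h, n/h, h, n/h);            % split rows and columns into h-sized groups
maxPool = squeeze(max(max(blocks, [], 1), [], 3));   % n/h-by-n/h result
disp(maxPool)

% Output size for overlapping regions (values chosen so the division is exact).
inputSize = 7; poolSize = 3; stride = 2; padding = 0;
outputSize = (inputSize - poolSize + 2*padding)/stride + 1;   % 3
```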
Create a dropout layer using dropoutLayer.
A dropout layer randomly sets input elements to zero with a given probability.
At training time, the layer randomly sets input elements to zero given by the dropout mask rand(size(X))<Probability, where X is the layer input, and then scales the remaining elements by 1/(1-Probability). This operation effectively changes the underlying network architecture between iterations and helps prevent the network from overfitting (Srivastava et al., 2014; Krizhevsky et al., 2012). A higher probability results in more elements being dropped during training. At prediction time, the output of the layer is equal to its input.
Similar to max or average pooling layers, no learning takes place in this layer.
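A minimal base MATLAB sketch of the training-time and prediction-time behavior follows. It assumes the mask is drawn as rand(size(X)) < probability, with true marking the elements to drop; the input data and variable names are made up for illustration.

```matlab
rng(1);
X = ones(1000, 1);          % made-up layer input
probability = 0.5;          % dropout probability

% Training time: drop elements where the mask is true, then rescale the rest.
mask   = rand(size(X)) < probability;   % assumed form of the dropout mask
Ytrain = X;
Ytrain(mask) = 0;
Ytrain = Ytrain/(1 - probability);

% Prediction time: the layer passes its input through unchanged.
Ypredict = X;
```

Scaling by 1/(1 - probability) keeps the expected value of each element the same at training and prediction time.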
Create a fully connected layer using fullyConnectedLayer.
A fully connected layer multiplies the input by a weight matrix and then adds a bias vector.
The convolutional (and down-sampling) layers are followed by one or more fully connected layers.
As the name suggests, all neurons in a fully connected layer connect to all the neurons in the previous layer. This layer combines all of the features (local information) learned by the previous layers across the image to identify the larger patterns. For classification problems, the last fully connected layer combines the features to classify the images. This is the reason that the
outputSize argument of the last fully connected layer of the network is equal to the number of classes of the data set. For regression problems, the output size must be equal to the number of response variables.
You can also adjust the learning rate and the regularization parameters for this layer using
the related name-value pair arguments when creating the fully connected layer. If you choose
not to adjust them, then
trainNetwork uses the global training
parameters defined by the
trainingOptions function. For details on
global and layer training options, see Set Up Parameters and Train Convolutional Neural Network.
A fully connected layer multiplies the input by a weight matrix W and then adds a bias vector b.
If the input to the layer is a sequence (for example, in an LSTM network), then the fully connected layer acts independently on each time step. For example, if the layer before the fully connected layer outputs an array X of size D-by-N-by-S, then the fully connected layer outputs an array Z of size outputSize-by-N-by-S. At time step t, the corresponding entry of Z is $W X_t + b$, where $X_t$ denotes time step t of X.
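The per-time-step behavior can be sketched in base MATLAB, applying the same weight matrix W and bias vector b to every time step. Sizes and random values are arbitrary and chosen only for illustration.

```matlab
rng(0);
D = 4; N = 2; S = 3;        % feature dimension, observations, time steps
outputSize = 5;

X = randn(D, N, S);         % made-up sequence input
W = randn(outputSize, D);   % weight matrix
b = randn(outputSize, 1);   % bias vector

Z = zeros(outputSize, N, S);
for t = 1:S
    Z(:,:,t) = W*X(:,:,t) + b;   % same W and b at every time step
end
disp(size(Z))
```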
A softmax layer applies a softmax function to the input. Create a softmax layer using softmaxLayer.
A classification layer computes the cross-entropy loss for classification and weighted classification tasks with mutually exclusive classes. Create a classification layer using classificationLayer.
For classification problems, a softmax layer and then a classification layer usually follow the final fully connected layer.
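The output unit activation function is the softmax function:
$$y_r(x) = \frac{\exp(a_r(x))}{\sum_{j=1}^{K} \exp(a_j(x))},$$
where $0 \le y_r \le 1$ and $\sum_{j=1}^{K} y_j = 1$.
The softmax function is the output unit activation function after the last fully connected layer for multi-class classification problems:
$$P(c_r \mid x, \theta) = \frac{P(x, \theta \mid c_r)\,P(c_r)}{\sum_{j=1}^{K} P(x, \theta \mid c_j)\,P(c_j)} = \frac{\exp(a_r(x, \theta))}{\sum_{j=1}^{K} \exp(a_j(x, \theta))},$$
where $0 \le P(c_r \mid x, \theta) \le 1$ and $\sum_{j=1}^{K} P(c_j \mid x, \theta) = 1$. Moreover, $a_r = \ln\left(P(x, \theta \mid c_r)\,P(c_r)\right)$, $P(x, \theta \mid c_r)$ is the conditional probability of the sample given class r, and $P(c_r)$ is the class prior probability.
The softmax function is also known as the normalized exponential and can be considered the multi-class generalization of the logistic sigmoid function (Bishop, 2006).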
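A short base MATLAB sketch shows the normalized exponential and its relation to the logistic sigmoid. The activation values are made up, and subtracting the maximum before exponentiating is a common numerical safeguard that does not change the result.

```matlab
a = [2; 1; 0.1];            % made-up activations for three classes
expA = exp(a - max(a));     % shift for numerical stability
y = expA/sum(expA);         % softmax outputs, nonnegative and summing to 1
disp(y')

% With two classes and activations [z; 0], the first softmax output
% equals the logistic sigmoid of z.
z = 1.5;
twoClass = exp([z; 0]);
twoClass = twoClass/sum(twoClass);
sigmoid  = 1/(1 + exp(-z));
```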
For typical classification networks, the classification layer usually follows a softmax layer. In the classification layer, trainNetwork takes the values from the softmax function and assigns each input to one of the K mutually exclusive classes using the cross entropy function for a 1-of-K coding scheme (Bishop, 2006):
$$\mathrm{loss} = -\frac{1}{N} \sum_{n=1}^{N} \sum_{i=1}^{K} w_i\, t_{ni} \ln y_{ni},$$
where N is the number of samples, K is the number of classes, $w_i$ is the weight for class i, $t_{ni}$ is the indicator that the nth sample belongs to the ith class, and $y_{ni}$ is the output for sample n for class i, which in this case is the value from the softmax function. In other words, $y_{ni}$ is the probability that the network associates the nth input with class i.
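The weighted cross entropy for a 1-of-K coding scheme can be evaluated directly in base MATLAB. The sketch below uses made-up softmax outputs, one-hot targets, and uniform class weights, and averages over the N samples.

```matlab
Y = [0.7 0.2 0.1;           % N-by-K softmax outputs (made-up values)
     0.1 0.8 0.1];
T = [1 0 0;                 % N-by-K indicators, one 1 per row
     0 1 0];
w = [1 1 1];                % class weights (uniform here)

N = size(Y, 1);
loss = -sum(sum(w.*T.*log(Y)))/N;
fprintf('Cross-entropy loss: %.4f\n', loss);
```

Only the predicted probability of the true class contributes to each sample's term, so here the loss equals -(ln 0.7 + ln 0.8)/2.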
Create a regression layer using regressionLayer.
A regression layer computes the half-mean-squared-error loss for regression tasks. For typical regression problems, a regression layer must follow the final fully connected layer.
For a single observation, the mean-squared-error is given by:
$$\mathrm{MSE} = \sum_{i=1}^{R} \frac{(t_i - y_i)^2}{R},$$
where R is the number of responses, $t_i$ is the target output, and $y_i$ is the network's prediction for response i.
For image and sequence-to-one regression networks, the loss function of the regression layer is the half-mean-squared-error of the predicted responses, not normalized by R:
$$\mathrm{loss} = \frac{1}{2} \sum_{i=1}^{R} (t_i - y_i)^2.$$
For image-to-image regression networks, the loss function of the regression layer is the half-mean-squared-error of the predicted responses for each pixel, not normalized by R:
$$\mathrm{loss} = \frac{1}{2} \sum_{p=1}^{HWC} (t_p - y_p)^2,$$
where H, W, and C denote the height, width, and number of channels of the output respectively, and p indexes into each element (pixel) of t and y linearly.
For sequence-to-sequence regression networks, the loss function of the regression layer is the half-mean-squared-error of the predicted responses for each time step, not normalized by R:
$$\mathrm{loss} = \frac{1}{2S} \sum_{i=1}^{S} \sum_{j=1}^{R} (t_{ij} - y_{ij})^2,$$
where S is the sequence length.
When training, the software calculates the mean loss over the observations in the mini-batch.
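For the image and sequence-to-one case, the per-observation loss and the mini-batch mean can be sketched in base MATLAB. The targets and predictions are made-up values, with one observation per row.

```matlab
T = [1 2 3;                 % N-by-R targets: 2 observations, 3 responses
     4 5 6];
Y = [1.5 2 2;               % N-by-R network predictions (made-up values)
     4   4 6];

% Half of the summed squared error per observation, not normalized by R.
perObservation = 0.5*sum((T - Y).^2, 2);   % [0.625; 0.5]

% Mean loss over the observations in the mini-batch.
miniBatchLoss = mean(perObservation);      % 0.5625
```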
Murphy, K. P. Machine Learning: A Probabilistic Perspective. Cambridge, Massachusetts: The MIT Press, 2012.
Krizhevsky, A., I. Sutskever, and G. E. Hinton. "ImageNet Classification with Deep Convolutional Neural Networks." Advances in Neural Information Processing Systems. Vol. 25, 2012.
LeCun, Y., B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, L. D. Jackel, et al. "Handwritten Digit Recognition with a Back-propagation Network." In Advances of Neural Information Processing Systems, 1990.
LeCun, Y., L. Bottou, Y. Bengio, and P. Haffner. "Gradient-based Learning Applied to Document Recognition." Proceedings of the IEEE. Vol. 86, pp. 2278–2324, 1998.
Nair, V., and G. E. Hinton. "Rectified Linear Units Improve Restricted Boltzmann Machines." In Proc. 27th International Conference on Machine Learning, 2010.
Nagi, J., F. Ducatelle, G. A. Di Caro, D. Ciresan, U. Meier, A. Giusti, F. Nagi, J. Schmidhuber, and L. M. Gambardella. "Max-Pooling Convolutional Neural Networks for Vision-based Hand Gesture Recognition." IEEE International Conference on Signal and Image Processing Applications (ICSIPA2011), 2011.
Srivastava, N., G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. "Dropout: A Simple Way to Prevent Neural Networks from Overfitting." Journal of Machine Learning Research. Vol. 15, pp. 1929–1958, 2014.
Bishop, C. M. Pattern Recognition and Machine Learning. Springer, New York, NY, 2006.
Ioffe, S., and C. Szegedy. "Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift." Preprint, submitted March 2, 2015. https://arxiv.org/abs/1502.03167.