
The first step of creating and training a new convolutional neural network (ConvNet) is to define the network architecture. This topic explains the details of ConvNet layers, and the order they appear in a ConvNet.

The architecture of a ConvNet can vary depending on the types and numbers of layers included, which in turn depend on the particular application or data. For example, if you have categorical responses, you must have a softmax layer and a classification layer, whereas if your response is continuous, you must have a regression layer at the end of the network. A smaller network with only one or two convolutional layers might be sufficient to learn from a small amount of grayscale image data. On the other hand, for more complex data with millions of color images, you might need a more complicated network with multiple convolutional and fully connected layers.

You can define the layers of a convolutional neural network in MATLAB® in an array format, for example:

```matlab
layers = [
    imageInputLayer([28 28 1])
    convolution2dLayer(3,16,'Padding',1)
    batchNormalizationLayer
    reluLayer
    maxPooling2dLayer(2,'Stride',2)
    convolution2dLayer(3,32,'Padding',1)
    batchNormalizationLayer
    reluLayer
    fullyConnectedLayer(10)
    softmaxLayer
    classificationLayer];
```

`layers` is an array of `Layer` objects. `layers` becomes an input for the training function `trainNetwork`.

The image input layer defines the size of the input images of a convolutional neural network and contains the raw pixel values of the images. You can add an input layer using the `imageInputLayer` function. Specify the image size using the `inputSize` argument. The size of an image corresponds to the height, width, and number of color channels of that image. For example, for a grayscale image the number of channels is 1, and for a color image it is 3.

This layer can also perform data normalization by subtracting the mean image of the training set from every input image.
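
For example, a minimal sketch of an input layer for 28-by-28 grayscale images, with the default zero-center normalization written out explicitly:

```matlab
% Input layer for 28-by-28 grayscale images (1 channel).
% 'zerocenter' subtracts the mean training image from every input (the default).
layer = imageInputLayer([28 28 1],'Normalization','zerocenter');
```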

**Filters and Stride:** A convolutional layer consists of neurons that connect to subregions of the input images or the outputs of the previous layer. A convolutional layer learns the features localized by these regions while scanning through an image. You can specify the size of these regions using the `filterSize` input argument when you create the layer using the `convolution2dLayer` function.

For each region, the `trainNetwork` function computes a dot product of the weights and the input, and then adds a bias term. A set of weights that is applied to a region in the image is called a *filter*. The filter moves along the input image vertically and horizontally, repeating the same computation for each region, that is, convolving the input. The step size with which it moves is called a *stride*. You can specify this step size with the `'Stride'` name-value pair argument. The local regions that the neurons connect to might overlap, depending on the `filterSize` and `'Stride'` values.
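
For instance, a sketch of a convolutional layer whose filters are 5-by-5 and move in steps of 2 pixels (the specific values are illustrative):

```matlab
% 16 filters of size 5-by-5, moving 2 pixels at a time in each direction.
layer = convolution2dLayer(5,16,'Stride',2);
```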

The number of weights used for a filter is *h* × *w* × *c*, where *h* is the height and *w* is the width of the filter, and *c* is the number of channels in the input (for example, if the input is a color image, the number of color channels is 3). The number of filters determines the number of channels in the output of a convolutional layer. Specify the number of filters using the `numFilters` argument of `convolution2dLayer`.

**Feature Maps:** As a filter moves along the input, it uses the same set of weights and the same bias for the convolution, forming a *feature map*. Hence, the number of feature maps a convolutional layer has is equal to the number of filters (the number of output channels). Each feature map has a different set of weights and a bias. So, the total number of parameters in a convolutional layer is ((*h* × *w* × *c* + 1) × *Number of Filters*), where 1 is for the bias.

**Zero Padding:** You can also apply zero padding to input image borders vertically and horizontally using the `'Padding'` name-value pair argument. Padding is the addition of rows or columns of zeros to the borders of an image input. It helps you control the output size of the layer.

**Output Size:** The output height and width of a convolutional layer is (*Input Size* – *Filter Size* + 2 × *Padding*)/*Stride* + 1. This value must be an integer for the whole image to be fully covered. If the combination of these parameters does not lead the image to be fully covered, the software by default ignores the remaining part of the image along the right and bottom edges in the convolution.

**Number of Neurons:** The product of the output height and width gives the total number of neurons in a feature map, say *Map Size*. The total number of neurons (output size) in a convolutional layer, then, is *Map Size* × *Number of Filters*.

For example, suppose that the input image is a 28-by-28-by-3 color image. For a convolutional layer with 16 filters and a filter size of 8-by-8, the number of weights per filter is 8*8*3 = 192, and the total number of parameters in the layer is (192 + 1) * 16 = 3088. Assuming the stride is 4 in each direction and there is no zero padding, each feature map is 6-by-6 ((28 – 8 + 0)/4 + 1 = 6). Then, the total number of neurons in the layer is 6*6*16 = 576.
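
As a check, you can reproduce this arithmetic in MATLAB using the values from the example above:

```matlab
% Parameter and neuron counts for the example convolutional layer.
filterSize = [8 8]; numChannels = 3; numFilters = 16;
inputSize  = 28;    stride = 4;      padding = 0;

weightsPerFilter = prod(filterSize)*numChannels                  % 192
numParams  = (weightsPerFilter + 1)*numFilters                   % 3088, the +1 is the bias
mapSize    = (inputSize - filterSize(1) + 2*padding)/stride + 1  % 6
numNeurons = mapSize^2*numFilters                                % 576
```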

**Learning Parameters:** You can also adjust the learning rates and regularization parameters for this layer using the related name-value pair arguments while defining the convolutional layer. If you choose not to adjust them, `trainNetwork` uses the global training parameters defined by the `trainingOptions` function. For details on global and layer training options, see Set Up Parameters and Train Convolutional Neural Network.
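
For example, a sketch of a convolutional layer whose weights and biases learn at twice the global rate (the factor 2 here is illustrative, not a recommendation):

```matlab
% Per-layer learning rate factors multiply the global learning rate.
layer = convolution2dLayer(3,16,'Padding',1, ...
    'WeightLearnRateFactor',2, ...
    'BiasLearnRateFactor',2);
```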

A convolutional neural network can consist of one or multiple convolutional layers. The number of convolutional layers depends on the amount and complexity of the data.

Use batch normalization layers between convolutional layers and nonlinearities, such as ReLU layers, to speed up network training and reduce the sensitivity to network initialization. The layer first normalizes the activations of each channel by subtracting the mini-batch mean and dividing by the mini-batch standard deviation. Then, the layer shifts the input by an offset *β* and scales it by a scale factor *γ*. *β* and *γ* are themselves learnable parameters that are updated during network training. Create a batch normalization layer using `batchNormalizationLayer`.

Batch normalization layers normalize the activations and gradients propagating through a neural network, making network training an easier optimization problem. To take full advantage of this fact, you can try increasing the learning rate. Since the optimization problem is easier, the parameter updates can be larger and the network can learn faster. You can also try reducing the L₂ and dropout regularization. With batch normalization layers, the activations of a specific image are not deterministic, but instead depend on which images happen to appear in the same mini-batch. To take full advantage of this regularizing effect, try shuffling the training data before every training epoch. To specify how often to shuffle the data during training, use the `'Shuffle'` name-value pair argument of `trainingOptions`.
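
For example, a sketch of training options that pair batch normalization with a larger learning rate and per-epoch shuffling (the rate 0.01 is an illustrative value):

```matlab
% With batch normalization, the initial learning rate can often be larger.
options = trainingOptions('sgdm', ...
    'InitialLearnRate',0.01, ...
    'Shuffle','every-epoch');   % reshuffle the training data every epoch
```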

Convolutional and batch normalization layers are usually followed by a nonlinear activation function such as a rectified linear unit (ReLU), specified by a ReLU layer. Create a ReLU layer using the `reluLayer` function. A ReLU layer performs a threshold operation on each element, where any input value less than zero is set to zero, that is,

$$f\left(x\right)=\begin{cases}x, & x\ge 0\\ 0, & x<0.\end{cases}$$

There are extensions of the standard ReLU layer that perform slightly different operations and can improve performance for some applications. A leaky ReLU layer multiplies input values less than zero by a fixed scalar, allowing negative inputs to “leak” into the output. Use the `leakyReluLayer` function to create a leaky ReLU layer. A clipped ReLU layer sets negative inputs to zero, but also sets input values above a *clipping ceiling* equal to that clipping ceiling. This clipping prevents the output from becoming too large. Use the `clippedReluLayer` function to create a clipped ReLU layer.
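
For instance, sketches of the two variants (the scale 0.01 and ceiling 10 are illustrative values):

```matlab
layer1 = leakyReluLayer(0.01);  % f(x) = 0.01*x for x < 0
layer2 = clippedReluLayer(10);  % negative inputs set to 0, outputs clipped at 10
```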

This layer performs a channel-wise local response normalization. It usually follows the ReLU activation layer. Create this layer using the `crossChannelNormalizationLayer` function. This layer replaces each element with a normalized value it obtains using the elements from a certain number of neighboring channels (elements in the normalization window). That is, for each element $$x$$ in the input, `trainNetwork` computes a normalized value $${x}^{\text{'}}$$ using

$${x}^{\text{'}}=\frac{x}{{\left(K+\frac{\alpha *ss}{windowChannelSize}\right)}^{\beta}},$$

where *K*, *α*, and *β* are hyperparameters in the normalization, and *ss* is the sum of squares of the elements in the normalization window [2]. You can specify the size of the normalization window using the `windowChannelSize` argument of the `crossChannelNormalizationLayer` function. You can also specify the hyperparameters using the `Alpha`, `Beta`, and `K` name-value pair arguments. The previous normalization formula is slightly different from what is presented in [2]. You can obtain the equivalent formula by multiplying the `alpha` value by the `windowChannelSize`.
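
For example, a sketch of a cross-channel normalization layer with a 5-channel window and explicitly set hyperparameters (the values shown are illustrative):

```matlab
% Normalize each element using its 5 neighboring channels.
layer = crossChannelNormalizationLayer(5,'Alpha',1e-4,'Beta',0.75,'K',2);
```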

Max- and average-pooling layers follow the convolutional layers for down-sampling, hence reducing the number of connections to the following layers (usually a fully connected layer). They do not perform any learning themselves, but reduce the number of parameters to be learned in the following layers. They also help reduce overfitting. Create these layers using the `maxPooling2dLayer` and `averagePooling2dLayer` functions.

A max-pooling layer returns the maximum values of rectangular regions of its input. The size of the rectangular regions is determined by the `poolSize` argument of `maxPooling2dLayer`. For example, if `poolSize` equals `[2,3]`, then the layer returns the maximum value in regions of height 2 and width 3.

Similarly, the average-pooling layer outputs the average values of rectangular regions of its input. The size of the rectangular regions is determined by the `poolSize` argument of `averagePooling2dLayer`. For example, if `poolSize` is `[2,3]`, then the layer returns the average value of regions of height 2 and width 3. The `maxPooling2dLayer` and `averagePooling2dLayer` functions scan through the input horizontally and vertically in step sizes you can specify using the `'Stride'` name-value pair argument of either function. If the `poolSize` is smaller than or equal to the `Stride`, then the pooling regions do not overlap.
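
For example, a sketch of a max-pooling layer with nonoverlapping regions (`Stride` equal to `poolSize`):

```matlab
% Take the maximum over nonoverlapping 2-by-3 regions of the input.
layer = maxPooling2dLayer([2 3],'Stride',[2 3]);
```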

For nonoverlapping regions (`poolSize` and `Stride` are equal), if the input to the pooling layer is *n*-by-*n*, and the pooling region size is *h*-by-*h*, then the pooling layer down-samples the regions by *h* [6]. That is, the output of a max- or average-pooling layer for one channel of a convolutional layer is *n*/*h*-by-*n*/*h*. For overlapping regions, the output size of a pooling layer is (*Input Size* – *Pool Size* + 2 × *Padding*)/*Stride* + 1.

A dropout layer randomly sets the layer’s input elements to zero with a given probability. Create a dropout layer using the `dropoutLayer` function.

Although the output of a dropout layer is equal in size to its input, this operation corresponds to temporarily dropping a randomly chosen unit and all of its connections from the network during training. So, for each new input element, `trainNetwork` randomly selects a subset of neurons, forming a different layer architecture. These architectures use common weights, but because the learning does not depend on specific neurons and connections, the dropout layer might help prevent overfitting [7], [2]. Similar to max- or average-pooling layers, no learning takes place in this layer.
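
For example, a sketch of a dropout layer (the probability 0.5 is an illustrative value, and also the function's default):

```matlab
% Set each input element to zero with probability 0.5 during training.
layer = dropoutLayer(0.5);
```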

The convolutional (and down-sampling) layers are followed by one or more fully connected layers. Create a fully connected layer using the `fullyConnectedLayer` function.

As the name suggests, all neurons in a fully connected layer connect to all the neurons in the previous layer. This layer combines all of the features (local information) learned by the previous layers across the image to identify the larger patterns. For classification problems, the last fully connected layer combines the features to classify the images. This is the reason that the `outputSize` argument of the last fully connected layer of the network is equal to the number of classes of the data set. For regression problems, the output size must be equal to the number of response variables.
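
For instance, a sketch of the last fully connected layer of a classification network, assuming 10 classes:

```matlab
% outputSize equals the number of classes (10 assumed here).
layer = fullyConnectedLayer(10);
```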

You can also adjust the learning rate and the regularization parameters for this layer using the related name-value pair arguments when creating the fully connected layer. If you choose not to adjust them, then `trainNetwork` uses the global training parameters defined by the `trainingOptions` function. For details on global and layer training options, see Set Up Parameters and Train Convolutional Neural Network.

For classification problems, a softmax layer and then a classification layer must follow the final fully connected layer. You can create these layers using the `softmaxLayer` and `classificationLayer` functions, respectively.
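
For example, a sketch of the classification tail of a network, assuming 10 classes:

```matlab
% The last three layers of a classification network (10 classes assumed).
layers = [ ...
    fullyConnectedLayer(10)
    softmaxLayer
    classificationLayer];
```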

The output unit activation function is the softmax function:

$${y}_{r}\left(x\right)=\frac{\mathrm{exp}\left({a}_{r}\left(x\right)\right)}{{\displaystyle \sum _{j=1}^{k}\mathrm{exp}\left({a}_{j}\left(x\right)\right)}},$$

where $$0\le {y}_{r}\le 1$$ and $$\sum _{j=1}^{k}{y}_{j}=1$$.

The softmax function is the output unit activation function after the last fully connected layer for multi-class classification problems:

$$P\left({c}_{r}|x,\theta \right)=\frac{P\left(x,\theta |{c}_{r}\right)P\left({c}_{r}\right)}{{\displaystyle \sum _{j=1}^{k}P\left(x,\theta |{c}_{j}\right)P\left({c}_{j}\right)}}=\frac{\mathrm{exp}\left({a}_{r}\left(x,\theta \right)\right)}{{\displaystyle \sum _{j=1}^{k}\mathrm{exp}\left({a}_{j}\left(x,\theta \right)\right)}},$$

where $$P\left(x,\theta |{c}_{r}\right)$$ is the conditional probability of the sample given class $$r$$, and $$P\left({c}_{r}\right)$$ is the class prior probability.

The softmax function is also known as the *normalized
exponential* and can be considered the
multi-class generalization of the logistic sigmoid function
[8].

A classification output layer must follow the softmax layer. In the classification output layer, `trainNetwork` takes the values from the softmax function and assigns each input to one of the *k* mutually exclusive classes using the cross entropy function for a 1-of-*k* coding scheme [8]:

$$E\left(\theta \right)=-{\displaystyle \sum _{i=1}^{n}{\displaystyle \sum _{j=1}^{k}{t}_{ij}\mathrm{ln}{y}_{j}\left({x}_{i},\theta \right)}},$$

where $${t}_{ij}$$ is the indicator that the $$i$$th sample belongs to the $$j$$th class, and $${y}_{j}\left({x}_{i},\theta \right)$$ is the output for sample $$i$$, that is, the probability that the network associates the $$i$$th input with class $$j$$.

You can also use ConvNets for regression problems, where the target (output) variable is continuous. In such cases, a regression output layer must follow the final fully connected layer. You can create a regression layer using the `regressionLayer` function. The default loss function for a regression layer is the mean squared error:

$$MSE=E\left(\theta \right)={\displaystyle \sum _{i=1}^{n}\frac{{\left({t}_{i}-{y}_{i}\right)}^{2}}{n}},$$

where $${t}_{i}$$ is the target output, and $${y}_{i}$$ is the network’s prediction for the
response variable corresponding to observation
*i*.
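
For example, a sketch of the regression tail of a network, assuming a single continuous response variable:

```matlab
% The last two layers of a regression network (one response assumed).
layers = [ ...
    fullyConnectedLayer(1)
    regressionLayer];
```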

[1] Murphy, K. P. *Machine Learning: A Probabilistic Perspective*. Cambridge, Massachusetts: The MIT Press, 2012.

[2] Krizhevsky, A., I. Sutskever, and G. E. Hinton. "ImageNet Classification with Deep Convolutional Neural Networks." *Advances in Neural Information Processing Systems*. Vol. 25, 2012.

[3] LeCun, Y., B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, L. D. Jackel, et al. "Handwritten Digit Recognition with a Back-propagation Network." In *Advances in Neural Information Processing Systems*, 1990.

[4] LeCun, Y., L. Bottou, Y. Bengio, and P. Haffner. "Gradient-based Learning Applied to Document Recognition." *Proceedings of the IEEE*. Vol. 86, pp. 2278–2324, 1998.

[5] Nair, V., and G. E. Hinton. "Rectified Linear Units Improve Restricted Boltzmann Machines." In *Proc. 27th International Conference on Machine Learning*, 2010.

[6] Nagi, J., F. Ducatelle, G. A. Di Caro, D. Ciresan, U. Meier, A. Giusti, F. Nagi, J. Schmidhuber, and L. M. Gambardella. "Max-Pooling Convolutional Neural Networks for Vision-based Hand Gesture Recognition." *IEEE International Conference on Signal and Image Processing Applications (ICSIPA 2011)*, 2011.

[7] Srivastava, N., G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. "Dropout: A Simple Way to Prevent Neural Networks from Overfitting." *Journal of Machine Learning Research*. Vol. 15, pp. 1929–1958, 2014.

[8] Bishop, C. M. *Pattern Recognition and Machine Learning*. New York, NY: Springer, 2006.

[9] Ioffe, S., and C. Szegedy. "Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift." Preprint, arXiv:1502.03167, 2015.

`averagePooling2dLayer` | `batchNormalizationLayer` | `classificationLayer` | `clippedReluLayer` | `convolution2dLayer` | `crossChannelNormalizationLayer` | `dropoutLayer` | `fullyConnectedLayer` | `imageInputLayer` | `leakyReluLayer` | `maxPooling2dLayer` | `regressionLayer` | `reluLayer` | `softmaxLayer` | `trainNetwork` | `trainingOptions`
