
Create a Semantic Segmentation Network

Create a simple semantic segmentation network and learn about the common layers found in many semantic segmentation networks. A common pattern in semantic segmentation networks is to downsample an image through a series of convolution and ReLU layers, and then upsample the output to match the input size. This operation is analogous to standard scale-space analysis using image pyramids. During this process, however, the network performs the operations using nonlinear filters optimized for the specific set of classes you want to segment.

Create an Image Input Layer

A semantic segmentation network starts with an imageInputLayer, which defines the smallest image size the network can process. Most semantic segmentation networks are fully convolutional, which means they can process images that are larger than the specified input size. Here, the input size is set to [32 32 3], so the network can process RGB images that are 32-by-32 or larger, such as 64-by-64 images.

inputSize = [32 32 3];
imgLayer = imageInputLayer(inputSize)
imgLayer = 
  ImageInputLayer with properties:

                Name: ''
           InputSize: [32 32 3]

   Hyperparameters
    DataAugmentation: 'none'
       Normalization: 'zerocenter'

Create Downsampling Network

Start with the convolution and ReLU layers. The convolution layer padding is selected such that the output size of the convolution layer is the same as the input size. This makes it easier to construct a network because the input and output sizes between most layers remain the same as you progress through the network.

filterSize = 3;
numFilters = 32;
conv = convolution2dLayer(filterSize,numFilters,'Padding',1);
relu = reluLayer();
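
You can sanity-check this padding choice with the usual convolution output-size formula, outputSize = floor((inputSize + 2*padding - filterSize)/stride) + 1. The quick check below is a sketch based on that assumed formula, not part of the original example:

% With filterSize = 3, padding = 1, and stride = 1, the spatial size is preserved.
floor((32 + 2*1 - 3)/1) + 1   % ans = 32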

Downsampling is performed using a max pooling layer. Create a max pooling layer that downsamples the input by a factor of 2 by setting the 'Stride' parameter to 2.

poolSize = 2;
maxPoolDownsample2x = maxPooling2dLayer(poolSize,'Stride',2);
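
The analogous pooling formula, outputSize = floor((inputSize - poolSize)/stride) + 1, shows the halving; again, a quick sketch using the assumed formula:

% With poolSize = 2 and stride = 2, a 32-pixel dimension becomes 16.
floor((32 - 2)/2) + 1   % ans = 16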

Stack the convolution, ReLU, and max pooling layers to create a network that downsamples its input by a factor of 4.

downsamplingLayers = [
    conv
    relu
    maxPoolDownsample2x
    conv
    relu
    maxPoolDownsample2x
    ]
downsamplingLayers = 
  6x1 Layer array with layers:

     1   ''   Convolution   32 3x3 convolutions with stride [1  1] and padding [1  1  1  1]
     2   ''   ReLU          ReLU
     3   ''   Max Pooling   2x2 max pooling with stride [2  2] and padding [0  0  0  0]
     4   ''   Convolution   32 3x3 convolutions with stride [1  1] and padding [1  1  1  1]
     5   ''   ReLU          ReLU
     6   ''   Max Pooling   2x2 max pooling with stride [2  2] and padding [0  0  0  0]

Create Upsampling Network

The upsampling is done using a transposed convolution layer (also commonly referred to as a "deconv" or "deconvolution" layer). When a transposed convolution is used for upsampling, it performs the upsampling and the filtering at the same time.

Create a transposed convolution layer to upsample by 2.

filterSize = 4;
transposedConvUpsample2x = transposedConv2dLayer(filterSize,numFilters,'Stride',2,'Cropping',1);

The 'Cropping' parameter is set to 1 so that the output size is exactly twice the input size.
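
To see why, consider the transposed convolution output-size arithmetic, outputSize = stride*(inputSize - 1) + filterSize - 2*cropping (an assumed formula, used here only as a sketch):

% With stride = 2, filterSize = 4, and cropping = 1, the expression
% 2*(n - 1) + 4 - 2*1 simplifies to 2*n, exactly twice the input size.
n = 8;
2*(n - 1) + 4 - 2*1   % ans = 16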

Stack the transposed convolution and ReLU layers. An input to this set of layers is upsampled by a factor of 4.

upsamplingLayers = [
    transposedConvUpsample2x
    relu
    transposedConvUpsample2x
    relu
    ]
upsamplingLayers = 
  4x1 Layer array with layers:

     1   ''   Transposed Convolution   32 4x4 transposed convolutions with stride [2  2] and output cropping [1  1]
     2   ''   ReLU                     ReLU
     3   ''   Transposed Convolution   32 4x4 transposed convolutions with stride [2  2] and output cropping [1  1]
     4   ''   ReLU                     ReLU

Create a Pixel Classification Layer

The final set of layers is responsible for making pixel classifications. These final layers process an input that has the same spatial dimensions (height and width) as the input image. However, the number of channels (the third dimension) is larger, equal to the number of filters in the last transposed convolution layer. This third dimension must be squeezed down to the number of classes you want to segment. You can do this using a 1-by-1 convolution layer whose number of filters equals the number of classes, for example, 3.

Create a convolution layer to combine the third dimension of the input feature maps down to the number of classes.

numClasses = 3;
conv1x1 = convolution2dLayer(1,numClasses);
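
Conceptually, a 1-by-1 convolution acts as a per-pixel linear map across channels. The following sketch illustrates the idea with plain matrix arithmetic; the names featureMap, W, and b are hypothetical, and this is not how the layer is implemented internally:

% Map 32 channels down to 3 per-pixel class scores.
featureMap = rand(32,32,32);               % H-by-W-by-32 feature maps
W = rand(32,3); b = rand(1,3);             % 1-by-1 convolution weights and bias
scores = reshape(featureMap,[],32)*W + b;  % (H*W)-by-3 per-pixel scores
scores = reshape(scores,32,32,3);          % back to H-by-W-by-numClasses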

Following this 1-by-1 convolution layer are the softmax and pixel classification layers. These two layers combine to predict the categorical label for each image pixel.

finalLayers = [
    conv1x1
    softmaxLayer()
    pixelClassificationLayer()
    ]
finalLayers = 
  3x1 Layer array with layers:

     1   ''   Convolution                  3 1x1 convolutions with stride [1  1] and padding [0  0  0  0]
     2   ''   Softmax                      softmax
     3   ''   Pixel Classification Layer   Cross-entropy loss 
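
If the classes have imbalanced pixel frequencies, pixelClassificationLayer also accepts 'Classes' and 'ClassWeights' name-value arguments. The class names and weights below are hypothetical:

classWeights = [0.2 0.3 0.5];
pxLayer = pixelClassificationLayer('Classes',["c1" "c2" "c3"],'ClassWeights',classWeights);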

Stack All Layers

Stack all the layers to complete the semantic segmentation network.

net = [
    imgLayer    
    downsamplingLayers
    upsamplingLayers
    finalLayers
    ]
net = 
  14x1 Layer array with layers:

     1   ''   Image Input                  32x32x3 images with 'zerocenter' normalization
     2   ''   Convolution                  32 3x3 convolutions with stride [1  1] and padding [1  1  1  1]
     3   ''   ReLU                         ReLU
     4   ''   Max Pooling                  2x2 max pooling with stride [2  2] and padding [0  0  0  0]
     5   ''   Convolution                  32 3x3 convolutions with stride [1  1] and padding [1  1  1  1]
     6   ''   ReLU                         ReLU
     7   ''   Max Pooling                  2x2 max pooling with stride [2  2] and padding [0  0  0  0]
     8   ''   Transposed Convolution       32 4x4 transposed convolutions with stride [2  2] and output cropping [1  1]
     9   ''   ReLU                         ReLU
    10   ''   Transposed Convolution       32 4x4 transposed convolutions with stride [2  2] and output cropping [1  1]
    11   ''   ReLU                         ReLU
    12   ''   Convolution                  3 1x1 convolutions with stride [1  1] and padding [0  0  0  0]
    13   ''   Softmax                      softmax
    14   ''   Pixel Classification Layer   Cross-entropy loss 

This network is ready to be trained using trainNetwork from Deep Learning Toolbox™.
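
As a minimal training sketch, assuming you have ground truth pixel labels on disk; the folder names and class names below are hypothetical placeholders:

% Pair images with their pixel labels and train the network.
imageDir = fullfile('data','images');   % hypothetical image folder
labelDir = fullfile('data','labels');   % hypothetical pixel label folder
classNames = ["class1" "class2" "class3"];
labelIDs = [1 2 3];
imds = imageDatastore(imageDir);
pxds = pixelLabelDatastore(labelDir,classNames,labelIDs);
ds = pixelLabelImageDatastore(imds,pxds);
opts = trainingOptions('sgdm','InitialLearnRate',1e-3,'MaxEpochs',20);
trainedNet = trainNetwork(ds,net,opts);

You can also inspect the layer-by-layer output sizes with analyzeNetwork(net) before training.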