
Create a Semantic Segmentation Network

Create a simple semantic segmentation network and learn about the common layers found in many semantic segmentation networks. A common pattern in semantic segmentation networks is to downsample an image through a series of convolution and ReLU layers, and then upsample the output to match the input size. This operation is analogous to standard scale-space analysis using image pyramids. During this process, however, the network performs the operations using nonlinear filters optimized for the specific set of classes you want to segment.

Create an Image Input Layer

A semantic segmentation network starts with an imageInputLayer, which defines the smallest image size the network can process. Most semantic segmentation networks are fully convolutional, which means they can process images that are larger than the specified input size. Here, the input size is set to [32 32 3], so the network can process RGB images that are 32-by-32 or larger, such as 64-by-64 images.

inputSize = [32 32 3];
imgLayer = imageInputLayer(inputSize)
imgLayer = 
  ImageInputLayer with properties:

                Name: ''
           InputSize: [32 32 3]

   Hyperparameters
    DataAugmentation: 'none'
       Normalization: 'zerocenter'

Create Downsampling Network

Start with the convolution and ReLU layers. The convolution layer padding is selected such that the output size of the convolution layer is the same as the input size. This makes it easier to construct a network because the input and output sizes between most layers remain the same as you progress through the network.

filterSize = 3;
numFilters = 32;
conv = convolution2dLayer(filterSize,numFilters,'Padding',1);
relu = reluLayer();
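
You can sanity-check this padding choice with the usual convolution output-size formula, outputSize = floor((inputSize + 2*padding - filterSize)/stride) + 1. The quick check below is a sketch based on that assumed formula, not part of the original example:

% With filterSize = 3, padding = 1, and stride = 1, the spatial size is preserved.
floor((32 + 2*1 - 3)/1) + 1   % ans = 32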

Downsampling is performed using a max pooling layer. Create a max pooling layer that downsamples the input by a factor of 2 by setting the 'Stride' parameter to 2.

poolSize = 2;
maxPoolDownsample2x = maxPooling2dLayer(poolSize,'Stride',2);
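
The analogous pooling formula, outputSize = floor((inputSize - poolSize)/stride) + 1, shows the halving; again, a quick sketch using the assumed formula:

% With poolSize = 2 and stride = 2, a 32-pixel dimension becomes 16.
floor((32 - 2)/2) + 1   % ans = 16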

Stack the convolution, ReLU, and max pooling layers to create a network that downsamples its input by a factor of 4.

downsamplingLayers = [
    conv
    relu
    maxPoolDownsample2x
    conv
    relu
    maxPoolDownsample2x
    ]
downsamplingLayers = 
  6x1 Layer array with layers:

     1   ''   Convolution   32 3x3 convolutions with stride [1  1] and padding [1  1  1  1]
     2   ''   ReLU          ReLU
     3   ''   Max Pooling   2x2 max pooling with stride [2  2] and padding [0  0  0  0]
     4   ''   Convolution   32 3x3 convolutions with stride [1  1] and padding [1  1  1  1]
     5   ''   ReLU          ReLU
     6   ''   Max Pooling   2x2 max pooling with stride [2  2] and padding [0  0  0  0]

Create Upsampling Network

The upsampling is done using a transposed convolution layer (also commonly referred to as a "deconv" or "deconvolution" layer). When a transposed convolution is used for upsampling, it performs the upsampling and the filtering at the same time.

Create a transposed convolution layer to upsample by 2.

filterSize = 4;
transposedConvUpsample2x = transposedConv2dLayer(filterSize,numFilters,'Stride',2,'Cropping',1);

The 'Cropping' parameter is set to 1 so that the output size is exactly twice the input size.
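
To see why, consider the transposed convolution output-size arithmetic, outputSize = stride*(inputSize - 1) + filterSize - 2*cropping (an assumed formula, used here only as a sketch):

% With stride = 2, filterSize = 4, and cropping = 1, the expression
% 2*(n - 1) + 4 - 2*1 simplifies to 2*n, exactly twice the input size.
n = 8;
2*(n - 1) + 4 - 2*1   % ans = 16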

Stack the transposed convolution and ReLU layers. An input to this set of layers is upsampled by a factor of 4.

upsamplingLayers = [
    transposedConvUpsample2x
    relu
    transposedConvUpsample2x
    relu
    ]
upsamplingLayers = 
  4x1 Layer array with layers:

     1   ''   Transposed Convolution   32 4x4 transposed convolutions with stride [2  2] and output cropping [1  1]
     2   ''   ReLU                     ReLU
     3   ''   Transposed Convolution   32 4x4 transposed convolutions with stride [2  2] and output cropping [1  1]
     4   ''   ReLU                     ReLU

Create a Pixel Classification Layer

The final set of layers is responsible for making pixel classifications. These final layers process an input that has the same spatial dimensions (height and width) as the input image. However, the number of channels (the third dimension) is larger, equal to the number of filters in the last transposed convolution layer. This third dimension must be squeezed down to the number of classes you want to segment. You can do this using a 1-by-1 convolution layer whose number of filters equals the number of classes, for example, 3.

Create a convolution layer to combine the third dimension of the input feature maps down to the number of classes.

numClasses = 3;
conv1x1 = convolution2dLayer(1,numClasses);
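
Conceptually, a 1-by-1 convolution acts as a per-pixel linear map across channels. The following sketch illustrates the idea with plain matrix arithmetic; the names featureMap, W, and b are hypothetical, and this is not how the layer is implemented internally:

% Map 32 channels down to 3 per-pixel class scores.
featureMap = rand(32,32,32);               % H-by-W-by-32 feature maps
W = rand(32,3); b = rand(1,3);             % 1-by-1 convolution weights and bias
scores = reshape(featureMap,[],32)*W + b;  % (H*W)-by-3 per-pixel scores
scores = reshape(scores,32,32,3);          % back to H-by-W-by-numClasses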

Following this 1-by-1 convolution layer are the softmax and pixel classification layers. These two layers combine to predict the categorical label for each image pixel.

finalLayers = [
    conv1x1
    softmaxLayer()
    pixelClassificationLayer()
    ]
finalLayers = 
  3x1 Layer array with layers:

     1   ''   Convolution                  3 1x1 convolutions with stride [1  1] and padding [0  0  0  0]
     2   ''   Softmax                      softmax
     3   ''   Pixel Classification Layer   Cross-entropy loss 
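
If the classes have imbalanced pixel frequencies, pixelClassificationLayer also accepts 'Classes' and 'ClassWeights' name-value arguments. The class names and weights below are hypothetical:

classWeights = [0.2 0.3 0.5];
pxLayer = pixelClassificationLayer('Classes',["c1" "c2" "c3"],'ClassWeights',classWeights);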

Stack All Layers

Stack all the layers to complete the semantic segmentation network.

net = [
    imgLayer    
    downsamplingLayers
    upsamplingLayers
    finalLayers
    ]
net = 
  14x1 Layer array with layers:

     1   ''   Image Input                  32x32x3 images with 'zerocenter' normalization
     2   ''   Convolution                  32 3x3 convolutions with stride [1  1] and padding [1  1  1  1]
     3   ''   ReLU                         ReLU
     4   ''   Max Pooling                  2x2 max pooling with stride [2  2] and padding [0  0  0  0]
     5   ''   Convolution                  32 3x3 convolutions with stride [1  1] and padding [1  1  1  1]
     6   ''   ReLU                         ReLU
     7   ''   Max Pooling                  2x2 max pooling with stride [2  2] and padding [0  0  0  0]
     8   ''   Transposed Convolution       32 4x4 transposed convolutions with stride [2  2] and output cropping [1  1]
     9   ''   ReLU                         ReLU
    10   ''   Transposed Convolution       32 4x4 transposed convolutions with stride [2  2] and output cropping [1  1]
    11   ''   ReLU                         ReLU
    12   ''   Convolution                  3 1x1 convolutions with stride [1  1] and padding [0  0  0  0]
    13   ''   Softmax                      softmax
    14   ''   Pixel Classification Layer   Cross-entropy loss 

This network is ready to be trained using trainNetwork from Deep Learning Toolbox™.
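
As a minimal training sketch, assuming you have ground truth pixel labels on disk; the folder names and class names below are hypothetical placeholders:

% Pair images with their pixel labels and train the network.
imageDir = fullfile('data','images');   % hypothetical image folder
labelDir = fullfile('data','labels');   % hypothetical pixel label folder
classNames = ["class1" "class2" "class3"];
labelIDs = [1 2 3];
imds = imageDatastore(imageDir);
pxds = pixelLabelDatastore(labelDir,classNames,labelIDs);
ds = pixelLabelImageDatastore(imds,pxds);
opts = trainingOptions('sgdm','InitialLearnRate',1e-3,'MaxEpochs',20);
trainedNet = trainNetwork(ds,net,opts);

You can also inspect the layer-by-layer output sizes with analyzeNetwork(net) before training.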