Semantic Segmentation of Multispectral Images Using Deep Learning

This example shows how to train a U-Net convolutional neural network to perform semantic segmentation of a multispectral image with seven channels: three color channels, three near-infrared channels, and a mask.

The example shows how to train a U-Net network and also provides a pretrained U-Net network. If you choose to train the U-Net network, use of a CUDA-capable NVIDIA™ GPU with compute capability 3.0 or higher is highly recommended (requires Parallel Computing Toolbox™).

Introduction

Semantic segmentation involves labeling each pixel in an image with a class. One application of semantic segmentation is tracking deforestation, the loss of forest cover over time. Environmental agencies track deforestation to assess and quantify the environmental and ecological health of a region.

Deep-learning-based semantic segmentation can yield a precise measurement of vegetation cover from high-resolution aerial photographs. One challenge is differentiating classes that have similar visual characteristics, such as deciding whether a green pixel belongs to grass, shrubbery, or a tree. To increase classification accuracy, some data sets contain multispectral images that provide additional information about each pixel. For example, the Hamlin Beach State Park data set supplements the color images with near-infrared channels that provide a clearer separation of the classes.

This example shows how to use deep-learning-based semantic segmentation techniques to calculate the percentage vegetation cover in a region from a set of multispectral images.

Download Data

This example uses a high-resolution multispectral data set to train the network [1]. The image set was captured using a drone over the Hamlin Beach State Park, NY. The data contains labeled training, validation, and test sets, with 18 object class labels.

Download the MAT-file version of the data set. You can use the helper function, downloadHamlinBeachMSIData, to download the data. The size of the data file is ~3.0 GB.

imageDir = tempdir;
url = 'http://www.cis.rit.edu/~rmk6217/rit18_data.mat';
downloadHamlinBeachMSIData(url,imageDir);
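
The downloadHamlinBeachMSIData helper function is attached to the example as a supporting file. If you want a sense of the logic such a helper performs, a minimal, hypothetical sketch might look like the following. The folder layout matches the rit18_data subfolder that the rest of the example expects; the shipped helper can differ in its details.

function downloadHamlinBeachMSIData(url,destination)
% Hypothetical sketch: download the RIT-18 MAT file into a 'rit18_data'
% subfolder of the destination folder, skipping the download if the file
% is already present.
    dataFolder = fullfile(destination,'rit18_data');
    if ~exist(fullfile(dataFolder,'rit18_data.mat'),'file')
        if ~exist(dataFolder,'dir')
            mkdir(dataFolder);
        end
        disp('Downloading Hamlin Beach data set (~3.0 GB). This can take several minutes...');
        websave(fullfile(dataFolder,'rit18_data.mat'),url);
    end
end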

In addition, download a pretrained version of U-Net for this data set. The pretrained model enables you to run the entire example without having to wait for training to complete.

trainedUnet_url = 'https://www.mathworks.com/supportfiles/vision/data/multispectralUnet.mat';
downloadTrainedUnet(trainedUnet_url,imageDir);

Inspect Training Data

Load the data set into the MATLAB® workspace.

load(fullfile(imageDir,'rit18_data','rit18_data.mat'));

Examine the structure of the data.

whos train_data val_data test_data
  Name            Size                         Bytes  Class     Attributes

  test_data       7x12446x7654            1333663576  uint16              
  train_data      7x9393x5642              741934284  uint16              
  val_data        7x8833x6918              855493716  uint16              

The multispectral image data is arranged as numChannels-by-width-by-height arrays. However, in MATLAB, multichannel images are arranged as width-by-height-by-numChannels arrays. To reshape the data so that the channels are in the third dimension, use the helper function, switchChannelsToThirdPlane.

train_data = switchChannelsToThirdPlane(train_data);
val_data   = switchChannelsToThirdPlane(val_data);
test_data  = switchChannelsToThirdPlane(test_data);
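
The switchChannelsToThirdPlane helper is also attached to the example. Because the operation is a simple dimension permutation, a minimal sketch of such a helper could look like this (hypothetical implementation; the shipped helper may differ).

function out = switchChannelsToThirdPlane(in)
% Hypothetical sketch: move the channel dimension from the first position
% to the third, converting a numChannels-by-width-by-height array into a
% width-by-height-by-numChannels array.
    out = permute(in,[2 3 1]);
end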

Confirm that the data has the correct structure.

whos train_data val_data test_data
  Name                Size                     Bytes  Class     Attributes

  test_data       12446x7654x7            1333663576  uint16              
  train_data       9393x5642x7             741934284  uint16              
  val_data         8833x6918x7             855493716  uint16              

The RGB color channels are the 4th, 5th, and 6th image channels. Display the color component of the training, validation, and test images as a montage. To make the images appear brighter on the screen, equalize their histograms by using the histeq function.

figure
montage(...
    {histeq(train_data(:,:,4:6)), ...
    histeq(val_data(:,:,4:6)), ...
    histeq(test_data(:,:,4:6))}, ...
    'BorderSize',10,'BackgroundColor','white')
title('RGB Component of Training Image (Left), Validation Image (Center), and Test Image (Right)')

Display the first three histogram-equalized channels of the training data as a montage. These channels correspond to the near-infrared bands and highlight different components of the image based on their heat signatures. For example, the trees near the center of the 2nd channel image show more detail than the trees in the other two channels.

figure
montage(...
    {histeq(train_data(:,:,1)), ...
    histeq(train_data(:,:,2)), ...
    histeq(train_data(:,:,3))}, ...
    'BorderSize',10,'BackgroundColor','white')
title('IR Channels 1 (Left), 2 (Center), and 3 (Right) of Training Image')

Channel 7 is a mask that indicates the valid segmentation region. Display the mask for the training, validation, and test images.

figure
montage(...
    {train_data(:,:,7), ...
    val_data(:,:,7), ...
    test_data(:,:,7)}, ...
    'BorderSize',10,'BackgroundColor','white')
title('Mask of Training Image (Left), Validation Image (Center), and Test Image (Right)')

The labeled images contain the ground truth data for the segmentation, with each pixel assigned to one of the 18 classes. Get a list of the classes with their corresponding IDs.

disp(classes)
0. Other Class/Image Border      
1. Road Markings                 
2. Tree                          
3. Building                      
4. Vehicle (Car, Truck, or Bus)  
5. Person                        
6. Lifeguard Chair               
7. Picnic Table                  
8. Black Wood Panel              
9. White Wood Panel              
10. Orange Landing Pad           
11. Water Buoy                   
12. Rocks                        
13. Other Vegetation             
14. Grass                        
15. Sand                         
16. Water (Lake)                 
17. Water (Pond)                 
18. Asphalt (Parking Lot/Walkway)

Create a vector of class names.

classNames = [ "RoadMarkings","Tree","Building","Vehicle","Person", ...
               "LifeguardChair","PicnicTable","BlackWoodPanel",...
               "WhiteWoodPanel","OrangeLandingPad","Buoy","Rocks",...
               "LowLevelVegetation","Grass_Lawn","Sand_Beach",...
               "Water_Lake","Water_Pond","Asphalt"]; 

Overlay the labels on the histogram-equalized RGB training image. Add a colorbar to the image.

cmap = jet(numel(classNames));
B = labeloverlay(histeq(train_data(:,:,4:6)),train_labels,'Transparency',0.8,'Colormap',cmap);

figure
title('Training Labels')
imshow(B)
Warning: Image is too big to fit on screen; displaying at 8%
N = numel(classNames);
ticks = 1/(N*2):1/N:1;
colorbar('TickLabels',cellstr(classNames),'Ticks',ticks,'TickLength',0,'TickLabelInterpreter','none');
colormap(cmap)

Save the training data as a MAT file and the training labels as a PNG file.

save('train_data.mat','train_data');
imwrite(train_labels,'train_labels.png');

Define Mini-Batch Datastore for Training

A mini-batch datastore is used to feed the training data to the network. This example defines a custom implementation of a mini-batch datastore, called PixelLabelImagePatchDatastore, as a convenient way to generate augmented image and label patches for training the semantic segmentation network.

Begin by storing the training images from 'train_data.mat' in an imageDatastore. Because the MAT file format is a nonstandard image format, you must use a MAT file reader to enable reading the image data. You can use the helper MAT file reader, matReader, that extracts the first six channels from the training data and omits the last channel containing the mask.

imds = imageDatastore('train_data.mat','FileExtensions','.mat','ReadFcn',@matReader);
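
The matReader helper is attached to the example. A minimal, hypothetical sketch of a reader with this behavior is shown below; it assumes the MAT file contains a single image variable, which holds for the train_data.mat file saved earlier.

function data = matReader(filename)
% Hypothetical sketch: load the single array stored in the MAT file and
% return only the first six channels, omitting the mask in channel 7.
    s = load(filename);
    fields = fieldnames(s);
    data = s.(fields{1});
    data = data(:,:,1:6);
end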

Create a pixelLabelDatastore to store the label patches containing the 18 labeled regions.

pixelLabelIds = 1:18;
pxds = pixelLabelDatastore('train_labels.png',classNames,pixelLabelIds);

Create the PixelLabelImagePatchDatastore from the image datastore and the pixel label datastore. The PixelLabelImagePatchDatastore extracts pairs of patches from the training image and labels. It also augments the training data to increase the size of the training data set and to make the training more robust. To mimic aerial landscape data collected by a drone, the PixelLabelImagePatchDatastore augments the pairs of patches by adding random rotation, reflection, and scaling.

Each mini-batch contains 16 patches of size 256-by-256 pixels. One thousand mini-batches are extracted during each epoch. All patches are extracted from random positions in the image.
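
The PixelLabelImagePatchDatastore class is provided with the example. To illustrate the kind of patch extraction and augmentation it performs, the following simplified sketch extracts one random 256-by-256 patch pair from the training image and labels and applies the same random 90-degree rotation and reflection to both. This is for illustration only; the actual datastore also applies arbitrary rotation and scaling.

% Simplified sketch of one augmented patch pair (illustration only).
patchSize = 256;
r = randi(size(train_data,1) - patchSize + 1);
c = randi(size(train_data,2) - patchSize + 1);
imPatch    = train_data(r:r+patchSize-1, c:c+patchSize-1, 1:6);
labelPatch = train_labels(r:r+patchSize-1, c:c+patchSize-1);

% Apply the same random rotation and reflection to the image and label patches.
k = randi(4) - 1;                  % number of 90-degree rotations
imPatch    = rot90(imPatch,k);
labelPatch = rot90(labelPatch,k);
if rand > 0.5
    imPatch    = flip(imPatch,2);  % horizontal reflection
    labelPatch = flip(labelPatch,2);
end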

ds = PixelLabelImagePatchDatastore(imds,pxds,'PatchSize',256,...
                                             'MiniBatchSize',16,...
                                             'BatchesPerImage',1000)     
ds = 
  PixelLabelImagePatchDatastore with properties:

      MiniBatchSize: 16
    NumObservations: 16000

To understand what gets passed into the network during training, read the first mini-batch from the PixelLabelImagePatchDatastore.

sampleBatch = read(ds)
sampleBatch=16×2 table
        imPatches             labelPatches     
    __________________    _____________________

    [256×256×6 uint16]    [256×256 categorical]
    [256×256×6 uint16]    [256×256 categorical]
    [256×256×6 uint16]    [256×256 categorical]
    [256×256×6 uint16]    [256×256 categorical]
    [256×256×6 uint16]    [256×256 categorical]
    [256×256×6 uint16]    [256×256 categorical]
    [256×256×6 uint16]    [256×256 categorical]
    [256×256×6 uint16]    [256×256 categorical]
    [256×256×6 uint16]    [256×256 categorical]
    [256×256×6 uint16]    [256×256 categorical]
    [256×256×6 uint16]    [256×256 categorical]
    [256×256×6 uint16]    [256×256 categorical]
    [256×256×6 uint16]    [256×256 categorical]
    [256×256×6 uint16]    [256×256 categorical]
    [256×256×6 uint16]    [256×256 categorical]
    [256×256×6 uint16]    [256×256 categorical]

Reset the datastore to its original state after reading the sample batch.

reset(ds)

Create U-Net Network Layers

This example uses a variation of the U-Net network. In U-Net, the initial series of convolutional layers are interspersed with max pooling layers, successively decreasing the resolution of the input image. These layers are followed by a series of convolutional layers interspersed with upsampling operators, successively increasing the resolution of the input image [2]. The name U-Net comes from the fact that the network can be drawn with a symmetric shape like the letter U.

This example modifies the U-Net to use zero-padding in the convolutions, so that the input and the output to the convolutions have the same size. Use the helper function, createUnet, to create a U-Net with a few preselected hyperparameters.

inputTileSize = [256,256,6];
lgraph = createUnet(inputTileSize);
disp(lgraph.Layers)
  58x1 Layer array with layers:

     1   'ImageInputLayer'                        Image Input                  256x256x6 images with 'zerocenter' normalization
     2   'Encoder-Section-1-Conv-1'               Convolution                  64 3x3x6 convolutions with stride [1  1], dilation factor [1  1] and padding [1  1  1  1]
     3   'Encoder-Section-1-ReLU-1'               ReLU                         ReLU
     4   'Encoder-Section-1-Conv-2'               Convolution                  64 3x3x64 convolutions with stride [1  1], dilation factor [1  1] and padding [1  1  1  1]
     5   'Encoder-Section-1-ReLU-2'               ReLU                         ReLU
     6   'Encoder-Section-1-MaxPool'              Max Pooling                  2x2 max pooling with stride [2  2] and padding [0  0  0  0]
     7   'Encoder-Section-2-Conv-1'               Convolution                  128 3x3x64 convolutions with stride [1  1], dilation factor [1  1] and padding [1  1  1  1]
     8   'Encoder-Section-2-ReLU-1'               ReLU                         ReLU
     9   'Encoder-Section-2-Conv-2'               Convolution                  128 3x3x128 convolutions with stride [1  1], dilation factor [1  1] and padding [1  1  1  1]
    10   'Encoder-Section-2-ReLU-2'               ReLU                         ReLU
    11   'Encoder-Section-2-MaxPool'              Max Pooling                  2x2 max pooling with stride [2  2] and padding [0  0  0  0]
    12   'Encoder-Section-3-Conv-1'               Convolution                  256 3x3x128 convolutions with stride [1  1], dilation factor [1  1] and padding [1  1  1  1]
    13   'Encoder-Section-3-ReLU-1'               ReLU                         ReLU
    14   'Encoder-Section-3-Conv-2'               Convolution                  256 3x3x256 convolutions with stride [1  1], dilation factor [1  1] and padding [1  1  1  1]
    15   'Encoder-Section-3-ReLU-2'               ReLU                         ReLU
    16   'Encoder-Section-3-MaxPool'              Max Pooling                  2x2 max pooling with stride [2  2] and padding [0  0  0  0]
    17   'Encoder-Section-4-Conv-1'               Convolution                  512 3x3x256 convolutions with stride [1  1], dilation factor [1  1] and padding [1  1  1  1]
    18   'Encoder-Section-4-ReLU-1'               ReLU                         ReLU
    19   'Encoder-Section-4-Conv-2'               Convolution                  512 3x3x512 convolutions with stride [1  1], dilation factor [1  1] and padding [1  1  1  1]
    20   'Encoder-Section-4-ReLU-2'               ReLU                         ReLU
    21   'Encoder-Section-4-DropOut'              Dropout                      50% dropout
    22   'Encoder-Section-4-MaxPool'              Max Pooling                  2x2 max pooling with stride [2  2] and padding [0  0  0  0]
    23   'Mid-Conv-1'                             Convolution                  1024 3x3x512 convolutions with stride [1  1], dilation factor [1  1] and padding [1  1  1  1]
    24   'Mid-ReLU-1'                             ReLU                         ReLU
    25   'Mid-Conv-2'                             Convolution                  1024 3x3x1024 convolutions with stride [1  1], dilation factor [1  1] and padding [1  1  1  1]
    26   'Mid-ReLU-2'                             ReLU                         ReLU
    27   'Mid-DropOut'                            Dropout                      50% dropout
    28   'Decoder-Section-1-UpConv'               Transposed Convolution       512 2x2x1024 transposed convolutions with stride [2  2] and output cropping [0  0]
    29   'Decoder-Section-1-UpReLU'               ReLU                         ReLU
    30   'Decoder-Section-1-DepthConcatenation'   Depth concatenation          Depth concatenation of 2 inputs
    31   'Decoder-Section-1-Conv-1'               Convolution                  512 3x3x1024 convolutions with stride [1  1], dilation factor [1  1] and padding [1  1  1  1]
    32   'Decoder-Section-1-ReLU-1'               ReLU                         ReLU
    33   'Decoder-Section-1-Conv-2'               Convolution                  512 3x3x512 convolutions with stride [1  1], dilation factor [1  1] and padding [1  1  1  1]
    34   'Decoder-Section-1-ReLU-2'               ReLU                         ReLU
    35   'Decoder-Section-2-UpConv'               Transposed Convolution       256 2x2x512 transposed convolutions with stride [2  2] and output cropping [0  0]
    36   'Decoder-Section-2-UpReLU'               ReLU                         ReLU
    37   'Decoder-Section-2-DepthConcatenation'   Depth concatenation          Depth concatenation of 2 inputs
    38   'Decoder-Section-2-Conv-1'               Convolution                  256 3x3x512 convolutions with stride [1  1], dilation factor [1  1] and padding [1  1  1  1]
    39   'Decoder-Section-2-ReLU-1'               ReLU                         ReLU
    40   'Decoder-Section-2-Conv-2'               Convolution                  256 3x3x256 convolutions with stride [1  1], dilation factor [1  1] and padding [1  1  1  1]
    41   'Decoder-Section-2-ReLU-2'               ReLU                         ReLU
    42   'Decoder-Section-3-UpConv'               Transposed Convolution       128 2x2x256 transposed convolutions with stride [2  2] and output cropping [0  0]
    43   'Decoder-Section-3-UpReLU'               ReLU                         ReLU
    44   'Decoder-Section-3-DepthConcatenation'   Depth concatenation          Depth concatenation of 2 inputs
    45   'Decoder-Section-3-Conv-1'               Convolution                  128 3x3x256 convolutions with stride [1  1], dilation factor [1  1] and padding [1  1  1  1]
    46   'Decoder-Section-3-ReLU-1'               ReLU                         ReLU
    47   'Decoder-Section-3-Conv-2'               Convolution                  128 3x3x128 convolutions with stride [1  1], dilation factor [1  1] and padding [1  1  1  1]
    48   'Decoder-Section-3-ReLU-2'               ReLU                         ReLU
    49   'Decoder-Section-4-UpConv'               Transposed Convolution       64 2x2x128 transposed convolutions with stride [2  2] and output cropping [0  0]
    50   'Decoder-Section-4-UpReLU'               ReLU                         ReLU
    51   'Decoder-Section-4-DepthConcatenation'   Depth concatenation          Depth concatenation of 2 inputs
    52   'Decoder-Section-4-Conv-1'               Convolution                  64 3x3x128 convolutions with stride [1  1], dilation factor [1  1] and padding [1  1  1  1]
    53   'Decoder-Section-4-ReLU-1'               ReLU                         ReLU
    54   'Decoder-Section-4-Conv-2'               Convolution                  64 3x3x64 convolutions with stride [1  1], dilation factor [1  1] and padding [1  1  1  1]
    55   'Decoder-Section-4-ReLU-2'               ReLU                         ReLU
    56   'Final-ConvolutionLayer'                 Convolution                  18 1x1x64 convolutions with stride [1  1], dilation factor [1  1] and padding [0  0  0  0]
    57   'Softmax-Layer'                          Softmax                      softmax
    58   'Segmentation-Layer'                     Pixel Classification Layer   Cross-entropy loss 
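
The createUnet helper is attached to the example. To show how the zero-padded ("same"-size) convolutions in the listing above can be constructed, here is a hypothetical sketch of the first encoder section using Deep Learning Toolbox layer functions; createUnet assembles the full 58-layer graph in a similar fashion.

% Hypothetical sketch of one encoder section with zero-padded convolutions.
encoderSection1 = [
    convolution2dLayer(3,64,'Padding',1,'Name','Encoder-Section-1-Conv-1')
    reluLayer('Name','Encoder-Section-1-ReLU-1')
    convolution2dLayer(3,64,'Padding',1,'Name','Encoder-Section-1-Conv-2')
    reluLayer('Name','Encoder-Section-1-ReLU-2')
    maxPooling2dLayer(2,'Stride',2,'Name','Encoder-Section-1-MaxPool')];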

Select Training Options

Train the network using stochastic gradient descent with momentum (SGDM) optimization. Specify the hyperparameter settings for SGDM by using the trainingOptions function.

Training a deep network is time-consuming. Accelerate the training by specifying a high learning rate. However, this can cause the gradients of the network to explode or grow uncontrollably, preventing the network from training successfully. To keep the gradients in a meaningful range, enable gradient clipping by setting 'GradientThreshold' to a value of 0.05, and specify 'GradientThresholdMethod' to use the L2-norm of the gradients.

initialLearningRate = 0.05;
maxEpochs = 150;
minibatchSize = 16;
l2reg = 0.0001;

options = trainingOptions('sgdm',...
    'InitialLearnRate', initialLearningRate, ...
    'Momentum',0.9,...
    'L2Regularization',l2reg,...
    'MaxEpochs',maxEpochs,...
    'MiniBatchSize',minibatchSize,...
    'VerboseFrequency',20,...
    'LearnRateSchedule','piecewise',...    
    'Shuffle','every-epoch',...
    'Plots','training-progress',...
    'GradientThresholdMethod','l2norm',...
    'GradientThreshold',0.05);

Train the Network

After configuring the training options and the mini-batch datastore, train the U-Net network by using the trainNetwork function. To train the network, set the doTraining parameter in the following code to true. A CUDA-capable NVIDIA™ GPU with compute capability 3.0 or higher is highly recommended for training.

If you keep the doTraining parameter in the following code as false, then the example returns a pretrained U-Net network.

Note: Training takes about 20 hours on an NVIDIA™ Titan X and can take even longer depending on your GPU hardware.

doTraining = false; 
if doTraining     
    [net,info] = trainNetwork(ds,lgraph,options); 
else 
    load(fullfile(imageDir,'trainedUnet','multispectralUnet.mat'));
end

You can now use the U-Net to semantically segment the multispectral image.

Predict Results on Test Data

To perform the forward pass on the trained network, use the helper function, segmentImage, with the validation data set. segmentImage performs segmentation on image patches using the semanticseg function.

predictPatchSize = [1024 1024];
segmentedImage = segmentImage(val_data,net,predictPatchSize);
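
The segmentImage helper is attached to the example. For reference, a simplified, hypothetical sketch of patch-based segmentation might look like the following; edge patches are shifted inward so that every patch passed to semanticseg has the full patch size. The shipped helper can differ in its details.

function out = segmentImage(im,net,patchSize)
% Hypothetical sketch: segment the image one patch at a time with
% semanticseg and assemble the uint8 class IDs into one label matrix.
    [h,w,~] = size(im);
    out = zeros([h,w],'uint8');
    for r = 1:patchSize(1):h
        for c = 1:patchSize(2):w
            rs = min(r, h-patchSize(1)+1);
            cs = min(c, w-patchSize(2)+1);
            patch = im(rs:rs+patchSize(1)-1, cs:cs+patchSize(2)-1, 1:6);
            out(rs:rs+patchSize(1)-1, cs:cs+patchSize(2)-1) = ...
                semanticseg(patch,net,'OutputType','uint8');
        end
    end
end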

To extract only the valid portion of the segmentation, multiply the segmented image by the mask channel of the validation data.

segmentedImage = uint8(val_data(:,:,7)~=0) .* segmentedImage;

figure
imshow(segmentedImage,[])
Warning: Image is too big to fit on screen; displaying at 8%
title('Segmented Image')

The output of semantic segmentation is noisy. Perform postprocessing to remove noise and stray pixels. Use the medfilt2 function to remove salt-and-pepper noise from the segmentation. Visualize the segmented image with the noise removed.

segmentedImage = medfilt2(segmentedImage,[7,7]);
imshow(segmentedImage,[]);
Warning: Image is too big to fit on screen; displaying at 8%
title('Segmented Image with Noise Removed')

Overlay the segmented image on the histogram-equalized RGB validation image.

B = labeloverlay(histeq(val_data(:,:,4:6)),segmentedImage,'Transparency',0.8,'Colormap',cmap);

figure
imshow(B)
Warning: Image is too big to fit on screen; displaying at 8%
title('Labeled Validation Image')
colorbar('TickLabels',cellstr(classNames),'Ticks',ticks,'TickLength',0,'TickLabelInterpreter','none');
colormap(cmap)

Save the segmented image and ground truth labels as PNG files. You will use these files to compute the accuracy metrics.

imwrite(segmentedImage,'results.png');
imwrite(val_labels,'gtruth.png');

Quantify Segmentation Accuracy

Create a pixelLabelDatastore for the segmentation results and the ground truth labels.

pxdsResults = pixelLabelDatastore('results.png',classNames,pixelLabelIds);
pxdsTruth = pixelLabelDatastore('gtruth.png',classNames,pixelLabelIds);

Measure the global accuracy of the semantic segmentation by using the evaluateSemanticSegmentation function.

ssm = evaluateSemanticSegmentation(pxdsResults,pxdsTruth,'Metrics','global-accuracy');
Evaluating semantic segmentation results
----------------------------------------
* Selected metrics: global accuracy.
* Processing 1 images...
[==================================================] 100%
Elapsed time: 00:00:25
Estimated time remaining: 00:00:00
* Finalizing... Done.
* Data set metrics:

    GlobalAccuracy
    ______________

       0.90698    

The global accuracy score indicates that just over 90% of the pixels are classified correctly.

Calculate Extent of Vegetation Cover

The final goal of this example is to calculate the extent of vegetation cover in the multispectral image.

Find the number of pixels labeled vegetation. The label IDs 2 ("Tree"), 13 ("LowLevelVegetation"), and 14 ("Grass_Lawn") are the vegetation classes. Also find the total number of valid pixels by counting the nonzero pixels of the segmented image, which lie inside the ROI of the mask.

vegetationClassIds = uint8([2,13,14]);
vegetationPixels = ismember(segmentedImage(:),vegetationClassIds);
validPixels = (segmentedImage~=0);

numVegetationPixels = sum(vegetationPixels(:));
numValidPixels = sum(validPixels(:));

Calculate the percentage of vegetation cover by dividing the number of vegetation pixels by the number of valid pixels.

percentVegetationCover = (numVegetationPixels/numValidPixels)*100;
fprintf('The percentage of vegetation cover is %3.2f%%.',percentVegetationCover);
The percentage of vegetation cover is 51.72%.

Summary

This example shows how to create and train a U-Net network for semantic segmentation of a seven-channel multispectral image. These were the steps to train the network:

  • Download and reshape the training data.

  • Define a custom mini-batch datastore, called a PixelLabelImagePatchDatastore, from the image datastore and the pixel label datastore. The PixelLabelImagePatchDatastore extracts pairs of augmented image and label patches from the training image and its ground truth labels. This datastore was used to feed training data to the network.

  • Define the layers of the U-Net network.

  • Specify training options.

  • Train the network using the trainNetwork function.

After training the U-Net network or loading a pretrained U-Net network, the example performs semantic segmentation of the validation data and measures the segmentation accuracy.

References

[1] Kemker, R., C. Salvaggio, and C. Kanan. "High-Resolution Multispectral Dataset for Semantic Segmentation." CoRR, abs/1703.01918. 2017.

[2] Ronneberger, O., P. Fischer, and T. Brox. "U-Net: Convolutional Networks for Biomedical Image Segmentation." CoRR, abs/1505.04597. 2015.
