Instance Segmentation Using Mask R-CNN Deep Learning

This example shows how to segment individual instances of people and cars using a multiclass Mask region-based convolutional neural network (R-CNN).

Instance segmentation is a computer vision technique in which you detect and localize objects while simultaneously generating a segmentation map for each of the detected instances.

This example first shows how to perform instance segmentation using a pretrained Mask R-CNN that detects two classes. Then, you can optionally download a data set and train a multiclass Mask R-CNN.

Perform Instance Segmentation Using Pretrained Mask R-CNN

Download the pretrained Mask R-CNN.

dataFolder = fullfile(tempdir,"coco");
trainedMaskRCNN_url = 'https://www.mathworks.com/supportfiles/vision/data/maskrcnn_pretrained_person_car.mat';
helper.downloadTrainedMaskRCNN(trainedMaskRCNN_url,dataFolder);
pretrained = load(fullfile(dataFolder,'maskrcnn_pretrained_person_car.mat'));
net = pretrained.net;

Extract the mask segmentation subnetwork using the extractMaskNetwork helper function, which is attached to this example as a supporting file in the folder helper.

maskSubnet = helper.extractMaskNetwork(net);

The network is trained to detect people and cars. Specify the class names, including the 'background' class, as well as the number of classes excluding the 'background' class.

classNames = {'person','car','background'};
numClasses = length(classNames)-1;

Read a test image that contains objects of the target classes.

imTest = imread('visionteam.jpg');

Define the target size of the image for inference.

targetSizeTest = [700 700 3];

Resize the image, maintaining the aspect ratio and scaling the larger dimension to the target size.

if size(imTest,1) > size(imTest,2)
   imTest = imresize(imTest,[targetSizeTest(1) NaN]); 
else
   imTest = imresize(imTest,[NaN targetSizeTest(2)]);     
end

Specify network configuration parameters using the createMaskRCNNConfig helper function, which is attached to this example as a supporting file.

imageSizeTrain = [800 800 3];
params = createMaskRCNNConfig(imageSizeTrain,numClasses,classNames);

Detect the objects and their masks using the helper function detectMaskRCNN, which is attached to this example as a supporting file.

[boxes,scores,labels,masks] = detectMaskRCNN(net,maskSubnet,imTest,params);

Visualize the predictions by overlaying the detected masks on the image using the insertObjectMask function.

if(isempty(masks))
    overlayedImage = imTest;
else
    overlayedImage = insertObjectMask(imTest,masks);
end
imshow(overlayedImage)

Show the bounding boxes and labels on the objects.

showShape("rectangle",gather(boxes),"Label",labels,"LineColor",'r')

Download Training Data

The COCO 2014 train images data set [2] consists of 82,783 images. The corresponding annotation data contains the object instance annotations used in this example, as well as at least five captions for each image.

Create directories to store the COCO training images and annotation data.

imageFolder = fullfile(dataFolder,"images");
captionsFolder = fullfile(dataFolder,"annotations");
if ~exist(imageFolder,'dir')
    mkdir(imageFolder)
    mkdir(captionsFolder)
end

Download the COCO 2014 training images and captions from https://cocodataset.org/#download by clicking the "2014 Train images" and "2014 Train/Val annotations" links, respectively. Extract the image files into the folder specified by imageFolder. Extract the annotation files into the folder specified by captionsFolder.
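If you prefer to script the download, the commands below show one possible approach (an assumption, not part of the original workflow). Verify the archive URLs on the COCO download page before running; the image archive is roughly 13 GB, so allow sufficient time and disk space.

imagesURL      = "http://images.cocodataset.org/zips/train2014.zip";
annotationsURL = "http://images.cocodataset.org/annotations/annotations_trainval2014.zip";
websave(fullfile(dataFolder,"train2014.zip"),imagesURL);
websave(fullfile(dataFolder,"annotations_trainval2014.zip"),annotationsURL);
unzip(fullfile(dataFolder,"annotations_trainval2014.zip"),dataFolder);   % extracts into the annotations folder
unzip(fullfile(dataFolder,"train2014.zip"),imageFolder);                 % extracts into a train2014 subfolder
% The image archive extracts into its own train2014 subfolder. Move the image
% files into imageFolder (or point imageFolder at the subfolder) so that the
% folder layout matches what this example expects.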

annotationFile = fullfile(captionsFolder,"instances_train2014.json");
str = fileread(annotationFile);

Read and Preprocess Training Data

To train a Mask R-CNN, you need this data.

  • RGB images that serve as input to the network, specified as H-by-W-by-3 numeric arrays.

  • Bounding boxes for objects in the RGB images, specified as NumObjects-by-4 matrices, with rows in the format [x y w h].

  • Instance labels, specified as NumObjects-by-1 string vectors.

  • Instance masks. Each mask is the segmentation of one instance in the image. The COCO data set specifies object instances using polygon coordinates formatted as NumObjects-by-2 cell arrays. Each row of the array contains the (x,y) coordinates of a polygon along the boundary of one instance in the image. However, the Mask R-CNN in this example requires binary masks specified as logical arrays of size H-by-W-by-NumObjects.

Format COCO Annotation Data as MAT Files

The COCO API for MATLAB enables you to access the annotation data. Download the COCO API for MATLAB from https://github.com/cocodataset/cocoapi by clicking the "Code" button and selecting "Download ZIP." Extract the cocoapi-master directory and its contents to the folder specified by dataFolder. If needed for your operating system, compile the gason parser by following the instructions in the gason.m file within the MatlabAPI subdirectory.

Specify the directory location for the COCO API for MATLAB and add the directory to the path.

cocoAPIDir = fullfile(dataFolder,"cocoapi-master","MatlabAPI");
addpath(cocoAPIDir);

Specify the folder in which to store the MAT files.

unpackAnnotationDir = fullfile(dataFolder,"annotations_unpacked","matFiles");
if ~exist(unpackAnnotationDir,'dir')
    mkdir(unpackAnnotationDir)
end

Extract the COCO annotations to MAT files using the unpackAnnotations helper function, which is attached to this example as a supporting file in the folder helper. Each MAT file corresponds to a single training image and contains the file name, bounding boxes, instance labels, and instance masks for each training image. The function converts object instances specified as polygon coordinates to binary masks using the poly2mask function.

trainClassNames = {'person','car'};
helper.unpackAnnotations(trainClassNames,annotationFile,imageFolder,unpackAnnotationDir);
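As a small illustration of this polygon-to-mask conversion, you can rasterize one polygon with poly2mask. The vertex coordinates below are made up for demonstration; unpackAnnotations applies the same conversion to the polygons stored in the COCO annotations.

% Toy example: convert polygon vertices to a 64-by-64 binary mask
x = [20 60 60 20];           % polygon x-coordinates (columns)
y = [10 10 50 50];           % polygon y-coordinates (rows)
BW = poly2mask(x,y,64,64);   % logical mask with true inside the polygon
figure
imshow(BW)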

Create Datastore

The Mask R-CNN expects input data as a 1-by-4 cell array containing the RGB training image, bounding boxes, instance labels, and instance masks.

Create a file datastore with a custom read function, cocoAnnotationMATReader, that reads the content of the unpacked annotation MAT files, converts grayscale training images to RGB, and returns the data as a 1-by-4 cell array in the required format. The custom read function is attached to this example as a supporting file in the folder helper.

ds = fileDatastore(unpackAnnotationDir, ...
    'ReadFcn',@(x)helper.cocoAnnotationMATReader(x,imageFolder));
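For reference, a read function compatible with this format might look like the sketch below. The variable names stored in the MAT files are assumptions made for illustration; the actual helper.cocoAnnotationMATReader supporting file defines the real format.

function out = annotationMATReaderSketch(matFile,imageFolder)
% Sketch of a Mask R-CNN datastore read function (assumed MAT-file fields).
s = load(matFile);                              % assumed fields: imageName, bbox, label, masks
img = imread(fullfile(imageFolder,s.imageName));
if size(img,3) == 1
    img = repmat(img,[1 1 3]);                  % convert grayscale images to RGB
end
out = {img, s.bbox, s.label, s.masks};          % 1-by-4 cell array expected by the network
end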

Specify the input size of the network.

imageSize = [800 800 3];

Preprocess the training images, bounding boxes, and instance masks to the size expected by the network using the transform function. The transform function processes the data using the operations specified in the preprocessData helper function. The helper function is attached to the example as a supporting file in the folder helper.

The preprocessData helper function performs these operations on the training images, bounding boxes, and instance masks:

  • Resize the RGB images and masks using the imresize function and rescale the bounding boxes using the bboxresize function. The helper function selects a homogeneous scale factor such that the smaller dimension of the image, bounding box, or mask is equal to the target network input size.

  • Crop the RGB images and masks using the imcrop function and crop the bounding boxes using the bboxcrop function. The helper function crops the image, bounding box, or mask such that the larger dimension is equal to the target network input size.

  • Scale the pixel values of the RGB images to the range [0, 1].

dsTrain = transform(ds,@(x)helper.preprocessData(x,imageSize));
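For reference, the sketch below illustrates the resize-and-crop sequence described in the list above. The details are assumptions made for illustration; the actual helper.preprocessData supporting file may differ.

function out = preprocessSketch(data,targetSize)
% Sketch of the resize-and-crop preprocessing for one training sample.
img = data{1};  bbox = data{2};  label = data{3};  mask = data{4};

% Resize so that the smaller image dimension equals the target size.
scale = targetSize(1)/min(size(img,1),size(img,2));
img  = imresize(img,scale);
mask = imresize(mask,scale,"nearest");
bbox = bboxresize(bbox,scale);

% Crop the larger dimension down to the target size.
window = [1 1 targetSize(2) targetSize(1)];             % [x y w h]
img  = img(1:targetSize(1),1:targetSize(2),:);
mask = mask(1:targetSize(1),1:targetSize(2),:);
[bbox,valid] = bboxcrop(bbox,window,"OverlapThreshold",0.5);

% Keep only the labels and masks of boxes that survive the crop.
label = label(valid);
mask  = mask(:,:,valid);

% Pixel-value rescaling to the range [0, 1] can be applied here or deferred.
out = {img, bbox, label, mask};
end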

Preview the data returned by the transformed datastore.

data = preview(dsTrain)
data=1×4 cell array
    {800×800×3 uint8}    {16×4 double}    {16×1 categorical}    {800×800×16 logical}
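As a quick visual check (not part of the original example), you can overlay the previewed masks and bounding boxes on the training image to confirm that the annotations still line up after preprocessing.

% Overlay the preprocessed masks and boxes on the previewed training image
checkImage = insertObjectMask(data{1},data{4});
checkImage = insertShape(checkImage,"rectangle",data{2});
figure
imshow(checkImage)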

Create Mask R-CNN Network Layers

The Mask R-CNN builds upon a Faster R-CNN with a ResNet-101 base network. Get the Faster R-CNN layers using the fasterRCNNLayers function.

netFasterRCNN = fasterRCNNLayers(params.ImageSize,numClasses,params.AnchorBoxes,'resnet101');

Modify the network for Mask R-CNN using the createMaskRCNN helper function. This function is attached to the example as a supporting file. The helper function performs these modifications to the network:

  1. Replace the rpnSoftmaxLayer with a custom RPN softmax layer, defined by the supporting file RPNSoftmax in the folder layer.

  2. Replace the regionProposalLayer with a custom region proposal layer, defined by the supporting file RegionProposal in the folder layer.

  3. Replace the roiMaxPooling2dLayer with an roiAlignLayer.

  4. Add a mask segmentation head for pixel-level segmentation.

netMaskRCNN = createMaskRCNN(netFasterRCNN,numClasses,params);
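For reference, the mask segmentation head in [1] is a small stack of convolution layers followed by a transposed convolution that upsamples the ROI features and a per-class sigmoid output. The layer arrangement below is one possible sketch with assumed filter counts and names; the head that createMaskRCNN actually adds may differ.

% Possible mask head layout (illustrative sketch, not the layers used by createMaskRCNN)
maskHead = [
    convolution2dLayer(3,256,"Padding","same","Name","mask_conv1")
    reluLayer("Name","mask_relu1")
    convolution2dLayer(3,256,"Padding","same","Name","mask_conv2")
    reluLayer("Name","mask_relu2")
    transposedConv2dLayer(2,256,"Stride",2,"Name","mask_upsample")
    reluLayer("Name","mask_relu3")
    convolution2dLayer(1,numClasses,"Name","mask_scores")
    sigmoidLayer("Name","mask_prob")];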

Convert the network to a dlnetwork (Deep Learning Toolbox) object.

dlnet = dlnetwork(netMaskRCNN);

Visualize the network using Deep Network Designer.

deepNetworkDesigner(netMaskRCNN)

Specify Training Options

Specify the options for SGDM optimization. Train the network for 30 epochs.

initialLearnRate = 0.01;
momentum = 0.9;
decay = 0.0001;
velocity = [];
maxEpochs = 30;
miniBatchSize = 2;

Batch Training Data

Create a minibatchqueue (Deep Learning Toolbox) object that manages the mini-batching of observations in a custom training loop. The minibatchqueue object also casts data to a dlarray (Deep Learning Toolbox) object that enables automatic differentiation in deep learning applications.

Define a custom batching function named miniBatchFcn. The images are concatenated along the fourth dimension to get an H-by-W-by-C-by-miniBatchSize shaped batch. The other ground truth data is configured as cell arrays of length equal to the mini-batch size.

miniBatchFcn = @(img,boxes,labels,masks) deal(cat(4,img{:}),boxes,labels,masks);

Specify the mini-batch data extraction format for the image data as "SSCB" (spatial, spatial, channel, batch). If a supported GPU is available for computation, then the minibatchqueue object preprocesses mini-batches in the background in a parallel pool during training.

mbqTrain = minibatchqueue(dsTrain,4, ...
    "MiniBatchFormat",["SSCB","","",""], ...
    "MiniBatchSize",miniBatchSize, ...
    "OutputCast",["single","","",""], ...
    "OutputAsDlArray",[true,false,false,false], ...
    "MiniBatchFcn",miniBatchFcn, ...
    "OutputEnvironment",["auto","cpu","cpu","cpu"]);

Train Network

To train the network, set the doTraining variable in the following code to true. Train the model in a custom training loop. For each iteration:

  • Read the data for the current mini-batch using the next (Deep Learning Toolbox) function.

  • Evaluate the model gradients using the dlfeval (Deep Learning Toolbox) function and the networkGradients helper function. The function networkGradients, listed as a supporting function, returns the gradients of the loss with respect to the learnable parameters, the corresponding mini-batch loss, and the state of the current batch.

  • Update the network parameters using the sgdmupdate (Deep Learning Toolbox) function.

  • Update the state parameters of the network with the moving average.

  • Update the training progress plot.

Train on a GPU if one is available. Using a GPU requires Parallel Computing Toolbox™ and a CUDA® enabled NVIDIA® GPU. For more information, see GPU Support by Release (Parallel Computing Toolbox).

doTraining = false;
if doTraining
    
    iteration = 1; 
    start = tic;
    
     % Create subplots for the learning rate and mini-batch loss
    fig = figure;
    [lossPlotter] = helper.configureTrainingProgressPlotter(fig);
    
    % Initialize verbose output
    helper.initializeVerboseOutput([]);
    
    % Custom training loop
    for epoch = 1:maxEpochs
        reset(mbqTrain)
        shuffle(mbqTrain)
    
        while hasdata(mbqTrain)
            % Get next batch from minibatchqueue
            [X,gtBox,gtClass,gtMask] = next(mbqTrain);
        
            % Evaluate the model gradients and loss
            [gradients,loss,state] = dlfeval(@networkGradients,X,gtBox,gtClass,gtMask,dlnet,params);
            dlnet.State = state;
            
            % Compute the learning rate for the current iteration
            learnRate = initialLearnRate/(1 + decay*iteration);
            
            if(~isempty(gradients) && ~isempty(loss))    
                [dlnet.Learnables,velocity] = sgdmupdate(dlnet.Learnables,gradients,velocity,learnRate,momentum);
            else
                continue;
            end
            
            helper.displayVerboseOutputEveryEpoch(start,learnRate,epoch,iteration,loss);
                
            % Plot loss/accuracy metric
            D = duration(0,0,toc(start),'Format','hh:mm:ss');
            addpoints(lossPlotter,iteration,double(gather(extractdata(loss))))
            subplot(2,1,2)
            title(strcat("Epoch: ",num2str(epoch),", Elapsed: "+string(D)))
            drawnow
            
            iteration = iteration + 1;    
        end
    
    end
    net = dlnet;
    
    % Save the trained network
    modelDateTime = string(datetime('now','Format',"yyyy-MM-dd-HH-mm-ss"));
    save(strcat("trainedMaskRCNN-",modelDateTime,"-Epoch-",num2str(maxEpochs),".mat"),'net');
    
end

Using the trained network, you can perform instance segmentation on test images, as demonstrated in the section Perform Instance Segmentation Using Pretrained Mask R-CNN.

References

[1] He, Kaiming, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. “Mask R-CNN.” Preprint, submitted January 24, 2018. https://arxiv.org/abs/1703.06870.

[2] Lin, Tsung-Yi, Michael Maire, Serge Belongie, Lubomir Bourdev, Ross Girshick, James Hays, Pietro Perona, Deva Ramanan, C. Lawrence Zitnick, and Piotr Dollár. “Microsoft COCO: Common Objects in Context,” May 1, 2014. https://arxiv.org/abs/1405.0312v3.
