
Object Detection Using Faster R-CNN Deep Learning

This example shows how to train an object detector using a deep learning technique named Faster R-CNN (Regions with Convolutional Neural Networks).


This example shows how to train a Faster R-CNN object detector for detecting vehicles. Faster R-CNN [1] is an extension of the R-CNN [2] and Fast R-CNN [3] object detection techniques. All three of these techniques use convolutional neural networks (CNN). The difference between them is how they select regions to process and how those regions are classified. R-CNN and Fast R-CNN use a region proposal algorithm as a pre-processing step before running the CNN. The proposal algorithms are typically techniques such as EdgeBoxes [4] or Selective Search [5], which are independent of the CNN. In the case of Fast R-CNN, the use of these techniques becomes the processing bottleneck compared to running the CNN. Faster R-CNN addresses this issue by implementing the region proposal mechanism using the CNN and thereby making region proposal a part of the CNN training and prediction steps.

In this example, a vehicle detector is trained using the trainFasterRCNNObjectDetector function from Computer Vision System Toolbox™. The example has the following sections:

  • Load the data set.

  • Design the convolutional neural network (CNN).

  • Configure training options.

  • Train Faster R-CNN object detector.

  • Evaluate the trained detector.

Note: This example requires Computer Vision System Toolbox™, Image Processing Toolbox™, and Neural Network Toolbox™.

Using a CUDA-capable NVIDIA™ GPU with compute capability 3.0 or higher is highly recommended for running this example. Use of a GPU requires Parallel Computing Toolbox™.

Load Dataset

This example uses a small vehicle data set that contains 295 images. Each image contains 1 to 2 labeled instances of a vehicle. A small data set is useful for exploring the Faster R-CNN training procedure, but in practice, more labeled images are needed to train a robust detector.

% Load vehicle data set
data = load('fasterRCNNVehicleTrainingData.mat');
vehicleDataset = data.vehicleTrainingData;

The training data is stored in a table. The first column contains the path to the image files. The remaining columns contain the ROI labels for vehicles.

% Display first few rows of the data set.
vehicleDataset(1:4,:)
ans =

  4x2 table

          imageFilename             vehicle   
    __________________________    ____________

    'vehicles/image_00001.jpg'    [1x4 double]
    'vehicles/image_00002.jpg'    [1x4 double]
    'vehicles/image_00003.jpg'    [1x4 double]
    'vehicles/image_00004.jpg'    [1x4 double]

Display one of the images from the data set to understand the type of images it contains.

% Add the full path to the local vehicle data folder.
dataDir = fullfile(toolboxdir('vision'),'visiondata');
vehicleDataset.imageFilename = fullfile(dataDir, vehicleDataset.imageFilename);

% Read one of the images.
I = imread(vehicleDataset.imageFilename{10});

% Insert the ROI labels.
I = insertShape(I, 'Rectangle', vehicleDataset.vehicle{10});

% Resize and display the image.
I = imresize(I, 3);
figure
imshow(I)

Split the data set into a training set for training the detector, and a test set for evaluating the detector. Select 60% of the data for training. Use the rest for evaluation.

% Split data into a training and test set.
idx = floor(0.6 * height(vehicleDataset));
trainingData = vehicleDataset(1:idx,:);
testData = vehicleDataset(idx+1:end,:);

Create a Convolutional Neural Network (CNN)

A CNN is the basis of the Faster R-CNN object detector. Create the CNN layer by layer using Neural Network Toolbox™ functionality.

Start with the imageInputLayer function, which defines the type and size of the input layer. For classification tasks, the input size is typically the size of the training images. For detection tasks, the CNN needs to analyze smaller sections of the image, so the input size must be similar in size to the smallest object in the data set. In this data set all the objects are larger than [16 16], so select an input size of [32 32]. This input size is a balance between processing time and the amount of spatial detail the CNN needs to resolve.

% Create image input layer.
inputLayer = imageInputLayer([32 32 3]);
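To confirm that a [32 32] input comfortably covers the smallest labeled object, you can inspect the ground truth box dimensions directly. This is a quick sketch that assumes the boxes are stored as [x y width height] vectors, as in this data set:

```matlab
% Stack all ground truth boxes ([x y width height]) into one matrix.
allBoxes = vertcat(vehicleDataset.vehicle{:});

% Smallest box width and height across the data set. Both values should
% exceed 16 so that a [32 32] input can resolve every object.
minWidthHeight = min(allBoxes(:, 3:4), [], 1)
```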

Next, define the middle layers of the network. The middle layers are made up of repeated blocks of convolutional, ReLU (rectified linear units), and pooling layers. These layers form the core building blocks of convolutional neural networks.

% Define the convolutional layer parameters.
filterSize = [3 3];
numFilters = 32;

% Create the middle layers.
middleLayers = [

    convolution2dLayer(filterSize, numFilters, 'Padding', 1)
    reluLayer()
    convolution2dLayer(filterSize, numFilters, 'Padding', 1)
    reluLayer()
    maxPooling2dLayer(3, 'Stride', 2)

    ];

You can create a deeper network by repeating these basic layers. However, to avoid downsampling the data prematurely, keep the number of pooling layers low. Downsampling early in the network discards image information that is useful for learning.
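The repetition described above can be scripted rather than written out by hand. The following sketch stacks two copies of the basic block (deeperMiddleLayers is an illustrative name, not part of the example):

```matlab
% One basic block: two 3x3 convolutions, each followed by a ReLU,
% then a 3x3 max pooling layer with stride 2.
basicBlock = [
    convolution2dLayer([3 3], 32, 'Padding', 1)
    reluLayer()
    convolution2dLayer([3 3], 32, 'Padding', 1)
    reluLayer()
    maxPooling2dLayer(3, 'Stride', 2)
    ];

% Repeat the block twice for a deeper network. Each repetition adds a
% pooling layer, halving the spatial resolution, so keep the count low.
deeperMiddleLayers = repmat(basicBlock, 2, 1);
```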

The final layers of a CNN are typically composed of fully connected layers and a softmax loss layer.

finalLayers = [

    % Add a fully connected layer with 64 output neurons. The output size
    % of this layer will be an array with a length of 64.
    fullyConnectedLayer(64)

    % Add a ReLU non-linearity.
    reluLayer()

    % Add the last fully connected layer. At this point, the network must
    % produce outputs that can be used to measure whether the input image
    % belongs to one of the object classes or background. This measurement
    % is made using the subsequent loss layers. The output size equals the
    % number of classes, including the background class.
    fullyConnectedLayer(width(vehicleDataset))

    % Add the softmax loss layer and classification layer.
    softmaxLayer()
    classificationLayer()

    ];

Combine the input, middle, and final layers.

layers = [
    inputLayer
    middleLayers
    finalLayers
    ]
layers = 

  11x1 Layer array with layers:

     1   ''   Image Input             32x32x3 images with 'zerocenter' normalization
     2   ''   Convolution             32 3x3 convolutions with stride [1  1] and padding [1  1  1  1]
     3   ''   ReLU                    ReLU
     4   ''   Convolution             32 3x3 convolutions with stride [1  1] and padding [1  1  1  1]
     5   ''   ReLU                    ReLU
     6   ''   Max Pooling             3x3 max pooling with stride [2  2] and padding [0  0  0  0]
     7   ''   Fully Connected         64 fully connected layer
     8   ''   ReLU                    ReLU
     9   ''   Fully Connected         2 fully connected layer
    10   ''   Softmax                 softmax
    11   ''   Classification Output   crossentropyex

Configure Training Options

trainFasterRCNNObjectDetector trains the detector in four steps. The first two steps train the region proposal and detection networks used in Faster R-CNN. The final two steps combine the networks from the first two steps such that a single network is created for detection [1]. Each training step can have different convergence rates, so it is beneficial to specify independent training options for each step. To specify the network training options use trainingOptions from Neural Network Toolbox™.

% Options for step 1.
optionsStage1 = trainingOptions('sgdm', ...
    'MaxEpochs', 10, ...
    'InitialLearnRate', 1e-5, ...
    'CheckpointPath', tempdir);

% Options for step 2.
optionsStage2 = trainingOptions('sgdm', ...
    'MaxEpochs', 10, ...
    'InitialLearnRate', 1e-5, ...
    'CheckpointPath', tempdir);

% Options for step 3.
optionsStage3 = trainingOptions('sgdm', ...
    'MaxEpochs', 10, ...
    'InitialLearnRate', 1e-6, ...
    'CheckpointPath', tempdir);

% Options for step 4.
optionsStage4 = trainingOptions('sgdm', ...
    'MaxEpochs', 10, ...
    'InitialLearnRate', 1e-6, ...
    'CheckpointPath', tempdir);

options = [
    optionsStage1
    optionsStage2
    optionsStage3
    optionsStage4
    ];

Here, the learning rate for the first two steps is set higher than for the last two. Because the last two steps are fine-tuning steps, the network weights should change more slowly than in the first two steps.

In addition, 'CheckpointPath' is set to a temporary location for all the training options. This name-value pair enables the saving of partially trained detectors during the training process. If training is interrupted, such as from a power outage or system failure, you can resume training from the saved checkpoint.
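Resuming from a checkpoint can be sketched as follows. Note that the checkpoint file naming and the variable stored in the MAT-file are assumptions here; inspect the contents of tempdir after an interrupted run to confirm them:

```matlab
% List saved checkpoints (the file name pattern is an assumption;
% check tempdir after an interrupted run for the actual names).
checkpointFiles = dir(fullfile(tempdir, '*checkpoint*.mat'));

% Load the most recent checkpoint and resume training by passing the
% partially trained detector in place of the raw layer array.
checkpoint = load(fullfile(tempdir, checkpointFiles(end).name));
detector = trainFasterRCNNObjectDetector(trainingData, ...
    checkpoint.detector, options);
```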

Train Faster R-CNN

Now that the CNN and training options are defined, you can train the detector using trainFasterRCNNObjectDetector.

During training, image patches are extracted from the training data. The 'PositiveOverlapRange' and 'NegativeOverlapRange' name-value pairs control which image patches are used for training. Positive training samples are those that overlap with the ground truth boxes by 0.6 to 1.0, as measured by the bounding box intersection over union (IoU) metric. Negative training samples are those that overlap by 0 to 0.3. To choose the best values for these name-value pairs, test the trained detector on a validation set.
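The overlap measure behind these ranges is intersection over union, which Computer Vision System Toolbox exposes as bboxOverlapRatio. A quick sketch with two hypothetical boxes:

```matlab
% Two example boxes in [x y width height] form. The second box is the
% first shifted right by half its width.
boxA = [10 10 20 20];
boxB = [20 10 20 20];

% IoU = intersection area / union area. Here the boxes share a 10x20
% region out of a 600-pixel union, so the ratio is 200/600.
iou = bboxOverlapRatio(boxA, boxB)
```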

For Faster R-CNN training, the use of a parallel pool of MATLAB workers is highly recommended to reduce training time. trainFasterRCNNObjectDetector automatically creates and uses a parallel pool based on your parallel preference settings. Ensure that the use of the parallel pool is enabled prior to training.

A CUDA-capable NVIDIA™ GPU with compute capability 3.0 or higher is highly recommended for training.

To save time while running this example, a pretrained network is loaded from disk. To train the network yourself, set the doTrainingAndEval variable shown here to true.

% A trained network is loaded from disk to save time when running the
% example. Set this flag to true to train the network.
doTrainingAndEval = false;

if doTrainingAndEval
    % Set random seed to ensure example training reproducibility.
    rng(0);

    % Train Faster R-CNN detector. Select a BoxPyramidScale of 1.2 to allow
    % for finer resolution for multiscale object detection.
    detector = trainFasterRCNNObjectDetector(trainingData, layers, options, ...
        'NegativeOverlapRange', [0 0.3], ...
        'PositiveOverlapRange', [0.6 1], ...
        'BoxPyramidScale', 1.2);
else
    % Load pretrained detector for the example.
    detector = data.detector;
end

To quickly verify the training, run the detector on a test image.

% Read a test image.
I = imread('highway.png');

% Run the detector.
[bboxes, scores] = detect(detector, I);

% Annotate detections in the image.
I = insertObjectAnnotation(I, 'rectangle', bboxes, scores);

% Display the annotated image.
figure
imshow(I)

Evaluate Detector Using Test Set

Testing a single image showed promising results. To fully evaluate the detector, testing it on a larger set of images is recommended. Computer Vision System Toolbox™ provides object detector evaluation functions to measure common metrics such as average precision (evaluateDetectionPrecision) and log-average miss rates (evaluateDetectionMissRate). Here, the average precision metric is used. The average precision provides a single number that incorporates the ability of the detector to make correct classifications (precision) and the ability of the detector to find all relevant objects (recall).

The first step for detector evaluation is to collect the detection results by running the detector on the test set. To avoid long evaluation time, the results are loaded from disk. Set the doTrainingAndEval flag from the previous section to true to execute the evaluation locally.

if doTrainingAndEval
    % Run detector on each image in the test set and collect results.
    resultsStruct = struct([]);
    for i = 1:height(testData)

        % Read the image.
        I = imread(testData.imageFilename{i});

        % Run the detector.
        [bboxes, scores, labels] = detect(detector, I);

        % Collect the results.
        resultsStruct(i).Boxes = bboxes;
        resultsStruct(i).Scores = scores;
        resultsStruct(i).Labels = labels;
    end

    % Convert the results into a table.
    results = struct2table(resultsStruct);
else
    % Load results from disk.
    results = data.results;
end

% Extract expected bounding box locations from test data.
expectedResults = testData(:, 2:end);

% Evaluate the object detector using Average Precision metric.
[ap, recall, precision] = evaluateDetectionPrecision(results, expectedResults);

The precision/recall (PR) curve highlights how precise a detector is at varying levels of recall. Ideally, the precision would be 1 at all recall levels. In this example, the average precision is 0.6. The use of additional layers in the network can help improve the average precision, but might require additional training data and longer training time.

% Plot precision/recall curve.
figure
plot(recall, precision)
xlabel('Recall')
ylabel('Precision')
grid on
title(sprintf('Average Precision = %.1f', ap))


This example showed how to train a vehicle detector using deep learning. You can follow similar steps to train detectors for traffic signs, pedestrians, or other objects.

Learn more about Deep Learning for Computer Vision.


[1] Ren, Shaoqing, et al. "Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks." Advances in Neural Information Processing Systems. 2015.

[2] Girshick, Ross, et al. "Rich feature hierarchies for accurate object detection and semantic segmentation." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2014.

[3] Girshick, Ross. "Fast R-CNN." Proceedings of the IEEE International Conference on Computer Vision. 2015.

[4] Zitnick, C. Lawrence, and Piotr Dollar. "Edge boxes: Locating object proposals from edges." European Conference on Computer Vision 2014. Springer International Publishing, 2014. 391-405.

[5] Uijlings, Jasper RR, et al. "Selective search for object recognition." International Journal of Computer Vision (2013): 154-171.