
Multi-Object Tracking with DeepSORT

This example shows how to integrate appearance features from a re-identification (Re-ID) deep neural network with a multi-object tracker to improve the performance of camera-based object tracking. The implementation closely follows the Deep Simple Online and Realtime Tracking (DeepSORT) multi-object tracking algorithm [1]. This example uses the Sensor Fusion and Tracking Toolbox™ and the Computer Vision Toolbox™.

Introduction

The objectives of multi-object tracking are to estimate the number of objects in a scene, to accurately estimate their positions, and to establish and maintain unique identities for all objects. You often achieve this through a tracking-by-detection approach that consists of two consecutive tasks. First, you obtain the detections of objects in each frame. Second, you perform track association and management across frames.

This example builds upon the SORT algorithm, introduced in the Implement Simple Online and Realtime Tracking (Sensor Fusion and Tracking Toolbox) example. The data association and track management of SORT are efficient and simple to implement, but they are ineffective when tracking objects over occlusions in single-view camera scenes.

The increasingly popular Re-ID networks provide appearance features, sometimes called appearance embeddings, for each object detection. Appearance features are a representation of the visual appearance of an object. They offer an additional measure of the similarity (or distance) between a detection and a track. The integration of appearance information into the data association is a powerful technique to handle tracking over longer occlusions and therefore reduces the number of switches in track identities.

Assignment Distances

In this section, you learn about the three types of distances that the DeepSORT assignment strategy relies on.

Consider the case depicted in the image below. In the current frame, an object detector returns a detection (Det: 1, in yellow), which must be associated with the existing tracks maintained by the multi-object tracker. The tracker hypothesizes that an object with TrackID 1 exists in the current frame, and its estimated bounding box is shown in orange. The track and the detection shown in the image are saved in the associationExampleData MAT-file.

Each distance type may return values in a different range, but larger values always indicate that the detection and track are less likely to belong to the same object.

load("associationExampleData.mat","newDetection","predictedTrack","frame");

Bounding Box Intersection Over Union

This is the distance metric used in SORT. It formulates a distance between a track and a detection based on the overlap ratio of the two bounding boxes.

$$\mathrm{distance}_{\mathrm{IoU}} = 1 - \frac{\text{Area of Intersection}}{\text{Area of Union}}$$

The output, distanceIoU, is a scalar between 0 and 1. Evaluate the intersection-over-union distance using the helperDeepSORT.distanceIoU function.

helperDeepSORT.distanceIoU(predictedTrack, newDetection)
ans = 0.5456
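
As a cross-check, you can also compute this distance with the bboxOverlapRatio function from the Computer Vision Toolbox, using the helperDeepSORT conversion utilities that appear later in this example (getTrackRectangles and uvah2tlwh) to obtain boxes in [x y width height] format. This is only a verification of the formula above.

% Cross-check of the IoU distance (assumes the helperDeepSORT utilities shown later)
trackBox = helperDeepSORT.getTrackRectangles(predictedTrack);       % [x y w h]
detectionBox = helperDeepSORT.uvah2tlwh(newDetection.Measurement);  % [x y w h]
1 - bboxOverlapRatio(trackBox, detectionBox)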

Mahalanobis Distance

Another common approach to evaluate the distance between detections and tracks is the Mahalanobis distance, a statistical distance between probability density functions. It accounts for the uncertainty in the current bounding box location estimate and the uncertainty in the measurement. The distance is given by the following equation

$$\mathrm{distance}_{\mathrm{Mahalanobis}} = (z - Hx)^{T} S^{-1} (z - Hx)$$

z is the bounding box measurement of the detection and x is the track state. H is the Jacobian of the measurement function, which can also be interpreted as the projection from the 8-dimensional state space to the 4-dimensional measurement space in this example. In other words, Hx is the predicted measurement. S is the innovation covariance matrix with the following definition.

$$S = H P H^{T} + R$$

where R is the measurement noise covariance.

Evaluate the Mahalanobis distance between the predicted track and the detection.

predictedMeasurement = predictedTrack.State(1:4)' % Same as Hx
predictedMeasurement = 1×4

  990.9279  440.6264    0.3200  174.0941

innovation = newDetection.Measurement-predictedMeasurement % z - Hx
innovation = 1×4

   23.3621    0.5786    0.0774    1.3359

S = predictedTrack.StateCovariance(1:4,1:4) + newDetection.MeasurementNoise % Same as HPH' + R
S = 4×4

    3.5633         0         0         0
         0   43.2935         0         0
         0         0    0.0015         0
         0         0         0  174.1330
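
Before calling the helper, you can verify the formula by evaluating the quadratic form directly with the innovation and S computed above. The result should match the helper output below, up to rounding of the displayed values.

% Evaluate (z-Hx)' * inv(S) * (z-Hx) with the row-vector innovation
innovation/S*innovation'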

Use the helperDeepSORT.distanceMahalanobis function to calculate the distance.

helperDeepSORT.distanceMahalanobis(predictedTrack,newDetection)
ans = 157.1207

The output of distanceMahalanobis is a positive scalar. Unlike the other two distances, it is not bounded.

Appearance Cosine Distance

This distance metric evaluates the distance between a detection and the predicted track in the appearance feature space.

In DeepSORT [1], each track keeps the history of appearance feature vectors from previous detection assignments. Inspect the Appearance field of the saved track, under the ObjectAttributes property. In this example, appearance vectors are unit vectors with 128 elements. The following predicted track history has 3 vectors.

appearanceHistory = predictedTrack.ObjectAttributes.Appearance
appearanceHistory = 128×3 single matrix

   -0.5418   -0.2732   -0.3913
   -0.4613   -0.5532   -0.6003
   -0.4987   -0.3153   -0.4585
    0.6873    0.9047    0.5020
   -0.1086   -0.1262   -0.2338
   -0.3086   -0.1275   -0.2567
    0.1323    0.0257    0.0728
    0.4070    0.3539    0.3092
   -0.5913   -0.5064   -0.5510
   -0.5432   -0.5954   -0.5659
      ⋮

The distance between two appearance vectors is derived directly from their scalar product.

$$d = 1 - \frac{\langle \mathrm{appearance}_1, \mathrm{appearance}_2 \rangle}{\lVert \mathrm{appearance}_1 \rVert \, \lVert \mathrm{appearance}_2 \rVert}$$

With this formula, you can calculate the distance between the appearance vector of a detection and the track history as follows.

detectionAppearance = newDetection.ObjectAttributes.Appearance;
1- (detectionAppearance./vecnorm(detectionAppearance))' *(appearanceHistory./vecnorm(appearanceHistory))
ans = 1×3 single row vector

    0.0460    0.0292    0.0232

Define the appearance cosine distance between a track and a detection as the minimum distance across the history of the track appearance vectors. Use the helperDeepSORT.distanceCosine function to calculate it.

helperDeepSORT.distanceCosine(predictedTrack, newDetection)
ans = single
    0.0232

The appearance cosine distance returns a scalar between 0 and 2.

In this example you use the three distance metrics to formulate the overall assignment problem in terms of cost minimization. You calculate distances for all possible pairs of detections and tracks to form cost matrices.

Matching Cascade

The original idea behind DeepSORT is to combine the Mahalanobis distance and the appearance feature cosine distance to assign a set of new detections to the set of current tracks. The combination is done using a weight parameter λ that has a value between 0 and 1.

$$\mathrm{Cost} = \lambda \, \mathrm{MahalanobisCost} + (1 - \lambda) \, \mathrm{CosineCost}$$

Both Mahalanobis and the appearance cosine cost matrices are subjected to gating thresholds. Thresholding is done by setting cost matrix elements larger than their respective thresholds to Inf.

Due to the growth of the state covariance for unassigned tracks, the Mahalanobis distance tends to favor tracks that have not been updated in the last few frames over tracks with a smaller prediction error. DeepSORT handles this effect by splitting tracks into groups according to the last frame in which they were assigned. The algorithm assigns tracks that were updated in the previous frame first. Tracks are assigned to the new detections using linear assignment. Any remaining detections are considered for assignment with the next track group. Once all track groups have been given a chance to be assigned, the remaining unassigned tracks with an unassigned age of 1 and the remaining unassigned detections are considered for linear assignment based on their IoU cost matrix. The flowchart below describes the matching cascade.
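
The following function is a minimal sketch of the cascade loop, written with the assignDetectionsToTracks function from the Computer Vision Toolbox. It assumes a precomputed combined cost matrix and a vector of per-track unassigned ages, and it omits the final IoU assignment step. The actual implementation is in the helperDeepSORT class.

function [assignments, unassignedTracks, unassignedDetections] = ...
    matchingCascadeSketch(costMatrix, trackAges, maxAge, costOfNonAssignment)
% Illustrative sketch of the DeepSORT matching cascade (final IoU step omitted).
% costMatrix is numTracks-by-numDetections, with gated entries set to a large value.
% trackAges(i) is the number of frames since track i was last assigned.
assignments = zeros(0,2);
unassignedTracks = zeros(0,1);
unassignedDetections = (1:size(costMatrix,2))';
for age = 1:maxAge
    trackGroup = find(trackAges(:) == age); % tracks last assigned 'age' frames ago
    if isempty(trackGroup) || isempty(unassignedDetections)
        unassignedTracks = [unassignedTracks; trackGroup]; %#ok<AGROW>
        continue
    end
    % Solve the linear assignment problem for this track group only
    subCost = costMatrix(trackGroup, unassignedDetections);
    [pairs, unTrk, unDet] = assignDetectionsToTracks(subCost, costOfNonAssignment);
    % Map local indices back to the original track and detection indices
    assignments = [assignments; ...
        trackGroup(pairs(:,1)), unassignedDetections(pairs(:,2))]; %#ok<AGROW>
    unassignedTracks = [unassignedTracks; trackGroup(unTrk)]; %#ok<AGROW>
    unassignedDetections = unassignedDetections(unDet);
end
end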

The helperDeepSORT class implements the assignment routine. You can modify the code and try your own assignment instead.

Pedestrian Tracking Dataset

Download the pedestrian tracking video file.

helperDownloadPedestrianTrackingVideo();

The PedestrianTrackingYOLODetections MAT-file contains detections generated by a YOLO v4 object detector that uses a CSP-DarkNet-53 network trained on the COCO dataset. See the yolov4ObjectDetector object for more details. The PedestrianTrackingGroundTruth MAT-file contains the ground truth for this video. Refer to the Import Camera-Based Datasets in MOT Challenge Format for Object Tracking (Sensor Fusion and Tracking Toolbox) example to learn how to import the ground truth and detection data into appropriate Sensor Fusion and Tracking Toolbox™ formats.

datasetname="PedestrianTracking";
load(datasetname+"GroundTruth.mat","truths");
load(datasetname+"YOLODetections.mat","detections");

Convert the detections from [xmin, ymin, width, height] bounding box coordinates to [xcenter, ycenter, aspect ratio, height].

detections = helperConvertDeepSORTBoundingBox(detections);
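
For a single bounding box, the conversion follows directly from the two formats. The anonymous function below only illustrates the formula; the helperConvertDeepSORTBoundingBox function applies the conversion to the entire cell array of detections.

% Illustration of the conversion for one box: [xmin ymin w h] -> [xcenter ycenter w/h h]
tlwh2uvah = @(b) [b(1) + b(3)/2, b(2) + b(4)/2, b(3)/b(4), b(4)];
tlwh2uvah([100 200 50 150]) % returns [125 275 0.3333 150]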

Set the measurement covariance matrix using a standard deviation of 5 pixels in both x and y directions, and a standard deviation of 10 pixels for the bounding box height. Use 1e-3 for the variance of the bounding box aspect ratio.

R = diag ([25, 25, 1e-3, 100]);
for i=1:numel(detections)
    for j=1:numel(detections{i})
        detections{i}(j).MeasurementNoise = R;
    end
end

Pre-Trained Person Re-Identification Network

Download the pre-trained re-identification network from the internet. Refer to the Reidentify People Throughout a Video Sequence Using ReID Network example to learn about this network and how to train it. You use this pre-trained network to evaluate appearance features for each detection.

helperDownloadReIDResNet();

Load and initialize the network.

load("personReIDResNet.mat","net");
net = initialize(net);

To obtain the appearance feature vector of a detection, you extract the bounding box coordinates and convert them to image frame indices. You can then crop the bounding box of the detection out of the frame and feed the cropped image to the pre-trained network. Because the network was trained with images of size 128-by-64 pixels, you resize the cropped image to this size.

% Convert bounding box and crop frame
uvah = newDetection.Measurement;
bbox = helperDeepSORT.uvah2tlwh(uvah);
croppedPerson = imcrop(frame, bbox);
imshow(croppedPerson);

croppedPerson = im2single(imresize(croppedPerson,[128,64]));

% Predict Appearance with network
appearanceDLArray = predict(net,255*dlarray(croppedPerson));

% Format as a regular vector and save to the object detection
appearanceVect = extractdata(appearanceDLArray);
appearanceVect = appearanceVect(:)
appearanceVect = 128×1 single column vector

   -0.2321
    0.3099
    2.3477
   -0.7487
   -0.5316
    1.3380
    2.2871
    1.4031
   -1.4003
    1.2088
      ⋮

Use the supporting function runReIDNet to iterate over a set of detections and perform the steps above.

Build DeepSORT Tracker

In this section, you construct the DeepSORT tracker. The remaining components are the estimation filters, the appearance feature update, and the track initialization and deletion routine. The diagram below summarizes all the components involved in tracking-by-detection with DeepSORT.

Estimation Filters

As in SORT, the bounding boxes are estimated with a linear Kalman filter using a constant velocity motion model. The helperInitcvbbkf function shows how the filter is initialized from a new detection. Inspect the helperDeepSORT class to find its implementation.
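
The sketch below shows one way to initialize such a filter with the trackingKF object, assuming the state vector is [u; v; a; h] followed by its derivatives, as described in the Assignment Distances section. This is an illustration only; the actual helperInitcvbbkf implementation may differ in its noise settings.

function filter = initcvbbkfSketch(detection)
% Sketch of a constant velocity bounding box Kalman filter (illustration only).
% The state is assumed to be [u; v; a; h; du; dv; da; dh].
dt = 1; % frame period in seconds (1 Hz video)
A = [eye(4), dt*eye(4); zeros(4), eye(4)]; % constant velocity state transition
H = [eye(4), zeros(4)];                    % measure [u v a h] only
z = detection.Measurement(:);
filter = trackingKF(MotionModel="Custom", StateTransitionModel=A, ...
    MeasurementModel=H, State=[z; zeros(4,1)], ...
    StateCovariance=blkdiag(detection.MeasurementNoise, 100*eye(4)), ...
    MeasurementNoise=detection.MeasurementNoise);
end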

Track Initialization and Deletion

A new track is confirmed if it is assigned in 3 consecutive frames. An existing track is deleted if it is missed for more than Tlost frames. In this example, you set Tlost = 5. This is long enough to account for all the occlusions in the video, which has a low frame rate (1 Hz). For videos with higher frame rates, increase this value accordingly.

Appearance Feature Update

For each assigned track, DeepSORT stores the appearance feature vectors of assigned detections up to the value specified in the MaxNumAppearanceFrames property. Use a value of 50 frames. Consider increasing this value for high frame-rate videos.
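
The following function is a minimal sketch of this buffer update, assuming the newest appearance vectors are kept and the oldest are dropped once the limit is reached. The helperDeepSORT class contains the actual implementation.

function history = updateAppearanceHistory(history, newAppearance, maxNumFrames)
% Sketch of the appearance buffer update for an assigned track (assumption:
% newest vectors are kept, oldest dropped beyond maxNumFrames columns).
history = [history, newAppearance(:)];
if size(history, 2) > maxNumFrames
    history = history(:, end-maxNumFrames+1:end);
end
end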

Finally, use the helperDeepSORT class to build the tracker. The class inherits from the trackerGNN System object and therefore inherits all of its properties.

Set these properties inherited from the trackerGNN System object.

  • ConfirmationThreshold

  • DeletionThreshold

Set ConfirmationThreshold to [3 3] and DeletionThreshold to [Tlost Tlost] according to the previous discussion on track initialization and deletion.

Set these properties that are specific to the helperDeepSORT class.

  • MaxNumAppearanceFrames

  • AppearanceWeight

  • MahalanobisAssignmentThreshold

  • AppearanceAssignmentThreshold

  • IOUAssignmentThreshold

  • FrameRate

Set IOUAssignmentThreshold to a large value to allow the assignment of detections to new tentative tracks. In this video, the low frame rate, the closeness of the camera to the scene, and the small number of people in the scene result in little overlap between detections of the same object in consecutive frames. You can set this threshold to a lower value for videos with higher frame rates or more crowded scenes.

Next, set the MahalanobisAssignmentThreshold and AppearanceAssignmentThreshold properties. The Mahalanobis distance follows a chi-square distribution. Therefore, draw the threshold from the inverse chi-square distribution for a confidence interval of about 95%. For a 4-dimensional measurement space, the value is 9.4877. Manual tuning leads to an appearance threshold of 0.4.
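
If you have the Statistics and Machine Learning Toolbox, you can compute this gating value from the inverse chi-square cumulative distribution function.

% 95% gate for a 4-dimensional measurement space
chi2inv(0.95, 4) % approximately 9.4877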

In [1], setting the AppearanceWeight λ to 0 gives better results. In this scene, the combination of the Mahalanobis threshold and the appearance threshold resolves most assignment ambiguities. Therefore, you can choose any value between 0 and 1. For more crowded scenes, consider including some Mahalanobis distance by using a nonzero appearance weight, as noted in [2]. Set MaxNumAppearanceFrames per the previous considerations.

% Configure DeepSORT Tracker
lambda = 0;
Tlost = 5;

tracker = helperDeepSORT(ConfirmationThreshold = [3 3],...
    DeletionThreshold = [Tlost Tlost],...
    MaxNumAppearanceFrames = 50,...
    MahalanobisAssignmentThreshold = 10,...
    AppearanceAssignmentThreshold = 0.4,...
    IOUAssignmentThreshold = 0.95,...
    AppearanceWeight = lambda,...
    FrameRate = 1)
tracker = 
  helperDeepSORT with properties:

                  AppearanceWeight: 0
     AppearanceAssignmentThreshold: 0.4000
    MahalanobisAssignmentThreshold: 10
            IOUAssignmentThreshold: 0.9500
            MaxNumAppearanceFrames: 50
                         FrameRate: 1

           FilterInitializationFcn: @helperInitcvbbkf
                      MaxNumTracks: 100
                  MaxNumDetections: Inf
                     MaxNumSensors: 20

             ConfirmationThreshold: [3 3]
                 DeletionThreshold: [5 5]

                         NumTracks: 0
                NumConfirmedTracks: 0

Evaluate DeepSORT

Next, evaluate the complete tracking workflow on the pedestrian tracking video. To use the tracker, call it with an array of objectDetection objects as the input, as if it were a function. The tracker returns confirmed tracks, tentative tracks, all tracks, and an analysis info structure, similar to the trackerGNN object.

Filter out the YOLO detections with a confidence score lower than 0.5. Delete tracks whose bounding box is entirely out of the camera frame. This prevents keeping tracks alive for up to 5 additional frames after they have left the camera field of view.

% Create a reader to display the video
reader = VideoReader("PedestrianTrackingVideo.avi");

% Initialize track log
deepSORTTrackLog = objectTrack.empty;

% Set minimum detection score
detectionScoreThreshold = 0.5;

% Processing Loop
for i=1:reader.NumFrames

    % Advance reader
    frame = readFrame(reader);

    % Parse detections set to retrieve detections on the ith frame
    curFrameDetections = detections{i};
    attributes = arrayfun(@(x) x.ObjectAttributes, curFrameDetections);
    scores = arrayfun(@(x) x.Score, attributes);
    highScoreDetections = curFrameDetections(scores > detectionScoreThreshold);

    % Run Re-ID Network on detections
    highScoreDetections = runReIDNet(net, frame, highScoreDetections);

    [tracks, ~, ~, info] = tracker(highScoreDetections);
    
    deleteOutOfFrameTracks(tracker, tracks);

    frame = helperAnnotateDeepSORTTrack(tracks, frame);
    imshow(frame);
    
    % Log tracks for evaluation
    deepSORTTrackLog = [deepSORTTrackLog ; tracks]; %#ok<AGROW>
end

From the results, the person tracked with ID = 3 is occluded multiple times and makes abrupt changes of direction. This makes the person difficult to track using only motion information, whether through the Mahalanobis distance or the bounding box overlap. Using appearance features maintains a unique track identifier for this person over the entire sequence and for the rest of the video. This is not achieved with the simpler SORT algorithm, or when configuring DeepSORT to use only the Mahalanobis distance. You can verify this by setting the AppearanceWeight parameter to 1 and relaxing the appearance gate by setting AppearanceAssignmentThreshold to 2, as in the sketch below.
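
For reference, such a motion-only configuration could look like the following sketch, which is not run in this example.

% Motion-only configuration for comparison (illustration, not run here).
% AppearanceWeight = 1 uses only the Mahalanobis cost, and an appearance
% threshold of 2 effectively disables appearance gating.
motionOnlyTracker = helperDeepSORT(ConfirmationThreshold = [3 3],...
    DeletionThreshold = [Tlost Tlost],...
    MaxNumAppearanceFrames = 50,...
    MahalanobisAssignmentThreshold = 10,...
    AppearanceAssignmentThreshold = 2,...
    IOUAssignmentThreshold = 0.95,...
    AppearanceWeight = 1,...
    FrameRate = 1);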

Tracking Metrics

The CLEAR multi-object tracking metrics provide a standard set of metrics to evaluate the quality of a tracking algorithm. These metrics are popular for video-based tracking applications. Use the trackCLEARMetrics (Sensor Fusion and Tracking Toolbox) object to evaluate the CLEAR metrics for the DeepSORT results.

The CLEAR metrics require a similarity method to match track and true object pairs in each frame. In this example, you use the IoU2d similarity method and set the SimilarityThreshold property to 0.01. This means that a track can only be considered a true positive match with a truth object if their bounding boxes overlap by at least 1%. The metric results can vary depending on the choice of this threshold.

tcm = trackCLEARMetrics(SimilarityMethod ="IoU2d", SimilarityThreshold = 0.01);

The first step is to convert the objectTrack format to the trackCLEARMetrics input format specific to the IoU2d similarity method. Convert the track log.

deepSORTTrackedObjects = repmat(struct("Time",0,"TrackID",1,"BoundingBox", [0 0 0 0]),size(deepSORTTrackLog));
for i=1:numel(deepSORTTrackedObjects)
    deepSORTTrackedObjects(i).Time = deepSORTTrackLog(i).UpdateTime;
    deepSORTTrackedObjects(i).TrackID = deepSORTTrackLog(i).TrackID;
    deepSORTTrackedObjects(i).BoundingBox(:) = helperDeepSORT.getTrackRectangles(deepSORTTrackLog(i))';
end

To evaluate the results on the Pedestrian class only, you only keep ground truth elements with ClassID equal to 1 and filter out other classes.

truths = truths([truths.ClassID]==1);

Use the evaluate object function to obtain the metrics as a table.

deepSORTresults = evaluate(tcm, deepSORTTrackedObjects, truths);
disp(deepSORTresults)
    MOTA (%)    MOTP (%)    Mostly Tracked (%)    Partially Tracked (%)    Mostly Lost (%)    False Positive    False Negative    Recall (%)    Precision (%)    False Track Rate    ID Switches    Fragmentations
    ________    ________    __________________    _____________________    _______________    ______________    ______________    __________    _____________    ________________    ___________    ______________

     84.718      93.097           84.615                 15.385                   0                 31                61            89.867          94.58            0.18343              0               2       

The CLEAR MOT metrics corroborate the quality of DeepSORT in keeping track identities over time, with no ID switches and very little fragmentation. This is the main benefit of using DeepSORT over SORT. Meanwhile, keeping tracks alive over occlusions means that predicted locations are maintained (coasting) and compared against the true positions, which leads to an increased number of false positives and false negatives when the overlap between the coasted tracks and the true bounding boxes is less than the metric threshold. This is reflected in the MOTA score of DeepSORT.

Refer to the trackCLEARMetrics (Sensor Fusion and Tracking Toolbox) page for additional information about all the CLEAR metrics quantities.

Note that the matching cascade is the original idea behind DeepSORT to handle the spread of covariance during occlusions. The Mahalanobis distance can be modified to be more robust to such effects, and a single-step assignment can lead to identical or even better performance, as shown in [2].

Conclusion

In this example you have learned how to implement the DeepSORT object tracking algorithm. This is an example of attribute fusion by using deep appearance features for the assignment. The appearance attribute is updated using a simple memory buffer. You also have learned how to integrate a Re-Identification Deep Learning network as part of the tracking-by-detection framework to improve the performance of camera-based tracking in the presence of occlusions.

Supporting Functions

function detections = runReIDNet(net, frame, detections)

if isempty(detections)
    detections = objectDetection.empty;
else
    for j =1:numel(detections)

        % Convert bounding box to tlwh and crop frame
        uvah = detections(j).Measurement;
        bbox = helperDeepSORT.uvah2tlwh(uvah);
        croppedPerson = imcrop(frame,bbox);
        croppedPerson = im2single(imresize(croppedPerson,[128,64]));

        % Predict Appearance with network
        appearanceDLArray = predict(net,255*dlarray(croppedPerson));

        % Format as a regular vector and save to the object detection
        appearanceVect = extractdata(appearanceDLArray);
        appearanceVect = reshape(appearanceVect,[],1);
        detections(j).ObjectAttributes.Appearance = appearanceVect;
    end
end
end

deleteOutOfFrameTracks deletes tracks if their bounding box is entirely out of the video frame.

function deleteOutOfFrameTracks(tracker, confirmedTracks)
% Get bounding boxes in tlwh format
allboxes = helperDeepSORT.getTrackRectangles(confirmedTracks);
% Ensure strictly positive box values before computing overlaps
allboxes = max(allboxes, realmin);
% A track is out of frame when its box has no overlap with the video frame
alloverlaps = bboxOverlapRatio(allboxes,[1,1,1288,964]);
isOutOfFrame = ~alloverlaps;
allTrackIDs = [confirmedTracks.TrackID];
trackToDelete = allTrackIDs(isOutOfFrame);
for i=1:numel(trackToDelete)
    tracker.deleteTrack(trackToDelete(i));
end
end

References

[1] Wojke, Nicolai, Alex Bewley, and Dietrich Paulus. "Simple Online and Realtime Tracking with a Deep Association Metric." In 2017 IEEE International Conference on Image Processing (ICIP), pp. 3645-3649. IEEE, 2017.

[2] Du, Yunhao, Zhicheng Zhao, Yang Song, Yanyun Zhao, Fei Su, Tao Gong, and Hongying Meng. "StrongSORT: Make DeepSORT Great Again." IEEE Transactions on Multimedia (2023).