The you-only-look-once (YOLO) v3 object detector is a multi-scale object detection network that uses a feature extraction network and multiple detection heads to make predictions at multiple scales.
The YOLO v3 object detection model runs a deep learning convolutional neural network (CNN) on an input image to produce network predictions from multiple feature maps. The object detector gathers and decodes predictions to generate the bounding boxes.
YOLO v3 uses anchor boxes to detect classes of objects in an image. For more details, see Anchor Boxes for Object Detection. YOLO v3 predicts three attributes for each anchor box:
Intersection over union (IoU) — Predicts the objectness score of each anchor box.
Anchor box offsets — Refine the position of each anchor box.
Class probability — Predicts the class label assigned to each anchor box.
The figure shows predefined anchor boxes (the dotted lines) at each location in a feature map, and the refined locations after the offsets are applied. Boxes matched to a class appear in color.
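As a concrete illustration, the refinement step can be sketched in a few lines. This is a language-agnostic sketch of the standard YOLO v3 decoding equations (sigmoid-bounded center offsets, exponential width and height scaling); all names are illustrative and are not the toolbox API.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def decode_prediction(tx, ty, tw, th, to, cell_x, cell_y, anchor_w, anchor_h):
    """Refine one anchor box from the raw network outputs (YOLO v3 decoding).

    (bx, by) is the box center in grid-cell units; (bw, bh) is the box
    size in the same units as the anchor dimensions; objectness is the
    predicted IoU score for the refined box.
    """
    bx = cell_x + sigmoid(tx)      # sigmoid keeps the center inside its cell
    by = cell_y + sigmoid(ty)
    bw = anchor_w * math.exp(tw)   # scale the anchor width
    bh = anchor_h * math.exp(th)   # scale the anchor height
    objectness = sigmoid(to)       # objectness (predicted IoU) score
    return bx, by, bw, bh, objectness

# Zero raw outputs reproduce the anchor box centered in its cell:
print(decode_prediction(0, 0, 0, 0, 0, cell_x=3, cell_y=5, anchor_w=2.0, anchor_h=4.0))
# (3.5, 5.5, 2.0, 4.0, 0.5)
```

The class probability for each anchor box is decoded the same way, by applying an independent sigmoid to each raw class score.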
To design a YOLO v3 object detection network, follow these steps.
Start the model with a feature extraction network. The feature extraction network serves as the base network for creating the YOLO v3 deep learning network. The base network can be a pretrained or untrained CNN. If the base network is a pretrained network, you can perform transfer learning.
Create detection subnetworks using convolution, batch normalization, and ReLU layers. Add the detection subnetworks to any of the layers in the base network. The layers whose outputs feed the detection subnetworks are the detection network sources. Any layer from the feature extraction network can serve as a detection network source. To use multiscale features for object detection, choose feature maps of different sizes.
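To see why feature maps of different sizes matter, consider the output size of each detection head. The sketch below (illustrative Python; a 416-by-416 input and 80 classes are assumptions for concreteness, not values from this document) computes the prediction grid for two detection network sources at different downsampling strides — the coarser grid suits large objects, the finer grid small ones:

```python
def head_output_size(input_size, stride, num_anchors, num_classes):
    """Spatial grid and channel count of one detection head's output.

    Each grid cell predicts, per anchor box: 4 box offsets,
    1 objectness score, and num_classes class probabilities.
    """
    grid = input_size // stride
    channels = num_anchors * (5 + num_classes)
    return grid, grid, channels

# Two detection sources with different downsampling factors give
# predictions at two scales for the same input image:
for stride in (32, 16):
    print(head_output_size(416, stride, num_anchors=3, num_classes=80))
# (13, 13, 255) then (26, 26, 255)
```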
To manually create a YOLO v3 deep learning network, use the Deep Network Designer (Deep Learning Toolbox) app. To programmatically create a YOLO v3 deep learning network, use the yolov3ObjectDetector object.
To perform transfer learning, you can use a pretrained deep learning network as the base network for the YOLO v3 deep learning network. Configure the YOLO v3 deep learning network for training on a new dataset by specifying the anchor boxes and the new object classes. Use the yolov3ObjectDetector object to create a YOLO v3 detection network from any pretrained CNN, such as SqueezeNet, and perform transfer learning. For a list of pretrained CNNs, see Pretrained Deep Neural Networks (Deep Learning Toolbox).
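Anchor boxes for a new dataset are typically estimated by clustering the widths and heights of the training bounding boxes. The following is a minimal sketch of that idea (k-means with 1 − IoU as the distance, the approach introduced in the YOLO9000 paper cited below); the Python function names and the sample data are illustrative, not a toolbox API:

```python
import random

def iou_wh(box, anchor):
    """IoU of two boxes aligned at a common center (width/height only)."""
    inter = min(box[0], anchor[0]) * min(box[1], anchor[1])
    union = box[0] * box[1] + anchor[0] * anchor[1] - inter
    return inter / union

def estimate_anchors(wh_list, k, iters=50, seed=0):
    """k-means on training box dimensions using 1 - IoU as the distance."""
    rng = random.Random(seed)
    anchors = rng.sample(wh_list, k)
    for _ in range(iters):
        # Assign each training box to the anchor with the highest IoU.
        clusters = [[] for _ in range(k)]
        for wh in wh_list:
            best = max(range(k), key=lambda i: iou_wh(wh, anchors[i]))
            clusters[best].append(wh)
        # Update each anchor to the mean size of its cluster.
        anchors = [
            (sum(w for w, _ in c) / len(c), sum(h for _, h in c) / len(c))
            if c else anchors[i]
            for i, c in enumerate(clusters)
        ]
    return sorted(anchors)

# Two clearly separated size groups in the training boxes yield two anchors:
boxes = [(10, 12), (11, 10), (9, 11), (50, 60), (55, 52), (48, 58)]
print(estimate_anchors(boxes, k=2))
```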
To learn how to create a custom YOLO v3 object detector by using a deep learning network as the base network and train it for object detection, see the Object Detection Using YOLO v3 Deep Learning example.
You can use the Image Labeler or Ground Truth Labeler (Automated Driving Toolbox) apps to interactively label images and export the label data for training. The apps can be used to label rectangular regions of interest (ROIs) for object detection, scene labels for image classification, and pixels for semantic segmentation. To create training data from the ground truth object exported by any of the labelers, use the objectDetectorTrainingData or pixelLabelTrainingData functions. For more details, see Training Data for Object Detection and Semantic Segmentation.
 Redmon, Joseph, and Ali Farhadi. “YOLO9000: Better, Faster, Stronger.” In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 6517–25. Honolulu, HI: IEEE, 2017. https://doi.org/10.1109/CVPR.2017.690.
Redmon, Joseph, Santosh Divvala, Ross Girshick, and Ali Farhadi. “You Only Look Once: Unified, Real-Time Object Detection.” In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 779–88. Las Vegas, NV: IEEE, 2016.