Originally proposed in 2015, the YOLO architecture is one of the state-of-the-art approaches to object detection in terms of both accuracy and inference speed, and the initial model has been improved upon many times since its introduction. What makes YOLO interesting is that it reframes the object detection task from classification into regression, which makes detection faster.
In this topic, we will look at the main components of the first YOLO and discuss some of the later developments.
The setup
Before going into the architectural specifics of the first YOLO model, we have to look back.
The pre-YOLO approach formulated object detection as a classification problem: various regions of the input image were fed into a classifier to determine whether a certain object was present. Think of a sliding window that shifts evenly across the whole image and takes a specific fragment at every step to run a classifier. The issue is that this is a tedious process that wastes a lot of computation: most windows contain no object at all, and we don't know in advance where an object might be.
Then, R-CNN came around in 2013: it first generates potential regions of interest, and then runs those regions through a classifier (with some post-processing performed afterwards). Both of the described pipelines are pretty cumbersome. What if, instead, object detection were re-formulated as a single regression problem, where an image is fed into a CNN that outputs a vector of class probabilities and bounding box coordinates? That is essentially what You Only Look Once does: it goes straight from the pixels to a single output, in one CNN pass.
Preliminary components
First, the input image is divided into an $S \times S$ grid, and if the center of an object falls into a particular cell, that cell is responsible for detecting the object. Suppose we have a synthetic task of predicting whether there is a billboard or an anime girl in the image (thus, having $C = 2$ classes).
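To make the responsibility rule concrete, here is a minimal sketch; the grid size and the normalized-coordinate box format are assumptions made for illustration:

```python
S = 7  # the image is divided into an S x S grid

def responsible_cell(x_center: float, y_center: float, s: int = S) -> tuple[int, int]:
    """Return the (row, col) of the grid cell containing the object's center."""
    row = min(int(y_center * s), s - 1)  # clamp in case the center is exactly 1.0
    col = min(int(x_center * s), s - 1)
    return row, col

# An object centered at (0.53, 0.41) in normalized image coordinates:
print(responsible_cell(0.53, 0.41))  # (2, 3)
```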
Each cell, in turn, predicts $B$ bounding boxes and their confidence scores. The confidence score shows a) how confident the model is that the bounding box contains an object and b) how accurate it thinks the predicted box is. Confidence is defined as

$$\Pr(\text{Object}) \cdot \mathrm{IOU}^{\text{truth}}_{\text{pred}}$$
If there is no object in the cell, the confidence score is 0. Otherwise, the confidence score equals the intersection over union (IOU) between the predicted bounding box and the ground truth. IOU is the area of overlap between the two boxes divided by the area of their union.
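Here is a minimal IOU sketch for axis-aligned boxes; the (x1, y1, x2, y2) corner format is an assumption made for illustration:

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    # Coordinates of the intersection rectangle.
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

print(iou((0, 0, 2, 2), (1, 1, 3, 3)))  # 1 / 7 ≈ 0.14
```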
Each bounding box has five predictions: x, y, w, h, and confidence (defined previously). In addition, each grid cell predicts C conditional class probabilities.
(x, y) is the center of the box relative to the bounds of the grid cell, while the width (w) and height (h) are predicted relative to the whole image. The conditional class probabilities, $\Pr(\text{Class}_i \mid \text{Object})$, are conditioned on the grid cell containing an object, and there is only one set of them per grid cell. In the illustration below, $C = 2$:
What happens when a single grid cell predicts multiple boxes? The box predictions are stacked: with B = 2, the cell outputs two sets of (confidence, x, y, w, h) but still only a single set of C class probabilities.
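To make the per-cell layout concrete, below is a hedged sketch of slicing one cell's prediction vector with B = 2 and C = 2; the flat ordering of the values is an assumption, since implementations differ:

```python
import numpy as np

B, C = 2, 2  # two boxes per cell, two classes (billboard, anime girl)

# One cell's raw prediction vector: B * 5 box values followed by C class
# probabilities. The exact ordering is an assumption for illustration.
cell = np.array([
    0.9, 0.48, 0.51, 0.30, 0.40,   # box 1: confidence, x, y, w, h
    0.2, 0.55, 0.45, 0.10, 0.15,   # box 2: confidence, x, y, w, h
    0.7, 0.3,                      # Pr(class_i | object), shared by both boxes
])

boxes = cell[:B * 5].reshape(B, 5)  # (confidence, x, y, w, h) per box
class_probs = cell[B * 5:]          # one set per cell, not per box

# At test time, multiplying box confidence by the conditional class
# probabilities gives class-specific scores: Pr(class_i) * IOU.
scores = boxes[:, :1] * class_probs  # shape (B, C)
print(scores)
```

These class-specific scores are what the post-processing step described next operates on.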
Non-maximum suppression (NMS) is used in the post-processing stage: after the network makes a forward pass over the image, but before the final detections are confirmed. YOLO predicts multiple bounding boxes per grid cell, varying in position and size, and not all of them are valid detections - some are false positives. NMS compares the scores of all bounding boxes and suppresses those whose scores fall below a certain threshold or that overlap significantly (in terms of intersection over union) with higher-ranking boxes. The bounding box with the highest score is kept as the final prediction for a given object.
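A minimal sketch of greedy NMS, reusing the iou() helper sketched above; the score and overlap thresholds are illustrative:

```python
def nms(boxes, scores, score_thresh=0.5, iou_thresh=0.45):
    """Return indices of boxes kept after NMS; boxes are (x1, y1, x2, y2)."""
    # Discard low-scoring boxes, then sort the rest by score, descending.
    order = [i for i in sorted(range(len(boxes)), key=lambda i: -scores[i])
             if scores[i] >= score_thresh]
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        # Suppress remaining boxes that overlap the best box too much.
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_thresh]
    return keep

boxes = [(0, 0, 10, 10), (1, 1, 10, 10), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # [0, 2]: the second box overlaps the first too much
```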
The architecture
The sequence of layers in YOLO is pretty standard for a CNN and was inspired by GoogLeNet, but it does not include the Inception modules; instead, there are 1×1 reduction layers followed by 3×3 convolutions. There are 24 CONV layers in total and 2 FC layers. The CONV layers are responsible for feature extraction, while the FC layers predict the coordinates and the probabilities. The first 20 CONV layers are pretrained on ImageNet. The loss is the sum-squared error (chosen for its ease of optimization), although it does not perfectly align with the goal of maximizing average precision. The loss only penalizes classification error if an object is present in that grid cell, and it only penalizes bounding box coordinate error if that predictor is "responsible" for the ground truth box. It also weighs classification and localization errors the same way, which is not desirable, so the terms are re-weighted with λ_coord = 5 and λ_noobj = 0.5. The grid size (S) is 7 (a 7 × 7 grid), there are 2 bounding boxes per cell (B), and the number of classes (C) is 20. The output is therefore a 7 × 7 × 30 tensor: each of the 7 × 7 cells predicts a vector of length B · 5 + C = 2 · 5 + 20 = 30.
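For reference, here is the multi-part sum-squared loss from the YOLOv1 paper, where $\mathbb{1}_{ij}^{\text{obj}}$ indicates that box predictor $j$ in cell $i$ is responsible for the object, and $\lambda_{\text{coord}} = 5$, $\lambda_{\text{noobj}} = 0.5$ perform the re-weighting mentioned above:

$$
\begin{aligned}
\mathcal{L} ={}& \lambda_{\text{coord}} \sum_{i=0}^{S^2}\sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}}\left[(x_i-\hat{x}_i)^2+(y_i-\hat{y}_i)^2\right] \\
&+ \lambda_{\text{coord}} \sum_{i=0}^{S^2}\sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}}\left[\left(\sqrt{w_i}-\sqrt{\hat{w}_i}\right)^2+\left(\sqrt{h_i}-\sqrt{\hat{h}_i}\right)^2\right] \\
&+ \sum_{i=0}^{S^2}\sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}}\left(C_i-\hat{C}_i\right)^2
+ \lambda_{\text{noobj}} \sum_{i=0}^{S^2}\sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{noobj}}\left(C_i-\hat{C}_i\right)^2 \\
&+ \sum_{i=0}^{S^2} \mathbb{1}_{i}^{\text{obj}} \sum_{c \in \text{classes}} \left(p_i(c)-\hat{p}_i(c)\right)^2
\end{aligned}
$$

The square roots of w and h soften, though do not fully fix, the imbalance between errors in large and small boxes discussed below.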
YOLOv1 has certain limitations. Since each grid cell predicts only two bounding boxes and a single class, strong spatial constraints are imposed: the number of nearby objects that can be detected is limited, so small objects appearing in large groups (e.g., flocks of birds) are detected poorly. Another problem is detecting objects with unusual aspect ratios (different from those present in the training set). The last issue comes down to the fact that the loss treats errors in large and small bounding boxes the same way, even though a small error in a small bounding box has a great effect on the intersection over union, while the same error in a large bounding box barely has any. The main source of errors is localization.
In the next section, we will take a look at the further developments that address some of the aforementioned issues.
YOLO model zoo
In this section, we will do a brief version overview, but it's important to note that not all models are included (people really love YOLO, and there are simply too many of them). Sometimes there was no official paper released (e.g., YOLOv5), sometimes the application is too specific (e.g., YOLOv6, which deals with industrial applications), and sometimes we would have to delve way too deep into the context, which is not possible in the current scope. As a side note, YOLOv3 has a very entertaining paper that we highly recommend checking out.
At the time of writing, the current SOTA is YOLO-NAS.
| Version | New features |
| --- | --- |
| YOLOv2 | Anchor boxes, addition of BatchNorm, introduction of the Darknet-19 backbone and detection over 9000 classes (instead of the previous 20), multi-scale training, direct location prediction |
| YOLOv3 | "So here’s the deal with YOLOv3: We mostly took good ideas from other people. We also trained a new classifier network that’s better than the other ones." (source: the YOLOv3 paper) |
| YOLOv4 | Bag of freebies (BoF), bag of specials (BoS), the CIOU loss |
YOLOv1 was not optimal at detecting objects at various scales, which YOLOv2 addresses. Anchor boxes are predefined boxes with certain heights and widths, used to detect objects of different shapes and sizes: they provide a way to predict bounding boxes by aligning detected objects with the anchors that match them best. The dimensions of the anchor boxes are determined by running a clustering algorithm on the bounding boxes of the training set. Another improvement is multi-scale training: during training, the network is randomly resized every few iterations, which enables the model to predict objects of various scales and sizes. The CIOU (Complete intersection over union) loss from v4 provides better localization by taking into account not only the overlap area but also the distance between box centers and the aspect ratio.
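As a hedged sketch of the dimension-clustering idea from YOLOv2 (k-means over box widths and heights with a 1 − IOU distance; the function names and parameters are illustrative):

```python
import numpy as np

def iou_wh(wh, centroids):
    """IOU between a (w, h) pair and each centroid, as if boxes share a corner."""
    inter = np.minimum(wh[0], centroids[:, 0]) * np.minimum(wh[1], centroids[:, 1])
    union = wh[0] * wh[1] + centroids[:, 0] * centroids[:, 1] - inter
    return inter / union

def kmeans_anchors(box_wh, k=5, iters=100, seed=0):
    """Cluster (w, h) pairs using the 1 - IOU distance to get anchor dimensions."""
    box_wh = np.asarray(box_wh, dtype=float)
    rng = np.random.default_rng(seed)
    centroids = box_wh[rng.choice(len(box_wh), k, replace=False)]
    for _ in range(iters):
        # Assign each box to the centroid with the smallest 1 - IOU distance.
        labels = np.array([np.argmin(1 - iou_wh(wh, centroids)) for wh in box_wh])
        # Move each centroid to the mean (w, h) of its assigned boxes.
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = box_wh[labels == j].mean(axis=0)
    return centroids
```

Using 1 − IOU instead of Euclidean distance keeps large boxes from dominating the clustering, since the distance is scale-aware.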
Conclusion
As a result, you are familiar with the main components of YOLO, the general architecture, some of the initial limitations of the model, and a few features that were introduced later on to address the limitations.