YOLO (You Only Look Once)

Rajan Sharma
4 min read · Mar 21, 2020


1. What is YOLO

YOLO stands for You Only Look Once. It is an object detection algorithm that finds all the objects in an image/frame in a single shot: as the name says, it looks at the image/frame only once and is still able to detect every object in it. Object detection involves localizing the objects in an image/frame and predicting the class to which each of them belongs.

2. How it Works

Let's see how YOLO works and how it is able to detect all the objects in a single pass.

In machine learning we need data and the corresponding labels to feed to the machine so that it can learn from the data; later we can give the machine only data and ask it what the label for that data is. In the same way, in YOLO we first have to train our model so that it is later able to detect all the objects in an image/frame.

YOLO makes use of only convolutional layers, making it a fully convolutional network (FCN). In the YOLO v3 paper, the authors present a new, deeper feature extractor architecture called Darknet-53. As its name suggests, it consists of 53 convolutional layers, each followed by a batch normalization layer and a Leaky ReLU activation. No form of pooling is used; instead, a convolutional layer with stride 2 downsamples the feature maps. This helps prevent the loss of low-level features often attributed to pooling.
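As a rough illustration, here is a minimal PyTorch sketch of one such block (the helper name conv_bn_leaky and the channel sizes are my own choices for the example, not taken from the paper):

```python
import torch.nn as nn

def conv_bn_leaky(in_ch, out_ch, kernel_size, stride=1):
    """One Darknet-53-style block: convolution -> batch norm -> Leaky ReLU.
    A stride of 2 downsamples the feature map instead of pooling."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size,
                  stride=stride, padding=kernel_size // 2, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.LeakyReLU(0.1, inplace=True),
    )

# Stem of the network: a 3x3 conv, then a stride-2 conv to halve H and W.
stem = nn.Sequential(
    conv_bn_leaky(3, 32, kernel_size=3),             # 416x416 -> 416x416
    conv_bn_leaky(32, 64, kernel_size=3, stride=2),  # 416x416 -> 208x208
)
```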

3. Input and Output to YOLO Algorithm:

  1. The input is a batch of images.
  2. The output is a list of bounding boxes along with the recognized classes. Each bounding box is represented by 6 numbers (pc, bx, by, bh, bw, c). If we expand c into a 30-dimensional vector (one entry per class), each bounding box is then represented by 35 numbers, as shown in the sketch after this list.
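A quick sketch of this expansion, with hypothetical values and the 30-class example from above:

```python
import numpy as np

pc = 1.0                               # objectness score
bx, by, bh, bw = 0.5, 0.5, 0.2, 0.3    # box center and dimensions
c = 7                                  # predicted class index (out of 30)

one_hot = np.zeros(30)
one_hot[c] = 1.0                       # expand c into a 30-dimensional vector

box = np.concatenate([[pc, bx, by, bh, bw], one_hot])
print(box.shape)                       # (35,) -> 5 + 30 = 35 numbers per box
```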

In YOLO, the prediction is done by a convolutional layer which uses 1 x 1 convolutions. So, the first thing to notice is that our output is itself a feature map. Since we have used 1 x 1 convolutions, the size of the prediction map is exactly the size of the feature map before it. In YOLO v3, the way you interpret this prediction map is that each cell can predict a fixed number of bounding boxes.

For example, suppose we have (B x (5 + C)) entries in the feature map, where B represents the number of bounding boxes each cell can predict. According to the paper, each of these B bounding boxes may specialize in detecting a certain kind of object. Each of the bounding boxes has 5 + C attributes, which describe the center coordinates, the dimensions, the objectness score and the C class confidences of that bounding box. YOLO v3 predicts 3 bounding boxes for every cell.
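A minimal sketch of reading this prediction map, assuming a 13x13 grid, B = 3 boxes per cell and C = 30 classes (the tensor here is random; in practice it would be the output of the 1 x 1 convolution):

```python
import torch

S, B, C = 13, 3, 30                        # grid size, boxes per cell, classes
pred = torch.randn(1, B * (5 + C), S, S)   # stand-in for the 1x1 conv output

# Reinterpret the channel dimension as B boxes of (5 + C) attributes per cell.
boxes = pred.view(1, B, 5 + C, S, S).permute(0, 3, 4, 1, 2)
print(boxes.shape)                         # (1, 13, 13, 3, 35)
```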

4. How YOLO is trained:

Suppose we have an image with corresponding labels. The labels consist of the bounding box coordinates, the class to which the box belongs and the objectness score. Now let's see how YOLO learns from that. YOLO divides the image into a grid of cells based on the feature map: if we have a 13x13 feature map, it divides the image into 13x13 cells.
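For instance, finding the cell that contains a given box center is a simple calculation (the coordinates below are hypothetical, normalized to the image size):

```python
S = 13                          # 13x13 feature map -> 13x13 grid of cells

# Ground-truth box center in normalized image coordinates.
x_center, y_center = 0.42, 0.66

cell_col = int(x_center * S)    # column of the cell containing the center
cell_row = int(y_center * S)    # row of that cell
print(cell_row, cell_col)       # this cell is responsible for the object
```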

In our training image we know where our bounding box is, because we have its coordinates as labels; suppose we have only one object in the image. YOLO divides the image into 13x13 cells, and one of these cells contains the center of the bounding box, so that cell is responsible for predicting the object. For that cell the objectness score (pc) is 1; for every other cell the objectness score (pc) is 0. YOLO also uses some predefined anchors/bounding boxes, which are obtained from the COCO dataset. In YOLO v3 we have 3 anchor boxes with different dimensions for each cell. We then calculate the bounding box offsets so that our predicted box matches the dimensions of the ground truth bounding box.
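Putting this together, a minimal sketch of building the training target for one object might look like the following (the ground-truth values and anchor sizes are hypothetical, and I match the anchor by a simple width/height overlap; real implementations differ in details):

```python
import numpy as np

S, B, C = 13, 3, 30
target = np.zeros((S, S, B, 5 + C))    # objectness is 0 everywhere by default

# Hypothetical ground truth: center (0.42, 0.66), size (0.30, 0.25), class 7.
gx, gy, gw, gh, cls = 0.42, 0.66, 0.30, 0.25, 7
row, col = int(gy * S), int(gx * S)    # cell containing the box center

# Pick the predefined anchor whose shape best matches the ground-truth box.
anchors = np.array([[0.28, 0.22], [0.38, 0.48], [0.90, 0.78]])
inter = np.minimum(anchors[:, 0], gw) * np.minimum(anchors[:, 1], gh)
union = anchors[:, 0] * anchors[:, 1] + gw * gh - inter
best = int(np.argmax(inter / union))

target[row, col, best, 0] = 1.0        # objectness score pc = 1 for this cell
target[row, col, best, 1:5] = [gx, gy, gh, gw]
target[row, col, best, 5 + cls] = 1.0  # one-hot class label
```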

From the image above we can see that the image is divided into 13x13 cells, and one of the cells contains the center of the ground truth bounding box; that means this cell is responsible for the object, and corresponding to that cell we have 3 anchor boxes with different dimensions. Now we have to calculate the offsets for the predicted box so that it matches the dimensions of the ground truth bounding box.

Calculating the offsets involves a log-space transformation, which I will cover in a separate story. After calculating the offsets, we use IoU (Intersection over Union) and non-maximum suppression to keep only the bounding box that overlaps most with our ground truth bounding box; the other bounding boxes we can ignore.
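IoU itself is easy to compute. Here is a short sketch for two boxes given as corner coordinates (the box format and values are illustrative):

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

print(iou((0, 0, 10, 10), (5, 5, 15, 15)))  # 25 / 175 = ~0.14
```

Non-maximum suppression then repeatedly keeps the highest-scoring box and discards any remaining box whose IoU with it exceeds a threshold.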

5. Conclusion

So we saw how YOLO learns and how it is able to detect all the objects in an image in a single pass. This is not the only algorithm that detects objects in a single pass; there are others out there as well. But YOLO can be used for real-time object detection.
