Video Object Detection using Faster R-CNN

I implement the work of Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun, "Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks", to realize real-time object detection. That work focuses on the image-level problem. Here, I extend it to the video-level problem by treating a video as a series of frames and also taking the relation between consecutive frames into account. A tracker follows the detections frame by frame, and the final result is visualized.
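The text above does not say which tracker is used, so the following is only a minimal sketch of one common way to relate detections across frames: greedy matching by intersection over union (IoU). The function names `iou` and `link_detections` and the 0.5 threshold are hypothetical choices for illustration, not the project's actual tracker.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def link_detections(frames, iou_threshold=0.5):
    """Greedily link per-frame boxes into tracks (hypothetical helper).

    frames: list of lists of boxes, one inner list per video frame.
    Returns a list of tracks; each track is a list of (frame_index, box).
    """
    tracks = []
    for t, boxes in enumerate(frames):
        unmatched = list(boxes)
        for track in tracks:
            last_t, last_box = track[-1]
            if last_t != t - 1 or not unmatched:
                continue  # track ended earlier, or nothing left to match
            best = max(unmatched, key=lambda b: iou(last_box, b))
            if iou(last_box, best) >= iou_threshold:
                track.append((t, best))
                unmatched.remove(best)
        # Every box that continued no track starts a new one.
        tracks.extend([[(t, b)] for b in unmatched])
    return tracks


# Made-up detections for three frames: two objects, one of which disappears.
frames = [
    [(0, 0, 10, 10), (50, 50, 60, 60)],
    [(1, 1, 11, 11), (52, 50, 62, 60)],
    [(2, 2, 12, 12)],
]
tracks = link_detections(frames)
print([len(track) for track in tracks])
```

With these made-up boxes the sketch produces two tracks, of length 3 and 2. A real tracker would also handle missed detections and class labels, which this sketch ignores.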


Object detection is a long-standing problem. Many applications need object detection techniques, such as IoT devices and self-driving cars. So here we're going to introduce the state of the art: Faster R-CNN, which achieves high accuracy and can be used in real time.

Region-based Convolutional Neural Network

Region-based Convolutional Neural Network, aka R-CNN, is a visual object detection system that combines bottom-up region proposals with features computed by a convolutional neural network.
R-CNN first computes region proposals with a technique such as selective search, and then feeds the candidates to the convolutional neural network for classification. Here's the system flow of R-CNN:


However, it has some notable disadvantages:
  1. Training is a multi-stage pipeline. (three stages: fine-tuning the ConvNet, fitting SVMs, and learning bounding-box regressors)
  2. Training is expensive in space. (the multi-stage pipeline writes the features of every proposal to disk)
  3. Object detection is slow. Detection with VGG16 takes 47s / image (on a GPU).
R-CNN is slow because it performs a ConvNet forward pass for each object proposal, without sharing computation. Many works have been proposed to solve this problem, such as spatial pyramid pooling networks (SPPnets), which try to avoid repeatedly computing the convolutional features.
Thus, R-CNN is not good enough for practical applications, even though its accuracy is barely adequate.

Fast R-CNN

"Fast" R-CNN is easy to understand from its name: it's quite fast, achieving 0.3s per image for detection when the region proposal time is ignored. Well, how does this magic work?
The most important factor is that it shares computation. After the region proposal step, we get a set of bounding boxes. The previous algorithm simply feeds each warped image region to the CNN. That is, if we have 2000 proposals, we have to do 2000 forward passes, which wastes a lot of time. Instead, we can exploit the relation between these proposals: many of them overlap, so the overlapping parts are fed into the CNN many times. Maybe we can compute them just once.
Fast R-CNN utilizes this property well. Here's the illustration of how it really works:


First, we feed the whole image into the ConvNet (up to conv5). Then comes the magic: convolutional layers preserve the spatial relation between adjacent pixels. That is, the topmost pixels still fall on the topmost part of the conv5 feature map. Based on this, we can project coordinates in the raw image to the corresponding neurons in conv5! In this way, the image goes through the ConvNet only once. The feature of each bounding box is then fed into the RoI pooling layer, which is a special case of the spatial pyramid pooling layer. The rest of the pipeline is similar to the previous work.
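The projection and pooling steps can be sketched in NumPy as follows. This is a simplified illustration, not the reference implementation: the function name `roi_max_pool`, the stride of 16 between the image and the feature map, and the rounding rules are assumptions made for the example.

```python
import numpy as np


def roi_max_pool(feature_map, roi, stride=16, out_size=2):
    """Max-pool one region of interest to a fixed out_size x out_size grid.

    feature_map: array of shape (C, H, W), e.g. the conv5 output.
    roi: (x1, y1, x2, y2) in raw-image pixel coordinates.
    stride: assumed total downsampling factor between image and feature map.
    """
    C, H, W = feature_map.shape
    # Project image coordinates onto the feature map.
    x1, y1 = int(np.floor(roi[0] / stride)), int(np.floor(roi[1] / stride))
    x2, y2 = int(np.ceil(roi[2] / stride)), int(np.ceil(roi[3] / stride))
    x1, y1 = min(max(x1, 0), W - 1), min(max(y1, 0), H - 1)
    x2, y2 = min(max(x2, x1 + 1), W), min(max(y2, y1 + 1), H)

    # Split the projected window into out_size x out_size bins.
    xs = np.linspace(x1, x2, out_size + 1)
    ys = np.linspace(y1, y2, out_size + 1)
    pooled = np.zeros((C, out_size, out_size), dtype=feature_map.dtype)
    for i in range(out_size):
        for j in range(out_size):
            r0, r1 = int(np.floor(ys[i])), int(np.ceil(ys[i + 1]))
            c0, c1 = int(np.floor(xs[j])), int(np.ceil(xs[j + 1]))
            r1, c1 = max(r1, r0 + 1), max(c1, c0 + 1)  # never an empty bin
            pooled[:, i, j] = feature_map[:, r0:r1, c0:c1].max(axis=(1, 2))
    return pooled


# The feature map is computed once; every proposal only indexes into it.
features = np.arange(16, dtype=np.float32).reshape(1, 4, 4)
for roi in [(0, 0, 64, 64), (32, 32, 64, 64)]:
    print(roi, roi_max_pool(features, roi)[0].tolist())
```

Whatever the size of the proposal, the output has the same fixed shape, which is what lets the following fully connected layers work on every box.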
The following figure is a clearer illustration of Fast R-CNN:


Faster R-CNN

What is Faster R-CNN?
Yes, you're right: it's a faster version of Fast R-CNN. (The authors are quite straightforward, right?) Faster R-CNN brings object detection toward real-time applications. Its region proposal step costs only about 10 ms per image, and the whole system, proposals included, runs at about 5 fps on a GPU with VGG-16. If you know Fast R-CNN well, Faster R-CNN won't be too difficult for you. It replaces the previous region proposal method with a so-called "Region Proposal Network" (RPN), so that the whole pipeline becomes a single network. Faster R-CNN looks like this:

Left: an overview of the Faster R-CNN network. Right: the region proposal network. Different anchors are applied at each sliding-window position.

A Region Proposal Network (RPN) takes an image feature map as input and outputs a set of rectangular object proposals, each with an objectness score. Faster R-CNN models this process with a fully convolutional network. The RPN decides whether the current region (generated from the sliding window and the different anchors) is an object or background (the objectness score) and performs bounding box regression.
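To make the anchors and the bounding box regression concrete, here is a small NumPy sketch. The three scales (128, 256, 512) and three aspect ratios (1:2, 1:1, 2:1) are the defaults reported in the Faster R-CNN paper; the function names `make_anchors` and `apply_deltas`, the stride of 16, and the choice of placing anchor centres in the middle of each feature-map cell are assumptions made for this example.

```python
import numpy as np


def make_anchors(feat_h, feat_w, stride=16,
                 scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return anchors of shape (feat_h * feat_w * k, 4) as (x1, y1, x2, y2)."""
    # k = len(scales) * len(ratios) box shapes, each with area scale**2
    # and height / width equal to ratio.
    shapes = []
    for s in scales:
        for r in ratios:
            shapes.append((s / np.sqrt(r), s * np.sqrt(r)))  # (width, height)
    shapes = np.array(shapes)                                # (k, 2)

    # Centre of every sliding-window position, in image pixels (assumed).
    cx = (np.arange(feat_w) + 0.5) * stride
    cy = (np.arange(feat_h) + 0.5) * stride
    cx, cy = np.meshgrid(cx, cy)                             # (feat_h, feat_w)
    centres = np.stack([cx.ravel(), cy.ravel()], axis=1)     # (H*W, 2)

    half = shapes / 2.0
    x1y1 = centres[:, None, :] - half[None, :, :]            # (H*W, k, 2)
    x2y2 = centres[:, None, :] + half[None, :, :]
    return np.concatenate([x1y1, x2y2], axis=2).reshape(-1, 4)


def apply_deltas(anchors, deltas):
    """Shift and rescale anchors by regression outputs (tx, ty, tw, th)."""
    w = anchors[:, 2] - anchors[:, 0]
    h = anchors[:, 3] - anchors[:, 1]
    cx = anchors[:, 0] + 0.5 * w
    cy = anchors[:, 1] + 0.5 * h
    new_cx = cx + deltas[:, 0] * w        # tx, ty move the centre
    new_cy = cy + deltas[:, 1] * h
    new_w = w * np.exp(deltas[:, 2])      # tw, th rescale in log space
    new_h = h * np.exp(deltas[:, 3])
    return np.stack([new_cx - 0.5 * new_w, new_cy - 0.5 * new_h,
                     new_cx + 0.5 * new_w, new_cy + 0.5 * new_h], axis=1)


anchors = make_anchors(feat_h=2, feat_w=3)
print(anchors.shape)  # 2 * 3 positions, 9 anchors each
```

The regression branch never predicts absolute coordinates; it predicts the four offsets consumed by `apply_deltas`, relative to each anchor.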

How does RPN work?

Usually, the RPN takes an image feature map as input (e.g., conv5 in AlexNet). We slide a 3x3 window over the feature map. Note that although the window is only 3x3, its actual receptive field is quite large if you project it back to the raw input image. This operation is done by applying a 3x3 convolution with 256 output channels to the feature map, which gives an intermediate layer with a 256-dimensional vector at every position. The intermediate layer is then fed into two different branches: one for the objectness score (is the region an object or background?) and the other for regression (how should the bounding box change to better match the ground-truth box?).
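The data flow through the RPN head can be sketched with plain NumPy. This is an illustration of the shapes only, with random weights: the helper names `conv2d` and `rpn_head`, the small input channel count, and the use of k = 9 anchors per position are assumptions for the example. The two branches are written as 1x1 convolutions, which is how the paper implements them.

```python
import numpy as np


def conv2d(x, weight, bias):
    """'Same'-padded, stride-1 convolution (naive loop, for clarity).

    x: (C_in, H, W); weight: (C_out, C_in, kH, kW); bias: (C_out,).
    """
    c_out, c_in, kh, kw = weight.shape
    _, H, W = x.shape
    padded = np.pad(x, ((0, 0), (kh // 2, kh // 2), (kw // 2, kw // 2)))
    out = np.empty((c_out, H, W))
    for i in range(H):
        for j in range(W):
            window = padded[:, i:i + kh, j:j + kw]            # (C_in, kH, kW)
            out[:, i, j] = np.tensordot(weight, window, axes=3) + bias
    return out


def rpn_head(feature_map, params):
    """Return objectness logits (2k, H, W) and box deltas (4k, H, W)."""
    # 3x3 sliding window -> 256-d intermediate vector at every position.
    hidden = conv2d(feature_map, params["w_conv"], params["b_conv"])
    hidden = np.maximum(hidden, 0.0)                          # ReLU
    # Branch 1: object vs. background score for each of the k anchors.
    scores = conv2d(hidden, params["w_cls"], params["b_cls"])
    # Branch 2: (tx, ty, tw, th) for each of the k anchors.
    deltas = conv2d(hidden, params["w_reg"], params["b_reg"])
    return scores, deltas


rng = np.random.default_rng(0)
c_in, mid, k = 8, 256, 9   # c_in is kept small so the naive loop runs quickly
params = {
    "w_conv": rng.normal(0, 0.01, (mid, c_in, 3, 3)), "b_conv": np.zeros(mid),
    "w_cls": rng.normal(0, 0.01, (2 * k, mid, 1, 1)), "b_cls": np.zeros(2 * k),
    "w_reg": rng.normal(0, 0.01, (4 * k, mid, 1, 1)), "b_reg": np.zeros(4 * k),
}
scores, deltas = rpn_head(rng.normal(size=(c_in, 5, 7)), params)
print(scores.shape, deltas.shape)
```

For a 5x7 feature map this gives 18 score channels and 36 regression channels at each of the 35 positions, i.e. one object/background pair and one set of four offsets per anchor.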

Some conclusions:

Right now, Faster R-CNN gives us a real-time algorithm for object detection. We'll see more devices using this technique, such as surveillance cameras and mobile devices. We can even extend object detection on images to object detection on videos, which is more useful in our lives.

Train Faster R-CNN on your own dataset (e.g., ImageNet)

Faster R-CNN is really cool stuff. You can see my demo video, which is pretty amazing, right? It can even detect swimming trunks! If you want to train Faster R-CNN on your own dataset, you can refer to my tutorial.
Here is my result on ImageNet, where I train on validation set 1 and test on validation set 2.

I got 33.1% mAP on the 200 ImageNet categories.


Shaoqing Ren, Kaiming He, Ross Girshick, Jian Sun, "Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks"
Ross Girshick, "Fast R-CNN"
Ross Girshick, Jeff Donahue, Trevor Darrell, Jitendra Malik, "Rich feature hierarchies for accurate object detection and semantic segmentation"