Object detection has traditionally been treated as a two-part process: first, identifying the objects in an image and classifying them, and second, locating them in the input image by drawing a bounding box around each identified object.

Before R-CNN was introduced, most object detection algorithms were complex ensemble models that combined multiple low-level image features with high-level context.

CNNs were not widely used for object detection at the time; most ensemble models relied on Histogram of Oriented Gradients (HOG) and Scale-Invariant Feature Transform (SIFT) features. At the ILSVRC 2012 workshop, this prompted a debate over whether the CNN classification results obtained on the ImageNet challenge would carry over to object detection results on the PASCAL VOC challenge.

This paper was the first to bridge the gap between image classification and object detection, showing that CNNs can lead to dramatically higher object detection performance on PASCAL VOC than systems based on simpler HOG-like features. Doing so meant solving two problems:

  1. Localizing objects with a deep network
  2. Training a high-capacity model with only a small quantity of annotated detection data

As stated above, object detection combines two tasks: a classification task, in which we assign each object its class label, and a localization task, in which we draw a bounding box around it.

To address the localization problem faced by CNNs, the authors adopt the “recognition using regions” paradigm, which works for both object detection and semantic segmentation. At test time, the method generates around 2000 category-independent region proposals from the input image, extracts a fixed-length feature vector from each proposal using a CNN, and then classifies each region with category-specific linear Support Vector Machines (SVMs). The authors use affine image warping to compute a fixed-size CNN input from each region proposal, regardless of the region’s shape.

Source: Rich feature hierarchies for accurate object detection and semantic segmentation, Authors: Ross Girshick, Jeff Donahue, Trevor Darrell, Jitendra Malik

Another problem facing object detection at the time was the scarcity of labelled data, which was insufficient for training a large CNN. The authors show that supervised pre-training on a large auxiliary dataset such as ImageNet, followed by domain-specific fine-tuning on a smaller dataset such as PASCAL VOC, is an effective paradigm for learning high-capacity CNNs when data is scarce.

Object Detection using R-CNN

The entire model consists of three modules:

  • First, a category-independent region proposal module, which defines the set of candidate detections available to the detector.
  • Second, a large CNN that extracts a fixed-length feature vector from each region.
  • Third, a set of class-specific linear SVMs.

Module Designs

Region Proposals

The authors use selective search to generate region proposals, which enables a controlled comparison with prior detection work.

Feature Extraction

A 4096-dimensional feature vector is extracted from each region proposal using AlexNet. Features are computed by forward propagating a mean-subtracted 227 x 227 RGB image through five convolutional layers and two fully connected layers.

In order to compute features for a region proposal, the authors first convert the image data in the region into a form that is compatible with the CNN (a fixed input size of 227 x 227):

Regardless of the size or aspect ratio of the candidate region, we warp all pixels in a tight bounding box around it to the required size. Prior to warping, we dilate the tight bounding box so that at the warped size there are exactly p pixels of warped image context around the original box (we use p = 16).
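The dilation step can be worked out from the numbers quoted above. The following is a minimal sketch, not the authors' code: the function name `dilate_box` and the `(x1, y1, x2, y2)` box convention are assumptions made for illustration. Since the warped image is 227 pixels wide with p = 16 pixels of context on each side, the original box must land on the inner 227 - 2 * 16 = 195 pixels, which fixes how much padding to add in source-image pixels.

```python
def dilate_box(box, warped_size=227, context=16):
    """Expand a tight box so that, after warping to warped_size x warped_size,
    the original box is surrounded by `context` warped pixels on every side.

    `box` is (x1, y1, x2, y2) in source-image coordinates (assumed convention).
    """
    x1, y1, x2, y2 = box
    w, h = x2 - x1, y2 - y1
    # The original box has to map onto the inner region of the warped image.
    inner = warped_size - 2 * context
    # Padding is computed per axis because the warp scales width and height
    # independently (it does not preserve the aspect ratio).
    pad_x = context * w / inner
    pad_y = context * h / inner
    return (x1 - pad_x, y1 - pad_y, x2 + pad_x, y2 + pad_y)


# A 195 x 195 box is warped at scale 1, so it gains exactly 16 pixels per side.
print(dilate_box((100, 50, 295, 245)))  # (84.0, 34.0, 311.0, 261.0)
```

The dilated box can extend past the image border (negative coordinates, for example). How those missing pixels are filled is left out of this sketch.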

Test Time Detection

At test time, the fast mode of selective search is used to extract around 2000 region proposals from the input image. Each proposal is warped and forward propagated through the CNN to compute features. Then, for each class, each extracted feature vector is scored using the SVM trained for that class. Once all regions in an image have been scored, greedy non-maximum suppression is applied (independently for each class), rejecting a region if it has an intersection-over-union (IoU) overlap with a higher-scoring selected region larger than a learned threshold.
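Greedy non-maximum suppression for a single class can be sketched in a few lines of plain Python. This is an illustrative version under stated assumptions: boxes are `(x1, y1, x2, y2)` tuples, the names `iou` and `greedy_nms` are hypothetical, and the default threshold of 0.3 is only a placeholder for the learned per-class threshold mentioned above.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def greedy_nms(boxes, scores, iou_threshold=0.3):
    """Return the indices of the boxes kept for one class, best score first.

    iou_threshold is a placeholder value, not the learned threshold.
    """
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Reject box i if it overlaps an already selected, higher-scoring box
        # by more than the threshold.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
print(greedy_nms(boxes, scores))  # [0, 2]: the second box overlaps the first
```

Running this once per class, on that class's scores only, matches the "independently for each class" wording above.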

Run-time Analysis

Two properties make detection efficient. First, all CNN parameters are shared across all classes. Second, the feature vectors computed by the CNN are low-dimensional compared to other common approaches, such as spatial pyramids with bag-of-visual-words encodings. As a result, the time spent computing region proposals and features is amortized over all classes. The only class-specific computations arise in the third module: the dot products between feature vectors and SVM weights, and non-maximum suppression. In practice, all of the dot products for an image are batched into a single matrix-matrix product, whose cost depends on the number of classes: the feature matrix is typically 2000 x 4096 and the SVM weight matrix is 4096 x N, where N is the number of classes.
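The batching described here amounts to one matrix product per image. A small NumPy sketch follows, using random stand-in data rather than real CNN features or trained SVM weights; the function name `score_regions` is an assumption for illustration.

```python
import numpy as np


def score_regions(features, svm_weights):
    """Score every region against every class with one matrix-matrix product.

    features:    (num_regions, feat_dim) array, one CNN feature vector per row.
    svm_weights: (feat_dim, num_classes) array, one linear SVM per column.
    Returns a (num_regions, num_classes) array of scores.
    """
    return features @ svm_weights


# Random stand-ins with the shapes quoted in the text (N = 20 classes here).
rng = np.random.default_rng(0)
features = rng.standard_normal((2000, 4096)).astype(np.float32)
svm_weights = rng.standard_normal((4096, 20)).astype(np.float32)

scores = score_regions(features, svm_weights)
print(scores.shape)  # (2000, 20)
```

Entry `scores[i, c]` is the dot product of region i's feature vector with class c's SVM weights, so adding classes only widens the weight matrix and leaves the feature computation untouched.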

This matters because it shows that R-CNN can scale to thousands of object classes without resorting to approximate techniques such as hashing. Even with 100k classes, the resulting matrix multiplication takes only about 10 seconds on a modern multi-core CPU.

Training

The CNN was pre-trained on a large auxiliary dataset, ImageNet, using image-level annotations only, as bounding-box labels were not available for this data. Using the Caffe CNN library, the authors nearly matched the results of Krizhevsky et al., with a top-1 error rate 2.2 percentage points higher, which they attribute to simplifications in the training process.

To adapt the CNN to the detection task, the authors continued Stochastic Gradient Descent (SGD) training of the CNN parameters using only warped region proposals. They replaced the ImageNet-specific 1000-way classification layer with a randomly initialized (N+1)-way classification layer, where N is the number of object classes and the extra output is for the background class. N = 20 for the VOC dataset and N = 200 for the ILSVRC2013 dataset.

We treat all region proposals with ≥ 0.5 IoU overlap with a ground-truth box as positives for that box’s class and the rest as negatives. We start SGD at a learning rate of 0.001 (1/10th of the initial pre-training rate), which allows fine-tuning to make progress while not clobbering the initialization. In each SGD iteration, we uniformly sample 32 positive windows (over all classes) and 96 background windows to construct a mini-batch of size 128. We bias the sampling towards positive windows because they are extremely rare compared to background.
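The labelling rule and the biased sampling can be sketched as follows. This is an illustration under several assumptions that are not taken from the paper: the function names are hypothetical, label 0 stands for background, a proposal that overlaps several ground-truth boxes takes the class of the one with the highest IoU, and windows are drawn with replacement so that an image pool with fewer than 32 positives can still fill the batch.

```python
import random


def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def finetune_label(proposal, gt_boxes, gt_classes, pos_iou=0.5):
    """Class (1..N) of the best-overlapping ground-truth box if its IoU is
    at least pos_iou, otherwise 0 for background (assumed convention)."""
    best_iou, best_cls = 0.0, 0
    for box, cls in zip(gt_boxes, gt_classes):
        overlap = iou(proposal, box)
        if overlap > best_iou:
            best_iou, best_cls = overlap, cls
    return best_cls if best_iou >= pos_iou else 0


def sample_minibatch(labels, num_pos=32, num_bg=96, rng=random):
    """Pick 32 positive and 96 background window indices (128 in total).

    Sampling is with replacement (an assumption); both pools must be non-empty.
    """
    pos = [i for i, label in enumerate(labels) if label > 0]
    bg = [i for i, label in enumerate(labels) if label == 0]
    return rng.choices(pos, k=num_pos) + rng.choices(bg, k=num_bg)


labels = [finetune_label(p, [(0, 0, 10, 12)], [3])
          for p in [(0, 0, 10, 10), (50, 50, 60, 60), (70, 70, 90, 90)]]
print(labels)  # [3, 0, 0]
batch = sample_minibatch(labels, rng=random.Random(0))
print(len(batch))  # 128
```

A quarter of every mini-batch is positive even though positives are a small fraction of all proposals, which is the bias the quoted passage describes.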

Object Class Classifiers

Consider training a binary classifier to detect cars. An image region tightly enclosing a car is clearly a positive example, and a background region containing no car is clearly a negative example. It is less clear how to label a region that partially overlaps a car. This is resolved with an IoU overlap threshold, below which regions are defined as negatives. The value 0.3 was selected by a grid search over {0, 0.1, ..., 0.5} on a validation set. Choosing this threshold carefully matters: the authors report that setting it to 0.5 decreased mAP by 5 points, and setting it to 0 decreased mAP by 4 points.
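For SVM training, the labelling rule differs from the fine-tuning one. The sketch below reflects my reading of the paper, where only ground-truth boxes serve as positives and proposals at or above the threshold that are not ground truth are left out of training; treat those two details, and the hypothetical name `svm_label`, as assumptions rather than something stated in the text above.

```python
def svm_label(is_ground_truth, max_iou_with_class, neg_iou=0.3):
    """Label one region for one class's binary SVM.

    Returns +1 for a positive, -1 for a negative, and None for a region
    that is ignored during training (assumed handling of the grey zone).
    """
    if is_ground_truth:
        return 1
    if max_iou_with_class < neg_iou:
        return -1
    return None


print(svm_label(True, 1.0))    # 1
print(svm_label(False, 0.1))   # -1
print(svm_label(False, 0.6))   # None
```

The argument `max_iou_with_class` is the region's largest IoU with any ground-truth instance of the class in question, which an IoU helper would compute beforehand.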

Once features are extracted and training labels are applied, we optimize one linear SVM per class. Since the training data is too large to fit in memory, we adopt the standard hard negative mining method. Hard negative mining converges quickly and in practice mAP stops increasing after only a single pass over all images.

Disadvantages of R-CNN

R-CNN introduced a new way of approaching object detection, but it had several drawbacks that were later addressed by Fast R-CNN and Faster R-CNN. The main disadvantages were:

  • Training is a multi-stage pipeline.
  • Training is expensive in space and time. For SVM and bounding-box regressor training, features are extracted from each warped object proposal in each image and written to disk. With VGG16, this takes 2.5 GPU-days for the 5k images of the VOC07 trainval set, and the features require hundreds of gigabytes of storage. Its successors reduce this cost by sharing convolutional computation across proposals and performing classification in memory.
  • Test-time detection is slow, because features are extracted from each warped proposal in each image (47 s/image with VGG16, as reported in the Fast R-CNN paper). Fast R-CNN speeds this up with single-scale testing and SVD-compressed fully connected layers.