Mask RCNN

Some Terms

Ground Truth boxes: The bounding boxes annotated in the original data.

Paper writing

Historically speaking,…
conjecture: to guess or speculate
Qualitative results: a few illustrative examples
elucidate: to expound; to make something clear

RCNN

Key contributions

Emphasized the importance of features.

"Features matter" is the first sentence of the RCNN paper.
Generalized the CNN classification results on ImageNet to object detection

by

bridging the gap between image classification and object detection.

Modules description

Region proposals generation: Selective search

Generates category-independent region proposals using Selective search (a traditional method that involves no deep learning).

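Input: image
Return: a set of proposals

Initialize regions: R = {r1, r2, ...}
Initialize similarity set: S = {}
for each neighbouring pair (ri, rj) in R:
    compute similarity s(ri, rj)
    S = S ∪ {s(ri, rj)}
while S is not empty:
    get s_max = s(ri, rj), the highest similarity in S
    rt = ri ∪ rj
    remove from S every similarity s(r*, r*) that involves ri or rj
    compute St, the similarities between rt and its neighbours
    S = S ∪ St
    R = R ∪ {rt}
return bounding box of each region in R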
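The grouping loop above can be sketched in Python. This is a minimal illustration, not the Selective search implementation: regions are approximated by their bounding boxes `(x1, y1, x2, y2)`, two regions count as neighbours when their boxes touch, and `similarity` is a toy size-based stand-in for the real combination of similarity measures. All function names here are hypothetical.

```python
from itertools import combinations

def area(b):
    return (b[2] - b[0]) * (b[3] - b[1])

def bbox_union(a, b):
    # Smallest box containing both a and b.
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))

def touches(a, b):
    # Assumption: regions are neighbours if their boxes overlap or share an edge.
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def similarity(a, b, image_area):
    # Toy stand-in: small regions are merged first.
    return 1.0 - (area(a) + area(b)) / image_area

def hierarchical_grouping(initial_boxes, image_area):
    regions = dict(enumerate(initial_boxes))      # region id -> box
    proposals = list(initial_boxes)               # R
    sims = {(i, j): similarity(regions[i], regions[j], image_area)
            for i, j in combinations(regions, 2)
            if touches(regions[i], regions[j])}   # S
    next_id = len(regions)
    while sims:
        i, j = max(sims, key=sims.get)            # most similar pair
        merged = bbox_union(regions[i], regions[j])
        # Drop every similarity that involves ri or rj.
        sims = {p: s for p, s in sims.items() if i not in p and j not in p}
        del regions[i], regions[j]
        # Add similarities between the merged region and its neighbours (St).
        for k, box in regions.items():
            if touches(box, merged):
                sims[(k, next_id)] = similarity(box, merged, image_area)
        regions[next_id] = merged
        proposals.append(merged)
        next_id += 1
    return proposals

boxes = [(0, 0, 2, 2), (2, 0, 4, 2), (0, 2, 4, 4)]
proposals = hierarchical_grouping(boxes, image_area=16)
```

Every intermediate merge is kept as a proposal, which is why the output contains regions at several scales.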

Feature extraction

Proposal transformation

tightest square with context (the context fills out the 227×227 input; here "context" means the parts of the image surrounding the proposal)
tightest square without context
Resize and zero padding

Train:

Positive data: proposals with IoU ≥ 0.5 against a ground-truth box, labeled with that box's class
Negative data: proposals with IoU < 0.5 (background)
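The positive/negative split depends on computing IoU between a proposal and the ground-truth boxes. A minimal sketch, assuming boxes in `(x1, y1, x2, y2)` format, at least one ground-truth box per image, and class id 0 for background (the helper names and the background convention are illustrative choices, not from the paper):

```python
def iou(box_a, box_b):
    """Intersection over Union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def label_proposals(proposals, gt_boxes, gt_classes, pos_thresh=0.5):
    """Give each proposal the class of its best-overlapping ground-truth box,
    or 0 (background, an assumed convention) if the overlap is below pos_thresh."""
    labels = []
    for p in proposals:
        overlaps = [iou(p, g) for g in gt_boxes]
        best = max(range(len(gt_boxes)), key=overlaps.__getitem__)
        labels.append(gt_classes[best] if overlaps[best] >= pos_thresh else 0)
    return labels

labels = label_proposals([(0, 0, 10, 10), (50, 50, 60, 60)],
                         gt_boxes=[(0, 0, 10, 12)], gt_classes=[3])
```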

A CNN model.

Input: $(227,227,3)$

Output features: $(4096)$

Class-specific linear SVMs (one trained independently per class)

For each proposal, the SVMs produce a predicted class and a corresponding confidence score. Greedy non-maximum suppression is then applied per class: a proposal is rejected if its bounding-box IoU (Intersection over Union) overlap with a higher-scoring selected region is larger than a learned threshold (pretty important according to the paper).

Train:

Positive data: ground-truth boxes
Negative data: proposals with IoU < 0.3
Ignored data: proposals with IoU > 0.3 that are not ground-truth boxes (using them as positives would add too many samples that do not emphasize precise localization)

Visualization method

Features

Feed a large set of held-out region proposals (10M+) through the network and record the output value of each feature unit. Then rank the 10M+ values of a unit and show the regions with the top 10 values, so that the unit can "speak for itself".

[Each row indicates the result of each feature unit]

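Ablation (removing components to measure their effect)

The convolutional layers have sufficient representational power for the image. (They stay closer to the original image.)
The FC layers are closer to task-specific features. Fine-tuning benefits the FC layers more because the features needed by different tasks differ. (The deep representation of the same image, by contrast, does not differ much.)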

Bounding box regression

Train

Training pairs: each region proposal $P = (P_x, P_y, P_w, P_h)$ is matched to the ground-truth box with which it has the biggest IoU, and is used only if that IoU > 0.6.

$f(P_x, P_y, P_w, P_h) = (\hat{G_x}, \hat{G_y}, \hat{G_w}, \hat{G_h})$
and
$(\hat{G_x}, \hat{G_y}, \hat{G_w}, \hat{G_h}) \approx (G_x, G_y, G_w, G_h)$

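shift: $\hat G_x = P_w d_x(P) + P_x \quad (1)$, $\hat G_y = P_h d_y(P) + P_y \quad (2)$
scale: $\hat G_w = P_w \exp(d_w(P)) \quad (3)$, $\hat G_h = P_h \exp(d_h(P)) \quad (4)$
Four learned functions: $d_*(P) = w_*^T \phi_5(P)$ for $* \in \{x, y, w, h\}$, where $\phi_5(P)$ are the pool5 features of proposal $P$. They are trained to match the regression targets $t_* = (t_x, t_y, t_w, t_h)$.

$w_* = \arg\min_{\hat w_*} \sum_i^N (t_*^i - \hat w_*^T \phi_5(P^i))^2 + \lambda \| \hat w_* \|^2$

$t_*$: the true transformation needed to map $P$ onto $G$, obtained by inverting (1)-(4): $t_x = (G_x - P_x)/P_w$, $t_y = (G_y - P_y)/P_h$, $t_w = \log(G_w/P_w)$, $t_h = \log(G_h/P_h)$.

$w_*$: the learned weights that predict this transformation from $\phi_5(P)$; they are fitted by ridge regression.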
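The shift and scale equations and their inverse (the regression targets) can be checked with a short sketch. It assumes boxes in center format `(x, y, w, h)`; the names `encode` and `decode` are illustrative, not from the paper.

```python
import math

def encode(P, G):
    """Regression targets t = (t_x, t_y, t_w, t_h) that map proposal P onto ground truth G."""
    px, py, pw, ph = P
    gx, gy, gw, gh = G
    return ((gx - px) / pw,        # shift in x, relative to proposal width
            (gy - py) / ph,        # shift in y, relative to proposal height
            math.log(gw / pw),     # log-space width scaling
            math.log(gh / ph))     # log-space height scaling

def decode(P, d):
    """Apply predicted deltas d = (d_x, d_y, d_w, d_h) to proposal P."""
    px, py, pw, ph = P
    dx, dy, dw, dh = d
    return (pw * dx + px,
            ph * dy + py,
            pw * math.exp(dw),
            ph * math.exp(dh))

P = (10.0, 10.0, 4.0, 8.0)
G = (12.0, 14.0, 8.0, 4.0)
t = encode(P, G)
G_hat = decode(P, t)   # a perfect regressor (d == t) recovers G
```

Normalizing the shifts by the proposal size and predicting the scale in log space makes the targets independent of the absolute size of the proposal.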

Fast RCNN

Drawbacks of RCNN

Training is a multi-stage pipeline
Training is too expensive
Object detection is slow (proposal generation and the convolutional forward pass for each proposal are the major time-consuming parts)
- The per-proposal convolution is addressed by: SPPnets (spatial pyramid pooling networks)
  1. Compute a convolutional feature map for the entire input image instead of for each proposal.
  2. Classify each proposal using a feature vector extracted from the shared feature map (by max-pooling to a fixed-size output such as $6\times6$)

Contributions

Single-stage training
No disk storage is required for feature caching

Architectures

Two differences:

Feature maps are not computed per proposal, but once for the entire image.
Prediction of the class id and bbox regression are implemented in one single network (instead of SVM + FC).

ROI pooling layer

The feature map region corresponding to an RoI, whatever its size ($16\times20$ for example), is transformed into a fixed size ($7\times7$ for example).

Max pooling is applied over sub-windows of approximate size $16/7\times20/7$.
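A minimal pure-Python sketch of this pooling step, assuming the RoI's feature map is given as a 2-D list and that fractional window boundaries are rounded outward (floor for the start, ceiling for the end). The rounding scheme and the function name `roi_max_pool` are assumptions for illustration; real implementations differ in these details.

```python
def roi_max_pool(feature_map, out_h, out_w):
    """Max-pool an H x W feature map (list of lists) to a fixed out_h x out_w grid."""
    h, w = len(feature_map), len(feature_map[0])
    pooled = []
    for i in range(out_h):
        y0 = (i * h) // out_h                         # floor(i * h / out_h)
        y1 = ((i + 1) * h + out_h - 1) // out_h       # ceil((i + 1) * h / out_h)
        row = []
        for j in range(out_w):
            x0 = (j * w) // out_w
            x1 = ((j + 1) * w + out_w - 1) // out_w
            # Max over the sub-window [y0, y1) x [x0, x1).
            row.append(max(feature_map[y][x]
                           for y in range(y0, y1)
                           for x in range(x0, x1)))
        pooled.append(row)
    return pooled

# A 16 x 20 RoI whose value at (y, x) is y * 20 + x, pooled to 7 x 7.
fm = [[y * 20 + x for x in range(20)] for y in range(16)]
pooled = roi_max_pool(fm, 7, 7)
```

With outward rounding, adjacent windows can overlap by one cell, which is one way to handle the non-integer window size $16/7\times20/7$.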

Backward pass

The derivative with respect to an input of the RoI pooling layer is accumulated from every output unit for which that input was selected as the max feature unit.

Scale invariance: to brute force or finesse?

Brute force: fixed size (single scale)

Finesse: multi-scale (image pyramid)

Multi-task Loss

Overall loss = classification loss + bounding box regression loss

Notably, the bounding box loss differs from the one used in RCNN!

$L_{bbr}(x) = \begin{cases} 0.5x^2 & \text{if } |x| < 1 \\ |x| - 0.5 & \text{otherwise} \end{cases}$, where $x = \text{predicted} - \text{label}$

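This avoids exploding gradients: the gradient magnitude never exceeds 1, whereas the previous L2 loss $x^2$ has gradient magnitude $|L_{bbr}'| = 2|x|$, which grows without bound.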
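The piecewise loss and its gradient are small enough to write out directly. A sketch for a single scalar residual `x = predicted - label` (the function names are illustrative):

```python
def smooth_l1(x):
    """Quadratic near zero, linear for |x| >= 1."""
    ax = abs(x)
    return 0.5 * x * x if ax < 1 else ax - 0.5

def smooth_l1_grad(x):
    """Derivative of smooth_l1: equals x inside (-1, 1), otherwise the sign of x."""
    if abs(x) < 1:
        return x
    return 1.0 if x > 0 else -1.0

losses = [smooth_l1(v) for v in (0.5, 3.0, -3.0)]
grads = [smooth_l1_grad(v) for v in (-0.5, 10.0, -10.0)]
```

A residual of 10 contributes a gradient of 1 here, compared with 20 under the squared loss $x^2$, so outlier boxes cannot dominate the update.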

Mini batch sampling

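Mini-batch size = 128 RoIs = 64 RoIs/image × 2 images

25% of the RoIs are sampled from proposals with IoU ≥ 0.5 against a ground-truth box (foreground); the rest are sampled from proposals whose maximum IoU lies in [0.1, 0.5) (background).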
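A sketch of the per-image RoI sampling, assuming the maximum IoU of each proposal against the ground-truth boxes has already been computed. The background interval [0.1, 0.5) follows the Fast RCNN paper; the function name, parameter names, and the choice to fill any foreground shortfall with background RoIs are assumptions.

```python
import random

def sample_rois(max_ious, rois_per_image=64, fg_fraction=0.25,
                fg_thresh=0.5, bg_lo=0.1, rng=random):
    """Return (foreground indices, background indices) sampled for one image."""
    fg = [i for i, v in enumerate(max_ious) if v >= fg_thresh]
    bg = [i for i, v in enumerate(max_ious) if bg_lo <= v < fg_thresh]
    # At most 25% foreground; the remaining slots go to background RoIs.
    n_fg = min(int(rois_per_image * fg_fraction), len(fg))
    n_bg = min(rois_per_image - n_fg, len(bg))
    return rng.sample(fg, n_fg), rng.sample(bg, n_bg)

# 40 well-overlapping proposals, 200 partial overlaps, 10 near-misses.
max_ious = [0.9] * 40 + [0.3] * 200 + [0.05] * 10
fg_idx, bg_idx = sample_rois(max_ious, rng=random.Random(0))
```

Running this for 2 images gives the 128-RoI mini-batch, with 16 foreground and 48 background RoIs per image when enough candidates exist.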

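Non-Maximum Suppression

Start from a set of proposals with IoU > IOU_THRESHOLD. Sort them by IoU (at inference time, where no ground truth is available, the ranking uses the confidence score instead) and select the top proposal. Compute the IoU between it and every other proposal, and delete all proposals whose IoU with it is greater than NMS_THRESHOLD (they overlap too much with the best one). Repeat on the remaining proposals until none are left.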
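A minimal greedy NMS sketch in pure Python. It assumes boxes in `(x1, y1, x2, y2)` format and a per-proposal ranking value passed as `scores`; the threshold default of 0.3 is an arbitrary illustrative value, and the function names are hypothetical.

```python
def iou(a, b):
    """Intersection over Union of two (x1, y1, x2, y2) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, nms_threshold=0.3):
    """Return indices of the kept boxes, highest-ranked first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)            # highest-ranked remaining proposal
        keep.append(best)
        # Discard proposals that overlap the selected one too much.
        order = [i for i in order if iou(boxes[best], boxes[i]) <= nms_threshold]
    return keep

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)
```

The second box overlaps the first with IoU of about 0.68, so it is suppressed, while the distant third box survives.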