0%

Mask RCNN

Some Terms

Ground Truth boxes: The masks labeled in the original data.

Paper writing

  • Historically speaking,…
  • conjecture 推测
  • Qualitative results 一些example
  • elucidate 阐发。make something clear

RCNN

Key contributions

  • proposed the impotance of features.

    Features matter, the first sentence of RCNN paper.

  • Generalize the CNN classification results on ImageNet to object detection

    BY

    bridging the gap beween image classification and object detection.

Modules description

generating category-independent region proposals. Use Selective search ( a traditional machine learning method).

1
2
3
4
5
6
7
8
9
10
11
12
13
14
Input: image
Return: a set of proposals

Regions Initializer: R = {r1,r2...}
for each ri, rj in R:
cal similarity of i,j: s(ri,rj)
S = S and s(ri,rj)
while S not empy:
get s_max= s(ri,rj)
rt = ri + rj
remove s(r*,r*) including ri and rj
S = S and St
R = R and rt
return bounding box of each region in R

Feature extraction

Proposal transformation
  • tightest square with context (context for fulfill 227,227)(context:前后关系。此处意为proposal再图像里的左右部分 )
  • tightest square without context
  • Resize and zero padding
Train:
  • Positive data: proposals with IOU > 0.5 and Ground Truth labels
  • Negative data: proposals with IOU < 0.5

A CNN model.

Input: $(227,227,3)$

Output features: $(4096)$

Class-specific linear SVMs (class independently)

For each proposal, SVM generates an expected class and corresponding confidence. Final results only include proposals with IoU (Intersection of union) [Bounding box] overlap with a higher scoring selected region larger than a learned threshold (pretty important according to the paper).

Train:

  • Positive data: Ground Truth labels
  • Negative data: proposals with IOU < 0.3
  • Dropped data: proposals with IOU > 0.3 (Too much positive samples which do not emphasize precise localization)

Visualize method

Features

Use 10M+ images as input, for each feature unit, a output value will be generated. Then rank the 10M+ value and show images with corresponding top 10 values. (Speak for themselves)

1523806612856

[Each row indicates the result of each feature unit]

Ablation(切除)
  • Convolution layer has sufficient representational power of image. (更偏于原图)
  • Fc 更偏于 features. The fine-tuning of fc will benefit more because features of different tasks are different.(But the deep representation format of same image has no much difference)

Bounding box regression

Train
  • Positive data: Region proposal with biggest IOU and IOU > 0.6. ($P_x,P_y,P_w,P_h$)

$f(P_x, P_y, P_w, P_h) = (\hat{G_x}, \hat{G_y}, \hat{G_w}, \hat{G_h})$
and
$(\hat{G_x}, \hat{G_y}, \hat{G_w}, \hat{G_h}) \approx (G_x, G_y, G_w, G_h)$

  1. shift. $\hat G_x = P_w d_x(P) + P_x , \text(1).\hat G_y= P_h d_y(P) + P_y , \text(2)$
  2. scale.$\hat G_w= P_w exp(d_w(P) ), \text(3).\hat G_h= P_h exp(d_h(P) ) , \text(4)$
  3. Four real parameters: $d_x(P), d_y(P), d_w(P), d_h(P) =t_*= (t_x, t_y, t_w, t_h) $

$W_* = argmin_{w_*} \sum_i^N(t_*^i - \hat w_*^T\phi_5(P^i))^2 + \lambda || \hat w_*||^2$

$t$ : real changing needing to do.

$w$: learned changing, which need to be regression.

Fast RCNN

Drawbacks of RCNN

  • multi-stage pipeline trainning
  • training is too expensive
  • slow object detection ( proposal and the convolution forward of each proposal is the major time-consuming part )
    • solved by: SPP nets(spatial pyramid pooling networks)
      1. computes a convolutional feature map for the entire input image instead of each proposals.
      2. classifies each proposal using a feature vector extracted from the shared feature map (by max-pooling to a fixed size output such as $6\times6$)

Contributions

  • Single-stage training
  • No disk storage is required for feature caching

Architectures

img

Two differences:

  • feature maps are calculated using proposals, but entire image.
  • Prediction of class id and bbox regression is implemented using one single network. ( instead of SVM + FC)

ROI pooling layer

Any size($16\times20$ for example ) of ROI’s corresponding feature maps will be transformed into fixed size(7*7 for example).

Using a windows of size($16/7\times20/7$) to do max pooling.

backwards calculation

derivatives are accumulated in the input of the ROI pooling layer if it is selected as MAX feature unit.

Scale invariance: to brute force or finesse?

Brute force: fixed size ( single scale)

finesses multi scale

Multi-task Loss

Overall loss = Loss of classification + bounding box regresssion

Typically, The bounding box loss is different!

$L_{bbr} = $

  • $0.5x^2\ if |x|<1, where\ x =(predicted-label) $
  • $|x|-0.5\ otherwise$

To avoid exploding gradient . (Previous $L_{bbr}’ = 2|x|$)

Mini batch sampling

Mini batch size = 128 = 64 RoIs /image * 2images

RoIs = 25% proposals generated AND IoU>0.5

Non Max SUPPRESSION

一组 iou > IOU_THRESHOLD 的proposals。以 iou排序, 选取最大的那个proposal, 计算其他proposals和他的iou,大于 NMS_THRESHOLD 的 全部删掉 (和最好的那个重复太多),一直重复遍历。