
Transformer: Attention is all you need

Recommended to watch together with https://www.bilibili.com/video/av56239558?from=search&seid=14406218127146760248.

On reading papers: I personally think there are two things to read (and think about). One is the algorithm itself: what interesting or amazing ideas are worth learning. The other is the experience and thinking behind it: how the authors arrived at the method, and what intuition or experience it embodies.

Background

  • RNNs are not amenable to parallelization.
  • Attention can relate positions regardless of their distance in the sequence (e.g., from a long sentence like "我爱这个笑着眼睛像月牙弯弯每天都像阳光一样点亮我的你" it can pull out the key information "我爱你", i.e. "I love you").
  • Attention at the time (around 2016) was still typically used in conjunction with a recurrent network.
  • Existing work either cannot be parallelized, or can be parallelized (e.g., CNNs) but has trouble relating distant positions (a local receptive field only connects nearby words, e.g. "我爱" with "月牙弯弯").
  • Transformer: the first model to compute representations of its input and output entirely with self-attention, using no RNN and no CNN.

preclude (the opposite of include): to rule out, prevent. "This inherently sequential nature precludes parallelization within training examples."

albeit: although, even though.

Model Architecture

Overview:

[Figure: Transformer encoder-decoder architecture]

At each step (the output sequence is generated step by step, as in other sequence-to-sequence models):
intermediate representation z = Encoder(input sequence x, e.g. "我爱你")
output token y_t = Decoder(z, y_{<t})
Note: the "Outputs" box at the bottom right of the figure feeds in the previously generated outputs (shifted right).
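A minimal sketch of this encode-once, decode-step-by-step loop with greedy decoding. The `encoder`/`decoder` callables and their signatures are hypothetical, just to make the data flow concrete:

```python
import torch

def greedy_decode(encoder, decoder, src_ids, bos_id, eos_id, max_len=50):
    """Encode the source once, then generate the target one token at a time.
    `encoder` and `decoder` are assumed callables with the shapes noted below
    (illustrative only, not the paper's API)."""
    # z: intermediate representation of the whole input sequence
    z = encoder(src_ids)                        # [1, src_len, d_model]

    ys = torch.tensor([[bos_id]])               # previously generated outputs
    for _ in range(max_len):
        logits = decoder(ys, z)                 # [1, tgt_len, vocab]
        next_id = logits[:, -1].argmax(dim=-1)  # greedy pick for position t
        ys = torch.cat([ys, next_id.unsqueeze(0)], dim=1)
        if next_id.item() == eos_id:            # stop at end-of-sequence
            break
    return ys
```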

Encoder

  • LayerNorm(x + MultiHead(x))
  • LayerNorm(x + FeedForward(x))
  • Queries, keys, and values all come from the same input x (self-attention); see the sketch after this list.
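A minimal PyTorch sketch of one encoder layer in exactly this post-LN form (residual add, then LayerNorm), using `nn.MultiheadAttention` (recent PyTorch assumed); the hyperparameters are illustrative:

```python
import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    def __init__(self, d_model=512, n_heads=8, d_ff=2048, dropout=0.1):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, dropout=dropout,
                                          batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                                nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.drop = nn.Dropout(dropout)

    def forward(self, x):
        # Self-attention: queries, keys and values all come from x.
        a, _ = self.attn(x, x, x)
        x = self.norm1(x + self.drop(a))     # LayerNorm(x + MultiHead(x))
        f = self.ff(x)
        x = self.norm2(x + self.drop(f))     # LayerNorm(x + FeedForward(x))
        return x

x = torch.randn(2, 10, 512)                  # [batch, seq, d_model]
print(EncoderLayer()(x).shape)               # torch.Size([2, 10, 512])
```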

Decoder

  • Additional sub-layer compared with the encoder: a masked self-attention layer (ensures that the predictions for position i can depend only on the known outputs at positions less than i; see the mask sketch after this list).

  • Keys, values (of the encoder-decoder attention sub-layer): the intermediate sequence z from the encoder.

  • Queries: the output of the masked multi-head self-attention sub-layer.
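A small sketch of the causal mask that enforces the "only positions less than i" constraint; with PyTorch's boolean `attn_mask` convention, `True` marks a position that may not be attended to:

```python
import torch

def causal_mask(size: int) -> torch.Tensor:
    """Boolean mask where entry [i, j] is True if j > i,
    i.e. position i must not attend to the future position j."""
    return torch.triu(torch.ones(size, size, dtype=torch.bool), diagonal=1)

print(causal_mask(4))
# tensor([[False,  True,  True,  True],
#         [False, False,  True,  True],
#         [False, False, False,  True],
#         [False, False, False, False]])
```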

Self-attention

A mapping from a query and a set of key-value pairs to an output

Output: a weighted sum of the values, where the weight assigned to each value is computed by a compatibility function (essentially a similarity score) of the query with the corresponding key.

[Figure: Scaled Dot-Product Attention, $\mathrm{Attention}(Q,K,V)=\mathrm{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right)V$]

$Q: [seq, d_k],\; K: [seq, d_k],\; V: [seq, d_v]$

$seq$: sequence length

Note: the original x has dimension $d_{model}$; linear projections map it to $d_k$ (queries, keys) and $d_v$ (values).
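A minimal sketch of scaled dot-product attention with exactly these shapes (single sequence, no batch dimension, no mask):

```python
import torch

def scaled_dot_product_attention(Q, K, V):
    """Q, K: [seq, d_k]; V: [seq, d_v]; returns [seq, d_v]."""
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5   # [seq, seq] compatibility
    weights = scores.softmax(dim=-1)                # one distribution per query
    return weights @ V                              # weighted sum of the values

Q = K = torch.randn(10, 64)   # d_k = 64
V = torch.randn(10, 64)       # d_v = 64
print(scaled_dot_product_attention(Q, K, V).shape)  # torch.Size([10, 64])
```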

Multi-head

$x: [d_{model}] \to [h \cdot d_v] \to [d_{model}]$

h: the number of heads (the per-head outputs are concatenated, hence $h \cdot d_v$).

A final linear transform then maps the concatenation back to $d_{model}$; see the sketch below.
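A sketch of the multi-head computation for a single sequence x of shape [seq, d_model]: project into h heads, attend per head, concatenate back to $h \cdot d_v$, then apply the final linear map to $d_{model}$. Module names are illustrative, and $d_k = d_v = d_{model}/h$ as in the paper:

```python
import torch
import torch.nn as nn

class MultiHead(nn.Module):
    def __init__(self, d_model=512, h=8):
        super().__init__()
        self.h, self.d_k = h, d_model // h           # d_k = d_v = d_model / h
        self.w_q = nn.Linear(d_model, d_model)       # h projections in one matrix
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        self.w_o = nn.Linear(d_model, d_model)       # final output projection

    def forward(self, x):                            # x: [seq, d_model]
        seq = x.size(0)
        # [seq, d_model] -> [h, seq, d_k]
        q, k, v = (w(x).view(seq, self.h, self.d_k).transpose(0, 1)
                   for w in (self.w_q, self.w_k, self.w_v))
        scores = q @ k.transpose(-2, -1) / self.d_k ** 0.5
        out = scores.softmax(dim=-1) @ v             # [h, seq, d_k]
        out = out.transpose(0, 1).reshape(seq, self.h * self.d_k)  # concat heads
        return self.w_o(out)                         # back to [seq, d_model]

x = torch.randn(10, 512)
print(MultiHead()(x).shape)    # torch.Size([10, 512])
```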

Positional Encoding

A common way to encode positions is one-hot encoding, but its dimension obviously does not match the input embedding, and simply summing a one-hot vector with the input would be rather crude.

Instead, the Transformer uses sine and cosine functions:

$$PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{model}}}\right), \qquad PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{model}}}\right)$$

PE is the positional embedding to be added. pos is the position of the word in the sequence, and 2i (or 2i+1) indexes a dimension of the $d_{model}$-dimensional embedding.

  • PE(:, x), i.e. the $x$-th dimension across all positions, is a periodic function of pos.
  • PE(x, :), i.e. the embedding at the $x$-th position, uses sinusoids whose wavelengths form a geometric progression from $2\pi$ to $10000 \cdot 2\pi$.

Counter-intuitively, though, PE is added directly to the input embedding rather than concatenated to it, which seems odd.
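A sketch of building the sinusoidal table from the formula above; `positional_encoding` is an illustrative helper, and row pos is the vector added to the embedding at position pos:

```python
import torch

def positional_encoding(max_len: int, d_model: int) -> torch.Tensor:
    """Returns PE of shape [max_len, d_model] (d_model assumed even):
    PE[pos, 2i]   = sin(pos / 10000^(2i/d_model))
    PE[pos, 2i+1] = cos(pos / 10000^(2i/d_model))"""
    pos = torch.arange(max_len).unsqueeze(1).float()   # [max_len, 1]
    i = torch.arange(0, d_model, 2).float()            # even dimensions 2i
    angle = pos / 10000 ** (i / d_model)               # [max_len, d_model/2]
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(angle)
    pe[:, 1::2] = torch.cos(angle)
    return pe

pe = positional_encoding(100, 512)
# added (not concatenated) to the input embeddings, e.g.:
# x = embedding(tokens) + pe[:tokens.size(0)]
```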

The order within each sub-layer is Dropout -> Residual add -> Layer Normalization (post-LN). Some later papers argue that layer normalization should instead be applied inside the sub-layer, before the residual summation (pre-LN); see the sketch below.
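A tiny sketch contrasting the two orderings for a single sub-layer; `sublayer`, `norm`, and `drop` stand for any attention/feed-forward module, a LayerNorm, and a Dropout (names are illustrative):

```python
# post-LN (original paper): dropout -> residual add -> LayerNorm
def post_ln(x, sublayer, norm, drop):
    return norm(x + drop(sublayer(x)))

# pre-LN (variant argued for in some later papers): LayerNorm inside the branch
def pre_ln(x, sublayer, norm, drop):
    return x + drop(sublayer(norm(x)))
```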

Why self-attention

  • Total computational complexity per layer (# of operations)
  • Amount of computation that can be parallelized
  • Path length between long-range dependencies in the network (see the comparison table below)
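The comparison from Table 1 of the paper, with n = sequence length, d = representation dimension, k = convolution kernel width:

| Layer type | Complexity per layer | Sequential operations | Maximum path length |
| --- | --- | --- | --- |
| Self-attention | $O(n^2 \cdot d)$ | $O(1)$ | $O(1)$ |
| Recurrent | $O(n \cdot d^2)$ | $O(n)$ | $O(n)$ |
| Convolutional | $O(k \cdot n \cdot d^2)$ | $O(1)$ | $O(\log_k(n))$ |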

Results

  • Dropout has a clearly noticeable effect.