The dominant sequence transduction models are based on complex recurrent or convolutional neural networks that include an encoder and a decoder. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our model achieves 28.4 BLEU on the WMT 2014 English-to-German translation task, improving over the existing best results, including ensembles, by over 2 BLEU. On the WMT 2014 English-to-French translation task, our model establishes a new single-model state-of-the-art BLEU score of 41.8 after training for 3.5 days on eight GPUs, a small fraction of the training costs of the best models from the literature. We show that the Transformer generalizes well to other tasks by applying it successfully to English constituency parsing both with large and limited training data.
The model consists of an encoder and a decoder.
<aside> 💡 Similar to a translation process? The source sentence is turned into a representation by the encoder, and the target sentence is produced by the decoder → when the decoder generates a new symbol it may only use the words that come before it, building the output step by step; this is how I understand the masking concept.
</aside>
How it works: X - (encoder) - Z - (decoder) - Y
The overall architecture consists of stacked self-attention and point-wise, fully connected layers.
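A minimal sketch of this flow, assuming hypothetical `encode` and `decode` callables (not the paper's API): the encoder maps the input X to a representation Z, and the decoder then generates Y auto-regressively, consuming Z and the symbols produced so far.

```python
def greedy_generate(encode, decode, src_tokens, eos_id, max_len=50):
    """Sketch of X -(encoder)-> Z -(decoder)-> Y with placeholder callables.

    encode(src_tokens) is assumed to return the representation Z;
    decode(z, prefix) is assumed to return the next output token id.
    """
    z = encode(src_tokens)        # encoder: X -> Z
    out = []
    for _ in range(max_len):      # decoder generates one symbol at a time
        nxt = decode(z, out)      # conditioned on Z and previously generated symbols
        out.append(nxt)
        if nxt == eos_id:         # stop at end-of-sequence
            break
    return out
```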
Encoder (a stack of N = 6 identical layers)
<aside> 💡 Here $sublayer$ refers to the layer itself: the x before entering the sub-layer and the x after passing through it are added together and then normalized, i.e. $\mathrm{LayerNorm}(x + \mathrm{Sublayer}(x))$
</aside>
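A minimal NumPy sketch of that pattern, $\mathrm{LayerNorm}(x + \mathrm{Sublayer}(x))$ (learned gain/bias omitted; the sub-layer names in the comments are placeholders):

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalize over the last (feature) dimension; learned gain/bias omitted.
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def residual_sublayer(x, sublayer):
    # The pattern around every sub-layer: LayerNorm(x + Sublayer(x)).
    return layer_norm(x + sublayer(x))

# An encoder layer applies this twice (placeholder sub-layer callables):
#   h = residual_sublayer(x, self_attention)
#   h = residual_sublayer(h, feed_forward)
```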
Decoder (a stack of N = 6 identical layers)
(masked) multi-head attention + multi-head attention (over the encoder output as well) + fully connected feed-forward network
<aside> 💡 In the decoder, masking ensures each position depends only on the previously known outputs (a small mask sketch follows below)
</aside>
Residual connections and layer normalization are applied in the same way as in the encoder.
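A minimal sketch of the mask behind the aside above: a lower-triangular boolean matrix so that position $i$ can attend only to positions $\le i$, keeping each prediction dependent only on known outputs.

```python
import numpy as np

def subsequent_mask(size):
    # True where attention is allowed: position i can see positions <= i only.
    return np.tril(np.ones((size, size), dtype=bool))

print(subsequent_mask(4).astype(int))
# [[1 0 0 0]
#  [1 1 0 0]
#  [1 1 1 0]
#  [1 1 1 1]]
```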
Attention maps a query and a set of key-value pairs to an output.
The output is a weighted sum of the values, where the weight assigned to each value is computed from the query and the corresponding key.
For a single query $q$: $\mathrm{Attention}(q, K, V) = \sum_i \mathrm{softmax}\big(f(q, K)\big)_i \, v_i$
In the paper $f$ is the scaled dot product, which in matrix form gives $\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$
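A minimal NumPy sketch of scaled dot-product attention, the score function used in the paper: softmax weights are computed from the query-key scores and then used for the weighted sum of the values; the optional `mask` argument is where the decoder's subsequent mask plugs in.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V, mask=None):
    # Compatibility of each query with each key, scaled by sqrt(d_k).
    d_k = Q.shape[-1]
    scores = Q @ K.swapaxes(-2, -1) / np.sqrt(d_k)      # (..., len_q, len_k)
    if mask is not None:
        scores = np.where(mask, scores, -1e9)           # block masked positions
    # Softmax over the key dimension gives the attention weights.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V                                   # weighted sum of the values
```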