The dominant sequence transduction models are based on complex recurrent or convolutional neural networks that include an encoder and a decoder. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our model achieves 28.4 BLEU on the WMT 2014 English-to-German translation task, improving over the existing best results, including ensembles, by over 2 BLEU. On the WMT 2014 English-to-French translation task, our model establishes a new single-model state-of-the-art BLEU score of 41.8 after training for 3.5 days on eight GPUs, a small fraction of the training costs of the best models from the literature. We show that the Transformer generalizes well to other tasks by applying it successfully to English constituency parsing both with large and limited training data.
The model consists of an encoder and a decoder.
<aside> 💡 Similar to a translation process? The source sentence is turned into a representation by the encoder, and the target sentence is produced by the decoder → when the decoder generates a new symbol it may only use the words that come before it, building the output step by step; this is how I understand the masking concept.
</aside>
How it works: X - (encoder) - Z - (decoder) - Y
The overall architecture consists of stacked self-attention and point-wise, fully connected layers.
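A minimal sketch of this flow, assuming hypothetical `encode` and `decode` callables (not the paper's API): the encoder maps the input X to a representation Z, and the decoder then generates Y auto-regressively, consuming Z and the symbols produced so far.

```python
def greedy_generate(encode, decode, src_tokens, eos_id, max_len=50):
    """Sketch of X -(encoder)-> Z -(decoder)-> Y with placeholder callables.

    encode(src_tokens) is assumed to return the representation Z;
    decode(z, prefix) is assumed to return the next output token id.
    """
    z = encode(src_tokens)        # encoder: X -> Z
    out = []
    for _ in range(max_len):      # decoder generates one symbol at a time
        nxt = decode(z, out)      # conditioned on Z and previously generated symbols
        out.append(nxt)
        if nxt == eos_id:         # stop at end-of-sequence
            break
    return out
```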
Encoder (a stack of N = 6 identical layers)
<aside> 💡 Here $sublayer$ refers to the layer itself: the x before entering the sub-layer and the x after passing through it are added together and then normalized, i.e. $\mathrm{LayerNorm}(x + \mathrm{Sublayer}(x))$
</aside>
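A minimal NumPy sketch of that pattern, $\mathrm{LayerNorm}(x + \mathrm{Sublayer}(x))$ (learned gain/bias omitted; the sub-layer names in the comments are placeholders):

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalize over the last (feature) dimension; learned gain/bias omitted.
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def residual_sublayer(x, sublayer):
    # The pattern around every sub-layer: LayerNorm(x + Sublayer(x)).
    return layer_norm(x + sublayer(x))

# An encoder layer applies this twice (placeholder sub-layer callables):
#   h = residual_sublayer(x, self_attention)
#   h = residual_sublayer(h, feed_forward)
```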
Decoder (a stack of N = 6 identical layers)
(masked) multi-head attention + multi-head attention (over the encoder output as well) + fully connected feed-forward network
<aside> 💡 In the decoder, masking ensures each position depends only on the previously known outputs (a small mask sketch follows below)
</aside>
Residual connections and layer normalization are applied in the same way as in the encoder.
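A minimal sketch of the mask behind the aside above: a lower-triangular boolean matrix so that position $i$ can attend only to positions $\le i$, keeping each prediction dependent only on known outputs.

```python
import numpy as np

def subsequent_mask(size):
    # True where attention is allowed: position i can see positions <= i only.
    return np.tril(np.ones((size, size), dtype=bool))

print(subsequent_mask(4).astype(int))
# [[1 0 0 0]
#  [1 1 0 0]
#  [1 1 1 0]
#  [1 1 1 1]]
```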
Attention maps a query and a set of key-value pairs to an output.
The output is a weighted sum of the values, where the weight assigned to each value is computed from the query and the corresponding key.
For a single query $q$: $\mathrm{Attention}(q, K, V) = \sum_i \mathrm{softmax}\big(f(q, K)\big)_i \, v_i$
In the paper $f$ is the scaled dot product, which in matrix form gives $\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$
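A minimal NumPy sketch of scaled dot-product attention, the score function used in the paper: softmax weights are computed from the query-key scores and then used for the weighted sum of the values; the optional `mask` argument is where the decoder's subsequent mask plugs in.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V, mask=None):
    # Compatibility of each query with each key, scaled by sqrt(d_k).
    d_k = Q.shape[-1]
    scores = Q @ K.swapaxes(-2, -1) / np.sqrt(d_k)      # (..., len_q, len_k)
    if mask is not None:
        scores = np.where(mask, scores, -1e9)           # block masked positions
    # Softmax over the key dimension gives the attention weights.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V                                   # weighted sum of the values
```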