These notes summarize the paper "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding," which introduces BERT, a language model developed at Google. BERT is pre-trained bidirectionally and can be applied to a wide range of tasks with little modification.

Abstract

We introduce a new language representation model called BERT, which stands for Bidirectional Encoder Representations from Transformers. Unlike recent language representation models (Peters et al., 2018a; Radford et al., 2018), BERT is designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers. As a result, the pre-trained BERT model can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks, such as question answering and language inference, without substantial task-specific architecture modifications. BERT is conceptually simple and empirically powerful. It obtains new state-of-the-art results on eleven natural language processing tasks, including pushing the GLUE score to 80.5% (7.7% point absolute improvement), MultiNLI accuracy to 86.7% (4.6% absolute improvement), SQuAD v1.1 question answering Test F1 to 93.2 (1.5 point absolute improvement) and SQuAD v2.0 Test F1 to 83.1 (5.1 point absolute improvement).

1. Introduction

There are two strategies for applying pre-trained language representations to downstream tasks: feature-based and fine-tuning.
- Feature-based approach: as in ELMo, the pre-trained representations are used as additional features in a task-specific architecture.
- Fine-tuning approach: as in GPT, only a few task-specific parameters are introduced, and the model is adapted to the task by fine-tuning all pre-trained parameters.
Both approaches use unidirectional language models to learn general language representations.
- As a result, in a model such as GPT the power of the pre-trained representations is limited on token-level tasks such as question answering, where context on both the left and the right of each token matters.
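The difference between unidirectional and bidirectional context can be illustrated with attention visibility masks. The following is a minimal sketch, not code from the paper: the function names and the example sentence are hypothetical, and real models apply such masks inside self-attention rather than to token lists.

```python
def causal_mask(n):
    """Left-to-right mask: position i may attend to positions 0..i only."""
    return [[j <= i for j in range(n)] for i in range(n)]


def bidirectional_mask(n):
    """Full mask: every position may attend to every position."""
    return [[True] * n for _ in range(n)]


def visible_context(tokens, mask, i):
    """Tokens that position i is allowed to attend to under the given mask."""
    return [tok for tok, allowed in zip(tokens, mask[i]) if allowed]


# Hypothetical example sentence, chosen only for illustration.
tokens = ["where", "was", "BERT", "developed", "?"]
i = tokens.index("BERT")

left_only = visible_context(tokens, causal_mask(len(tokens)), i)
both_sides = visible_context(tokens, bidirectional_mask(len(tokens)), i)

print(left_only)   # context available to a left-to-right model at "BERT"
print(both_sides)  # context available to a bidirectional model at "BERT"
```

Under the causal mask, the representation of "BERT" cannot depend on "developed" or "?", which is the restriction the notes describe for token-level tasks.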
The paper proposes BERT (Bidirectional Encoder Representations from Transformers).
- BERT uses a masked language model (MLM) objective, which makes it possible to learn representations that take both left and right context into account.
- It also uses a next sentence prediction task to jointly pre-train text-pair representations.
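The two pre-training objectives can be sketched as data-preparation steps. The sketch below uses the rates given in the paper (15% of tokens selected for prediction; of those, 80% replaced by [MASK], 10% by a random token, 10% left unchanged; 50% IsNext / 50% NotNext pairs). Everything else is an assumption made for illustration: the function names, the toy vocabulary and sentences, selecting positions with an independent coin flip per token, and never selecting [CLS] or [SEP].

```python
import random

MASK = "[MASK]"
SPECIAL = ("[CLS]", "[SEP]")


def make_nsp_example(doc, i, other_sentences, rng):
    """Pair sentence i with its true successor (IsNext) or a random sentence (NotNext)."""
    a = doc[i]
    if rng.random() < 0.5:
        b, label = doc[i + 1], "IsNext"
    else:
        b, label = rng.choice(other_sentences), "NotNext"
    # Packed input format: [CLS] sentence A [SEP] sentence B [SEP]
    return ["[CLS]"] + a + ["[SEP]"] + b + ["[SEP]"], label


def mask_tokens(tokens, vocab, rng, mask_prob=0.15):
    """Return (corrupted tokens, labels); labels are None where nothing is predicted."""
    corrupted, labels = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        # Assumption of this sketch: special tokens are never selected.
        if tok in SPECIAL or rng.random() >= mask_prob:
            continue
        labels[i] = tok  # the model must recover the original token here
        r = rng.random()
        if r < 0.8:
            corrupted[i] = MASK               # 80%: replace with [MASK]
        elif r < 0.9:
            corrupted[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: keep the original token unchanged
    return corrupted, labels


# Toy data, hypothetical and for illustration only.
rng = random.Random(0)
vocab = ["the", "man", "went", "to", "store", "he", "bought", "milk", "birds", "sing"]
doc = [["the", "man", "went", "to", "the", "store"], ["he", "bought", "milk"]]
other = [["birds", "sing"]]

pair, nsp_label = make_nsp_example(doc, 0, other, rng)
corrupted, labels = mask_tokens(pair, vocab, rng)
print(nsp_label)
print(corrupted)
print(labels)
```

Because the prediction targets are hidden from the input, the encoder can attend to both sides of every position without any token trivially seeing itself, which is what allows the bidirectional training described above.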
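The contributions of the paper are as follows.
1. It demonstrates the importance of bidirectional pre-training.
2. It shows that pre-trained representations reduce the need for heavily engineered task-specific architectures.
3. It achieves state-of-the-art results on 11 NLP tasks.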

Most previous work used left-to-right (unidirectional) language models.
ELMo extracts context-sensitive features from a left-to-right and a right-to-left language model (feature-based).
- This simply concatenates the representations from the two directions, so it is not the deeply bidirectional representation that BERT aims for.

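Prior work has also shown that transfer from supervised tasks with large datasets, such as natural language inference and machine translation, is effective: a model pre-trained on such a task is then fine-tuned.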

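The framework consists of two steps: pre-training (on unlabeled data) and fine-tuning (initialize from the pre-trained parameters, then fine-tune on labeled data).
A distinctive feature of BERT is its unified architecture across tasks: the architecture changes very little between the pre-trained model and the downstream model.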
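For a classification task, the "one additional output layer" mentioned in the abstract is a weight matrix W of shape K x H (K labels, hidden size H) applied to the final hidden vector C of the [CLS] token, trained with a softmax cross-entropy loss. The following is a minimal sketch of that head in plain Python; the numeric values for C and W are hypothetical, and in actual fine-tuning C comes from the pre-trained encoder and all parameters are updated together.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(cls_vector, W):
    """Task head: logits = W @ C over the final [CLS] vector C, then softmax."""
    logits = [sum(w * c for w, c in zip(row, cls_vector)) for row in W]
    return softmax(logits)


# Hypothetical values: hidden size H = 4, number of labels K = 3.
cls_vector = [0.2, -0.1, 0.4, 0.0]
W = [
    [0.5, 0.0, -0.5, 0.1],
    [-0.3, 0.2, 0.8, 0.0],
    [0.0, 0.0, 0.0, 0.0],
]

probs = classify(cls_vector, W)
gold_label = 1
loss = -math.log(probs[gold_label])  # cross-entropy loss for one example

print(probs)
print(loss)
```

Only W is new for the downstream task; the rest of the network keeps the same structure as in pre-training, which is why the architecture changes so little across tasks.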