BERT

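Bidirectional Encoder Representations from Transformers (BERT) is a pre-training technique for natural language processing (NLP) developed by Google.^[1]^[2] BERT was created and published in 2018 by Jacob Devlin and his colleagues. Google uses BERT to better understand the meaning of users' search queries.^[3] A 2020 literature survey concluded that "in a little over a year, BERT has become a ubiquitous baseline in NLP experiments", counting over 150 research publications analyzing and improving the model.^[4]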

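The original English-language BERT was released with two pre-trained models:^[1] (1) BERT_BASE, a neural network with 12 layers, a hidden size of 768, 12 self-attention heads, and 110M parameters; and (2) BERT_LARGE, a neural network with 24 layers, a hidden size of 1024, 16 self-attention heads, and 340M parameters. Both were trained on BooksCorpus^[5] and English Wikipedia, which contain 800 million and 2.5 billion words respectively.^[6]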
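The parameter totals quoted above can be roughly reproduced from the layer count and hidden size alone. The sketch below is a back-of-the-envelope count, not an official accounting. It assumes values that are not stated in this article: a 30,522-entry WordPiece vocabulary, 512 position embeddings, 2 segment types, a feed-forward width of four times the hidden size, and a final pooling layer. The names `BertSize` and `count_parameters` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BertSize:
    layers: int
    hidden: int
    heads: int                 # splits `hidden` across heads; does not change the count
    vocab: int = 30522         # assumption: vocabulary size of the English release
    max_positions: int = 512   # assumption: number of position embeddings
    segment_types: int = 2     # assumption: sentence A / sentence B embeddings


def count_parameters(cfg: BertSize) -> int:
    """Count weights and biases of an encoder-only Transformer of this size."""
    h = cfg.hidden
    ffn = 4 * h                # assumption: feed-forward width is 4x the hidden size
    layer_norm = 2 * h         # one scale and one bias per hidden unit

    # Token, position and segment embedding tables, followed by a layer norm.
    embeddings = (cfg.vocab + cfg.max_positions + cfg.segment_types) * h + layer_norm
    # Query, key, value and output projections (weights + biases), then a layer norm.
    attention = 4 * (h * h + h) + layer_norm
    # Two dense layers (h -> ffn -> h), then a layer norm.
    feed_forward = (h * ffn + ffn) + (ffn * h + h) + layer_norm
    # Assumption: one dense pooling layer on top of the encoder.
    pooler = h * h + h

    return embeddings + cfg.layers * (attention + feed_forward) + pooler


BERT_BASE = BertSize(layers=12, hidden=768, heads=12)
BERT_LARGE = BertSize(layers=24, hidden=1024, heads=16)

for name, cfg in (("BERT_BASE", BERT_BASE), ("BERT_LARGE", BERT_LARGE)):
    print(f"{name}: {count_parameters(cfg) / 1e6:.1f}M parameters")
```

Under these assumptions the count comes to roughly 109M for BERT_BASE and 335M for BERT_LARGE, close to the rounded 110M and 340M figures given above. The number of attention heads does not enter the total, because the heads only partition the hidden dimension.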

Architecture

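At its core, BERT is a Transformer model with a variable number of encoder layers and self-attention heads. Its architecture is "almost identical" to the implementation of Vaswani et al. (2017).^[7]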

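BERT is pre-trained on two tasks: masked language modeling (15% of the tokens are masked, and BERT must infer them from context) and next sentence prediction (BERT must predict whether a given second sentence actually follows the first). Through this training, BERT learns contextual embeddings for words. Once the computationally expensive pre-training is complete, BERT can be fine-tuned on downstream tasks with fewer resources and smaller datasets to improve its performance on those tasks.^[1]^[8]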
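The way training examples for the two pre-training tasks might be constructed can be sketched as follows. This is a simplified illustration, not the original data pipeline: every selected token is replaced by a `[MASK]` placeholder, and the 50/50 split between true and random second sentences is an assumption that this article does not state. The function names `mask_tokens` and `make_nsp_pair` are hypothetical.

```python
import random

MASK_TOKEN = "[MASK]"


def mask_tokens(tokens, mask_prob=0.15, rng=None):
    """Hide about `mask_prob` of the tokens.

    Returns (masked_tokens, labels), where `labels` maps each masked
    position to the original token the model would have to recover.
    """
    if rng is None:
        rng = random.Random()
    n_to_mask = max(1, round(len(tokens) * mask_prob))
    positions = rng.sample(range(len(tokens)), n_to_mask)
    masked = list(tokens)
    labels = {}
    for pos in positions:
        labels[pos] = tokens[pos]
        masked[pos] = MASK_TOKEN
    return masked, labels


def make_nsp_pair(sentences, index, rng=None):
    """Build one next-sentence-prediction example starting at `index`.

    Assumption: half of the time the second sentence is the true successor
    (label True); otherwise it is another sentence chosen at random (label
    False). Needs at least three sentences and index < len(sentences) - 1.
    """
    if rng is None:
        rng = random.Random()
    first = sentences[index]
    if rng.random() < 0.5:
        return first, sentences[index + 1], True
    candidates = [s for i, s in enumerate(sentences) if i not in (index, index + 1)]
    return first, rng.choice(candidates), False


if __name__ == "__main__":
    rng = random.Random(0)
    tokens = "the quick brown fox jumps over the lazy dog near the old river bank".split()
    print(mask_tokens(tokens, rng=rng))
    docs = ["It rained all night.", "The river rose.", "Bread is baked daily.", "Cats sleep a lot."]
    print(make_nsp_pair(docs, 0, rng=rng))
```

During pre-training, the model would receive the masked sequence and be scored on how well it predicts the tokens stored in `labels`, and separately on whether it classifies the sentence pair correctly.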

Performance and analysis

At the time of its publication, BERT achieved state-of-the-art performance on the following natural language understanding tasks:^[1]

GLUE (General Language Understanding Evaluation) task set, consisting of 9 tasks.
SQuAD (Stanford Question Answering Dataset) v1.1 and v2.0.
SWAG (Situations With Adversarial Generations).

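The reasons for BERT's state-of-the-art performance on these natural language understanding tasks are not yet well understood.^[9]^[10] Current research on BERT's interpretability has focused on how carefully chosen input sequences affect BERT's output,^[11]^[12] on analyzing its internal vector representations with probing classifiers,^[13]^[14] and on the relationships represented by its attention weights.^[9]^[10]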

History

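BERT has its origins in the pre-training of contextual representations, including Semi-supervised Sequence Learning,^[15] Generative Pre-Training, ELMo,^[16] and ULMFit.^[17] Unlike previous models, BERT is a deeply bidirectional, unsupervised language representation, pre-trained using only a plain-text corpus. Context-free models such as word2vec or GloVe generate a single word vector for each word in the vocabulary, so they cannot distinguish between the senses of an ambiguous word. BERT takes into account the context in which each word occurs. For example, the word2vec vector for the Chinese word 「水分」 (literally "moisture") is the same in 「植物需要吸收水分」 ("plants need to absorb moisture") and 「財務報表裏有水分」 ("the financial statements are padded"), whereas BERT provides a different vector depending on the context, so that the vector reflects the meaning of the sentence.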
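The difference between a context-free lookup and a context-dependent representation can be illustrated with a toy sketch. It is not how BERT computes its vectors: averaging a word's vector with those of its neighbours is only a crude stand-in for self-attention. The two-dimensional vectors are made up for illustration, the word segmentation is done by hand, and the names `STATIC`, `static_vector`, and `contextual_vector` are hypothetical.

```python
# Hypothetical 2-d static vectors; the values are invented for illustration.
STATIC = {
    "植物": (0.9, 0.1), "需要": (0.5, 0.5), "吸收": (0.8, 0.2), "水分": (0.6, 0.4),
    "財務": (0.1, 0.9), "報表": (0.2, 0.8), "裏": (0.5, 0.5), "有": (0.3, 0.7),
}


def static_vector(tokens, i):
    """Context-free lookup: the result depends only on the word itself."""
    return STATIC[tokens[i]]


def contextual_vector(tokens, i, window=2):
    """Toy context-dependent vector: mean of the word and its neighbours."""
    lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
    neighbours = [STATIC[t] for t in tokens[lo:hi]]
    return tuple(sum(dim) / len(neighbours) for dim in zip(*neighbours))


# Hand-segmented versions of the two example sentences.
plants = ["植物", "需要", "吸收", "水分"]
finance = ["財務", "報表", "裏", "有", "水分"]

# Same vector for 「水分」 in both sentences with the static lookup.
print(static_vector(plants, 3), static_vector(finance, 4))
# Different vectors once the surrounding words are taken into account.
print(contextual_vector(plants, 3), contextual_vector(finance, 4))
```

In this toy setup the static lookup cannot tell the two uses of 「水分」 apart, while the context-dependent version shifts the vector toward the surrounding words.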

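On October 25, 2019, Google Search announced that it had started applying BERT models to English-language search queries within the United States.^[18] On December 9, 2019, it was reported that Google Search had adopted BERT for searches in more than 70 languages.^[19] By October 2020, almost every English-language query was being processed by BERT.^[20]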

Awards

BERT won the Best Long Paper Award at the 2019 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL).^[21]

See also

Transformer model
Word2vec
Autoencoder
Document-term matrix
Feature extraction
Feature learning
Neural network language model
Vector space model
Thought vector
fastText
GloVe
TensorFlow

References

[1] Devlin, Jacob; Chang, Ming-Wei; Lee, Kenton; Toutanova, Kristina. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. 2018-10-11. arXiv:1810.04805v2 [cs.CL].
[2] Open Sourcing BERT: State-of-the-Art Pre-training for Natural Language Processing. Google AI Blog. [2019-11-27]. (Archived from the original on 2021-01-13.)
[3] Understanding searches better than ever before. Google. 2019-10-25 [2019-11-27]. (Archived from the original on 2021-01-27.)
[4] Rogers, Anna; Kovaleva, Olga; Rumshisky, Anna. A Primer in BERTology: What We Know About How BERT Works. Transactions of the Association for Computational Linguistics. 2020, 8: 842–866 [2021-11-24]. doi:10.1162/tacl_a_00349. (Archived from the original on 2022-04-03.)
[5] Zhu, Yukun; Kiros, Ryan; Zemel, Rich; Salakhutdinov, Ruslan; Urtasun, Raquel; Torralba, Antonio; Fidler, Sanja. Aligning Books and Movies: Towards Story-Like Visual Explanations by Watching Movies and Reading Books: 19–27. 2015. arXiv:1506.06724 [cs.CV].
[6] Annamoradnejad, Issa. ColBERT: Using BERT Sentence Embedding for Humor Detection. 2020-04-27. arXiv:2004.12765 [cs.CL].
[7] Vaswani, Ashish; Shazeer, Noam; Parmar, Niki; Uszkoreit, Jakob; Jones, Llion; Gomez, Aidan N.; Kaiser, Lukasz; Polosukhin, Illia. Attention Is All You Need. 2017-06-12. arXiv:1706.03762 [cs.CL].
[8] Horev, Rani. BERT Explained: State of the art language model for NLP. Towards Data Science. 2018 [2021-09-27]. (Archived from the original on 2022-10-17.)
[9] Kovaleva, Olga; Romanov, Alexey; Rogers, Anna; Rumshisky, Anna. Revealing the Dark Secrets of BERT. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). November 2019: 4364–4373 [2020-10-19]. doi:10.18653/v1/D19-1445. (Archived from the original on 2020-10-20.)
[10] Clark, Kevin; Khandelwal, Urvashi; Levy, Omer; Manning, Christopher D. What Does BERT Look at? An Analysis of BERT's Attention. Proceedings of the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP (Stroudsburg, PA, USA: Association for Computational Linguistics). 2019: 276–286.
[11] Khandelwal, Urvashi; He, He; Qi, Peng; Jurafsky, Dan. Sharp Nearby, Fuzzy Far Away: How Neural Language Models Use Context. Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (Stroudsburg, PA, USA: Association for Computational Linguistics). 2018: 284–294. Bibcode:2018arXiv180504623K. arXiv:1805.04623. doi:10.18653/v1/p18-1027.
[12] Gulordava, Kristina; Bojanowski, Piotr; Grave, Edouard; Linzen, Tal; Baroni, Marco. Colorless Green Recurrent Networks Dream Hierarchically. Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers) (Stroudsburg, PA, USA: Association for Computational Linguistics). 2018: 1195–1205. Bibcode:2018arXiv180311138G. arXiv:1803.11138. doi:10.18653/v1/n18-1108.
[13] Giulianelli, Mario; Harding, Jack; Mohnert, Florian; Hupkes, Dieuwke; Zuidema, Willem. Under the Hood: Using Diagnostic Classifiers to Investigate and Improve how Language Models Track Agreement Information. Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP (Stroudsburg, PA, USA: Association for Computational Linguistics). 2018: 240–248. Bibcode:2018arXiv180808079G. arXiv:1808.08079. doi:10.18653/v1/w18-5426.
[14] Zhang, Kelly; Bowman, Samuel. Language Modeling Teaches You More than Translation Does: Lessons Learned Through Auxiliary Syntactic Task Analysis. Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP (Stroudsburg, PA, USA: Association for Computational Linguistics). 2018: 359–361. doi:10.18653/v1/w18-5448.
[15] Dai, Andrew; Le, Quoc. Semi-supervised Sequence Learning. 2015-11-04. arXiv:1511.01432 [cs.LG].
[16] Peters, Matthew; Neumann, Mark; Iyyer, Mohit; Gardner, Matt; Clark, Christopher; Lee, Kenton; Zettlemoyer, Luke. Deep contextualized word representations. 2018-02-15. arXiv:1802.05365v2 [cs.CL].
[17] Howard, Jeremy; Ruder, Sebastian. Universal Language Model Fine-tuning for Text Classification. 2018-01-18. arXiv:1801.06146v5 [cs.CL].
[18] Nayak, Pandu. Understanding searches better than ever before. Google Blog. 2019-10-25 [2019-12-10]. (Archived from the original on 2019-12-05.)
[19] Montti, Roger. Google's BERT Rolls Out Worldwide. Search Engine Journal. 2019-12-10 [2019-12-10]. (Archived from the original on 2020-11-29.)
[20] Google: BERT now used on almost every English query. Search Engine Land. 2020-10-15 [2020-11-24]. (Archived from the original on 2022-05-06.)
[21] Best Paper Awards. NAACL. 2019 [2020-03-28]. (Archived from the original on 2020-10-19.)

External links

Official GitHub repository (archived copy at the Internet Archive)
