Reinforcement learning from human feedback

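Reinforcement learning from human feedback (RLHF), which includes reinforcement learning from human preferences, is a machine learning technique that trains a "reward model" directly from human feedback and then uses that model as the reward function in reinforcement learning, optimizing the agent's policy with an algorithm such as proximal policy optimization.^[1] The reward model is trained before policy optimization begins, and it predicts whether a given output is good (high reward) or bad (low reward). RLHF can improve the robustness and exploration of reinforcement learning agents, especially when the reward function is sparse or noisy (uncertain).^[2]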

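The most common way to collect human feedback is to ask people to rank instances of the agent's behavior by preference.^[3]^[4]^[5] The rankings can then be used to score the outputs, for example with the Elo rating system.^[1] Although preference judgments of this kind are widely used, other types of human feedback can carry richer information, such as numerical feedback, natural language feedback, and edit rate.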
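As a rough illustration of how pairwise preference judgments could be turned into scores with an Elo-style update, here is a minimal Python sketch. The starting rating of 1000, the K-factor of 32, the function names `expected_score` and `update_elo`, and the sample comparisons are all assumptions made for this example; the cited sources do not specify how the ratings are computed in practice.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Elo-model probability that output A is preferred over output B."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))


def update_elo(ratings: dict, winner: str, loser: str, k: float = 32.0) -> None:
    """Shift ratings in place after a human prefers `winner` over `loser`."""
    expected_win = expected_score(ratings[winner], ratings[loser])
    delta = k * (1.0 - expected_win)  # larger shift for surprising results
    ratings[winner] += delta
    ratings[loser] -= delta


# Hypothetical outputs, all starting from the same (assumed) rating.
ratings = {"output_a": 1000.0, "output_b": 1000.0, "output_c": 1000.0}

# Hypothetical human judgments as (preferred, rejected) pairs.
comparisons = [
    ("output_a", "output_b"),
    ("output_a", "output_c"),
    ("output_b", "output_c"),
]

for preferred, rejected in comparisons:
    update_elo(ratings, preferred, rejected)

# Outputs sorted from highest to lowest rating.
ranked = sorted(ratings, key=ratings.get, reverse=True)
```

The resulting ratings could then serve as scalar training targets for a reward model, though that step is outside this sketch.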

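Standard RLHF assumes that human preferences follow the Bradley–Terry model for pairwise comparisons or the Plackett–Luce model for comparisons among multiple items, and it learns the reward model by minimizing a cross-entropy loss.^[6] Once the reward model has been trained, RLHF further fine-tunes the language model against the learned reward model so that the model's behavior aligns with human preferences.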
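The two preference models above can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: it assumes the reward model has already produced a scalar reward per output, the function names `bt_preference_prob`, `bt_cross_entropy`, and `plackett_luce_nll` are hypothetical, and no care is taken over numerical stability for very large rewards.

```python
import math


def bt_preference_prob(r_chosen: float, r_rejected: float) -> float:
    """Bradley-Terry probability that the chosen output beats the rejected one.

    Equivalent to sigmoid(r_chosen - r_rejected).
    """
    return 1.0 / (1.0 + math.exp(-(r_chosen - r_rejected)))


def bt_cross_entropy(reward_pairs) -> float:
    """Mean negative log-likelihood over (r_chosen, r_rejected) reward pairs."""
    losses = [-math.log(bt_preference_prob(rc, rr)) for rc, rr in reward_pairs]
    return sum(losses) / len(losses)


def plackett_luce_nll(rewards_best_first) -> float:
    """Negative log-likelihood of one full ranking under Plackett-Luce.

    `rewards_best_first` holds the rewards of the ranked outputs, with the
    output the human ranked highest in first position.
    """
    nll = 0.0
    for i in range(len(rewards_best_first) - 1):
        remaining = rewards_best_first[i:]
        log_norm = math.log(sum(math.exp(r) for r in remaining))
        nll -= rewards_best_first[i] - log_norm
    return nll


# Hypothetical rewards: the first pair agrees with the human label,
# the second pair disagrees with it and so contributes a larger loss.
pairs = [(1.2, 0.3), (0.1, 0.9)]
pairwise_loss = bt_cross_entropy(pairs)

# Hypothetical ranking of three outputs, best first.
ranking_loss = plackett_luce_nll([1.5, 0.4, -0.2])
```

With only two items, the Plackett–Luce loss reduces to the Bradley–Terry loss, which is why the pairwise case is usually treated as a special case of the ranking case. In training, these rewards would come from a parameterized model and the loss would be minimized by gradient descent.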

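RLHF is suited to tasks in which the quality of a model's output is hard to define precisely with an algorithm but easy for humans to judge. For example, if the model's task is to generate an engaging story, humans can rate the quality of different AI-generated stories, and the model can use that feedback to improve its ability to generate new ones.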

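RLHF has been applied in several areas of natural language processing, such as dialogue, text summarization, and natural language understanding. In ordinary reinforcement learning, an agent learns from its own actions according to a "reward function". In natural language processing tasks, however, the reward is often hard to define or measure, especially for complex tasks that involve human values or preferences. With RLHF, language models can give answers consistent with these complex values, generate more detailed responses, and decline questions that are inappropriate or fall outside the model's knowledge space.^[7] Language models trained with RLHF include OpenAI's ChatGPT and its predecessor InstructGPT^[4], as well as DeepMind's Sparrow.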

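Beyond natural language processing, RLHF has also been applied in other areas, such as the development of video game bots. For example, OpenAI and DeepMind have trained agents to play Atari games based on human preferences.^[8]^[9] These agents performed well across many of the test environments and often exceeded human-level performance.^[10]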

References

[1] Lambert, Nathan; Castricato, Louis; von Werra, Leandro; Havrilla, Alex. Illustrating Reinforcement Learning from Human Feedback (RLHF). huggingface.co. [4 March 2023]. (Archived from the original on 2023-03-16).
[2] MacGlashan, James; Ho, Mark K; Loftin, Robert; Peng, Bei; Wang, Guan; Roberts, David L.; Taylor, Matthew E.; Littman, Michael L. Interactive learning from policy-dependent human feedback. Proceedings of the 34th International Conference on Machine Learning - Volume 70 (JMLR.org). 6 August 2017: 2285–2294 [2023-12-11]. arXiv:1701.06049. (Archived from the original on 2023-03-04).
[3] Ouyang, Long; Wu, Jeffrey; Jiang, Xu; Almeida, Diogo; Wainwright, Carroll; Mishkin, Pamela; Zhang, Chong; Agarwal, Sandhini; Slama, Katarina. Training language models to follow instructions with human feedback. Thirty-Sixth Conference on Neural Information Processing Systems: NeurIPS 2022. 31 October 2022 [2023-12-11]. arXiv:2203.02155. (Archived from the original on 2023-03-15) (in English).
[4] Edwards, Benj. OpenAI invites everyone to test ChatGPT, a new AI-powered chatbot—with amusing results. Ars Technica. 1 December 2022 [4 March 2023]. (Archived from the original on 2023-03-15) (in American English).
[5] Abhishek, Gupta. Getting stakeholder engagement right in responsible AI. VentureBeat. 5 February 2023 [4 March 2023]. (Archived from the original on 2023-03-20).
[6] Zhu, Banghua; Jordan, Michael; Jiao, Jiantao. Principled Reinforcement Learning with Human Feedback from Pairwise or K-wise Comparisons. Proceedings of the 40th International Conference on Machine Learning (PMLR). 2023-07-03: 43037–43067 [2023-12-11]. (Archived from the original on 2023-10-27) (in English).
[7] Wiggers, Kyle. Can AI really be protected from text-based attacks?. TechCrunch. 24 February 2023 [4 March 2023]. (Archived from the original on 2023-03-16).
[8] Learning from human preferences. openai.com. [4 March 2023]. (Archived from the original on 2023-06-18).
[9] Learning through human feedback. www.deepmind.com. [4 March 2023]. (Archived from the original on 2023-03-19) (in English).
[10] Christiano, Paul F; Leike, Jan; Brown, Tom; Martic, Miljan; Legg, Shane; Amodei, Dario. Deep Reinforcement Learning from Human Preferences. Advances in Neural Information Processing Systems (Curran Associates, Inc.). 2017, 30 [4 March 2023]. (Archived from the original on 2023-03-19).
