Background

  • Background: The paper argues that building superhuman agents will require superhuman feedback to provide an adequate training signal. Current methods typically train a reward model from human preferences, which may be limited both by human performance levels and by the fact that such a separate, frozen reward model cannot improve during LLM training.

  • Existing Work: Existing methods are bottlenecked by the size and quality of human preference data and by the quality of the reward model trained from it. Even recent alternatives such as Direct Preference Optimization (DPO) are subject to a similar bottleneck, since they still optimize against a fixed set of human-labeled preference pairs (the standard DPO objective below makes this dependence explicit).
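
As a reminder of why that bottleneck persists, the standard DPO objective (from the original DPO paper, not restated in this summary) optimizes the policy $\pi_\theta$ against a fixed reference $\pi_{\text{ref}}$ over a human-labeled preference dataset $\mathcal{D}$ of prompts $x$ with chosen and rejected responses $y_w, y_l$:

$$
\mathcal{L}_{\text{DPO}}(\pi_\theta; \pi_{\text{ref}}) = -\,\mathbb{E}_{(x,\, y_w,\, y_l)\sim \mathcal{D}}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}\right)\right]
$$

Because $\mathcal{D}$ is collected from human annotators up front, the training signal cannot exceed what those annotations capture; Self-Rewarding Language Models aim to replace $\mathcal{D}$ with preference data the model generates and judges itself.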

Core Contributions

  • Challenge 1: How to avoid the bottleneck of human preference data. The paper proposes a self-improving reward model that, unlike a static model, continues to update during the alignment process of the LLM. The researchers accomplish this by training a single agent that possesses all of the desired abilities, rather than separating them into distinct models such as a reward model and a language model.

  • Challenge 2: Improving instruction-following ability and reward-modeling ability. They train Self-Rewarding Language Models with an Iterative DPO framework: the model not only follows instructions and generates responses for given prompts, but also generates new prompts, judges its own candidate responses, and adds the resulting preference pairs to its own training set (a sketch of this loop follows this list). Experiments show that both instruction-following performance and the reward-modeling ability of the model improve across iterations.
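
The following is a minimal sketch of one such training iteration, under stated assumptions: `generate(model, prompt)` and `dpo_update(model, pairs)` are hypothetical helpers standing in for text generation and a DPO training step, and `JUDGE_TEMPLATE` is a simplified stand-in for the paper's LLM-as-a-Judge scoring prompt; none of these names come from the paper or its released code.

```python
import re
import random

# Simplified stand-in for the paper's LLM-as-a-Judge prompt: the model scores
# its own candidate responses on an additive 0-5 scale.
JUDGE_TEMPLATE = """Review the user's question and the corresponding response.
Award points (0-5) for relevance, coverage, helpfulness, clarity, and expert quality.

User: {prompt}
Response: {response}

After examining the response, conclude with the line: "Score: <total points>"."""


def self_reward(model, generate, prompt, response):
    """Score a candidate response with the same model acting as judge."""
    judgement = generate(model, JUDGE_TEMPLATE.format(prompt=prompt, response=response))
    match = re.search(r"Score:\s*([0-5](?:\.\d+)?)", judgement)
    return float(match.group(1)) if match else None


def self_rewarding_iteration(model, generate, dpo_update, seed_prompts,
                             n_new_prompts=64, n_candidates=4):
    """One iteration M_t -> M_{t+1}: create data, judge it, then DPO-train on it."""
    preference_pairs = []

    for _ in range(n_new_prompts):
        # 1) Self-instruction creation: ask the model for a new prompt,
        #    conditioned on a few sampled seed prompts (few-shot style).
        examples = "\n".join(random.sample(seed_prompts, k=min(3, len(seed_prompts))))
        new_prompt = generate(
            model, f"Here are some example tasks:\n{examples}\nWrite one new task:")

        # 2) Sample several candidate responses from the same model.
        candidates = [generate(model, new_prompt) for _ in range(n_candidates)]

        # 3) Self-rewarding: the model judges its own candidates.
        scored = [(self_reward(model, generate, new_prompt, c), c) for c in candidates]
        scored = [(s, c) for s, c in scored if s is not None]
        if len(scored) < 2:
            continue

        # 4) Build a preference pair: best-scored response is "chosen",
        #    worst-scored is "rejected"; skip ties (no training signal).
        scored.sort(key=lambda sc: sc[0])
        (low_score, rejected), (high_score, chosen) = scored[0], scored[-1]
        if high_score > low_score:
            preference_pairs.append(
                {"prompt": new_prompt, "chosen": chosen, "rejected": rejected})

    # 5) Iterative DPO: train the next model on the self-generated pairs.
    return dpo_update(model, preference_pairs)
```

The key design point is that the judge and the policy are the same model, so each DPO update can improve both response quality and reward-modeling ability from one iteration to the next.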

Implementation and Deployment

The research team conducted a series of experiments and found that adding Evaluation Fine-Tuning (EFT) tasks does not compromise instruction-following performance. In head-to-head evaluations, Iteration 2 (M2) showed significant gains over Iteration 1 (M1), and Iteration 3 (M3) showed further gains over Iteration 2. The models also performed well on the AlpacaEval 2.0 leaderboard, where win rates (measured against GPT-4 Turbo) improved with each training iteration and the Iteration 3 model achieved a high win rate relative to a range of existing models.

Summary

This work introduces Self-Rewarding Language Models, intended to bypass the bottleneck of human preference data through self-training that enhances both the model's self-rewarding and instruction-following capabilities. The experimental results are promising, and the approach is a precursor to models that can continuously improve themselves.