2403.05171.md

Background

  • Background This paper discusses the effectiveness of Reinforcement Learning from Human Feedback (RLHF) in aligning Large Language Models (LLMs) with human preferences. However, the limited amount of human preference data yields an imperfect reward model that often fails to accurately represent the underlying human preferences. This approximation error leads to unreliable rewards and to the phenomenon of "reward over-optimization," where the LLM exploits high-reward states, artificially inflating the proxy reward while the true reward decreases. This misalignment leads LLMs to favor reward maximization over content quality and alignment with the user.

  • Existing Work Current strategies to mitigate reward over-optimization primarily penalize samples with high reward uncertainty during RL-based policy training. These approaches use ensembles of reward models trained with different seeds and take the variance of the estimated rewards as the uncertainty measure (see the sketch below). However, the computational cost of keeping multiple reward models in memory during policy optimization is impractical for real-world applications.
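
For concreteness, a minimal sketch of the ensemble recipe described above, assuming each reward model maps a tokenized (prompt, response) batch to one scalar score per sequence; the interface and the mean/standard-deviation convention are illustrative assumptions, not the exact setup of any particular prior method.

```python
import torch

def ensemble_reward(reward_models, input_ids, attention_mask):
    """Score a batch of (prompt, response) sequences with an ensemble of
    reward models trained from different random seeds. The ensemble mean is
    used as the proxy reward, and the standard deviation across members as
    the uncertainty estimate for penalizing unreliable samples."""
    with torch.no_grad():
        scores = torch.stack(
            [rm(input_ids, attention_mask=attention_mask) for rm in reward_models],
            dim=0,
        )                                  # shape: (num_models, batch_size)
    return scores.mean(dim=0), scores.std(dim=0)
```

Every ensemble member must be kept in memory alongside the policy and value networks during policy optimization, which is exactly the overhead the paper's lightweight alternative avoids.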

Core Contributions

  • Introduced Adversarial Policy Optimization (AdvPO)
    • Challenge 1: High Computational Costs The authors propose a lightweight approach to quantifying reward uncertainty that relies only on the last-layer embeddings of the reward model, eliminating the need for computationally expensive reward ensembles (a minimal sketch of such a last-layer uncertainty estimate follows this list). AdvPO then solves a distributionally robust optimization problem centered around the confidence interval of the reward model's predictions for policy improvement.

    • Challenge 2: Reward Over-Optimization AdvPO adversarially searches over the confidence interval of the reward model's predictions during policy optimization (a schematic of this objective also follows the list). Compared with prior work on over-optimization, this use of uncertainty is less conservative. Experiments on the Anthropic HH and TL;DR summarization datasets, including human-assisted evaluations of the learned policies, show that AdvPO effectively mitigates reward over-optimization.
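
A minimal sketch of the kind of last-layer-embedding uncertainty described under Challenge 1, assuming the reward is a linear head on top of the reward model's final-layer features; the covariance construction and ridge term below are one common, bandit-style choice and are illustrative assumptions rather than the paper's exact procedure.

```python
import torch

def embedding_covariance_inverse(phi_pref, ridge=1e-3):
    """Build Sigma^{-1} from the last-layer embeddings phi_pref of shape (n, d)
    collected on the preference data: Sigma = ridge * I + sum_i phi_i phi_i^T."""
    d = phi_pref.shape[-1]
    sigma = ridge * torch.eye(d, dtype=phi_pref.dtype) + phi_pref.T @ phi_pref
    return torch.linalg.inv(sigma)

def reward_and_uncertainty(reward_head, phi, sigma_inv):
    """Treat the reward as a linear function of the last-layer embedding phi
    (batch, d) and use sqrt(phi^T Sigma^{-1} phi) as the width of a confidence
    interval around each predicted reward."""
    reward = phi @ reward_head                                # (batch,)
    width = (phi @ sigma_inv * phi).sum(-1).clamp_min(0.0).sqrt()
    return reward, width
```

At policy-optimization time this requires only the d-dimensional reward head and a d x d matrix, rather than several full reward models held in memory.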
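
Schematically, the adversarial step described under Challenge 2 has the shape of a robust objective over the confidence region: with $\hat r$ the learned reward model, $\mathcal{C}_\delta(\hat r)$ the confidence region implied by the uncertainty estimate above, and $\pi$ the policy, one can write

$$\max_{\pi}\;\min_{\tilde r \,\in\, \mathcal{C}_\delta(\hat r)}\;\mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi(\cdot \mid x)}\big[\tilde r(x, y)\big].$$

This is only the generic form of such an objective; AdvPO's exact constraint set and regularization are specified in the paper. The inner minimization adversarially picks the least favorable reward consistent with the confidence region, so the policy is credited only for improvements that hold up under the reward model's uncertainty.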

Implementation and Deployment

AdvPO mitigates reward over-optimization while avoiding the computational burden of ensemble-based methods. It reduces over-optimization and improves overall policy performance, as validated by human-assisted evaluations. Experimental results indicate that AdvPO is effective in realistic settings compared with both standard PPO and existing uncertainty-aware methods.

Summary

The paper presents Adversarial Policy Optimization (AdvPO), a novel approach to tackling reward over-optimization in RLHF, particularly for aligning LLMs with human preferences. AdvPO alleviates reward over-optimization without incurring the high computational cost of reward ensembles.