2404.19733.md

Background

  • Background: Existing approaches to preference optimization have shown significant improvements in aligning pretrained language models with human requirements, but they generally yield only limited gains on reasoning tasks. While methods such as Self-Rewarding LLMs and SPIN have been applied, their performance on standard reasoning tasks falls short.

  • Existing Work: Current methods perform well on general instruction-tuning tasks but achieve only moderate gains, or even regress, on standard reasoning tasks. Moreover, existing methods for training generative reasoning models, such as STaR, ReST-EM, and V-STaR, do not employ preference optimization.

Core Contributions

  • Developing an iterative approach to preference optimization for reasoning
    • Challenge 1: How to optimize preferences between competing generated Chain-of-Thought (CoT) candidates so that reasoning traces leading to the correct answer are favored. The challenge lies in constructing a suitable training set of preference pairs from correct and incorrect answers and optimizing the model to learn these preferences. The paper addresses this by introducing a modified DPO (Direct Preference Optimization) loss with an additional Negative Log-Likelihood (NLL) loss term (see the loss sketch after this list). The results show that this combination is crucial for performance, steadily improving the model's reasoning capability over several rounds of iteration.

    • Challenge 2: How to improve performance on reasoning tasks using only examples from the training set, without additional datasets. Training pipelines typically rely on external data to enhance reasoning capabilities. However, the authors found that reasoning performance can be improved effectively through iterative preference optimization using only training-set examples, achieving significant gains across three different datasets (see the training-loop sketch below).
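
A minimal sketch of the combined objective from Challenge 1: a DPO loss over (chosen, rejected) CoT pairs plus an NLL term on the chosen sequences (the paper adds the NLL term on the winning CoT of each pair). The function name, the summed per-sequence log-probability inputs, and the `beta`/`alpha` weights below are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def dpo_plus_nll_loss(policy_chosen_logps, policy_rejected_logps,
                      ref_chosen_logps, ref_rejected_logps,
                      chosen_lengths, beta=0.1, alpha=1.0):
    """Sketch of a DPO loss with an added NLL term on the chosen (correct) CoTs.

    All *_logps are summed token log-probabilities per sequence (shape: [batch]);
    chosen_lengths holds the token counts of the chosen sequences.
    beta and alpha are illustrative hyperparameters, not the paper's values.
    """
    # DPO term: prefer chosen over rejected CoTs, measured by the margin of
    # policy-vs-reference log-ratios.
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    dpo_loss = -F.logsigmoid(beta * (chosen_logratio - rejected_logratio))

    # NLL term: length-normalized negative log-likelihood of the chosen
    # (correct) CoT + answer, which keeps its absolute likelihood high.
    nll_loss = -policy_chosen_logps / chosen_lengths

    return (dpo_loss + alpha * nll_loss).mean()
```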

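To illustrate Challenge 2 (improving reasoning with training-set examples only), here is a hedged sketch of the iterative outer loop: sample CoT candidates for each training question, split them into correct and incorrect using the gold answer, build preference pairs, and train the next model with the loss above. The helpers `sample_cots`, `extract_answer`, and `train_with_dpo_nll` are hypothetical placeholders, and the sample count is illustrative.

```python
import random

def iterative_rpo(model, train_set, num_iterations=3, samples_per_question=30):
    """Hypothetical outer loop: each iteration builds preference pairs from the
    training set only (no external data) and trains the next model on them."""
    for _ in range(num_iterations):
        pairs = []
        for question, gold_answer in train_set:
            # Sample several CoT + answer candidates from the current model.
            candidates = sample_cots(model, question, n=samples_per_question)
            correct = [c for c in candidates if extract_answer(c) == gold_answer]
            wrong = [c for c in candidates if extract_answer(c) != gold_answer]
            # Pair each correct CoT with a randomly chosen incorrect one.
            for chosen in correct:
                if wrong:
                    pairs.append((question, chosen, random.choice(wrong)))
        # Train the next model on the preference pairs with the DPO + NLL loss,
        # starting from the current model (an assumption in this sketch).
        model = train_with_dpo_nll(model, pairs)
    return model
```
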
Implementation and Deployment

Significant improvements were achieved with Iterative Reasoning Preference Optimization (Iterative RPO) on several benchmarks. For instance, zero-shot accuracy on GSM8K improved from 55.6% to 81.6%, and with majority voting over 32 samples the score rose to 88.7%. On ARC-Challenge, accuracy improved from 77.8% to 86.7% without using additional datasets, surpassing other Llama-2-based models. The authors also provided ablation studies indicating which components drive these improvements. Overall, the method offers a simple and effective recipe for improving the reasoning abilities of LLMs across a wide range of tasks.
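
The majority-voting figure quoted above (88.7% on GSM8K with 32 samples) corresponds to the standard self-consistency procedure: sample many CoTs, extract each final answer, and return the most frequent one. A minimal sketch, assuming a list of sampled completions and a hypothetical `extract_answer` parser:

```python
from collections import Counter

def majority_vote(completions, extract_answer):
    """Return the most common final answer among sampled CoT completions
    (self-consistency); extract_answer is assumed to parse the final answer."""
    answers = [extract_answer(c) for c in completions]
    return Counter(answers).most_common(1)[0][0]
```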

Summary

The paper proposed an iterative reasoning preference optimization method, which applies preference optimization to reasoning tasks, particularly Chain-of-Thought (CoT) reasoning, and enhances model performance by adding an NLL loss term during iterative training. Experiments showed that the method steadily improved reasoning performance over several iterations before eventually reaching saturation.