2407.00782.md


Background

  • Background: This paper addresses the challenges of solving mathematical problems with large language models (LLMs). Prior studies have shown that Direct Preference Optimization (DPO, whose standard objective is recalled after this list) improves LLM performance on downstream tasks such as reasoning and alignment. However, mathematical solutions are diverse, and an incorrect reasoning path can still reach the correct final answer by accident, so judging the quality of model-generated solutions solely by the final answer is imprecise.

  • Existing Work: Prior approaches typically rely on human or AI feedback to guide the model, such as process supervision, which requires extensive and expensive manual annotation and is only applicable to traditional reinforcement learning algorithms.
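
For reference, the standard DPO objective (Rafailov et al., 2023) trains the policy $\pi_\theta$ to prefer a chosen solution $y_w$ over a rejected one $y_l$ relative to a frozen reference model $\pi_{\mathrm{ref}}$; the reliability of these $(y_w, y_l)$ pairs is exactly what a final-answer-only criterion undermines:

$$
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta;\pi_{\mathrm{ref}}) = -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)\right]
$$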

Core Contributions

  • Introduced Step-Controlled DPO (SCDPO)
    • Challenge 1: Automatically generating dispreferred samples that start making errors at a specific step. Traditional preference-optimization methods may not capture the step-by-step nature of mathematical problem solving. SCDPO provides step-level supervision by automatically generating dispreferred samples that begin making errors at a chosen step, without any additional manual annotation (see the sketch after this list).

    • Challenge 2: Improving mathematical problem-solving performance. SCDPO helps models learn finer-grained reasoning skills. By comparing SCDPO with naive DPO on different SFT models, the study shows that SCDPO significantly improves LLM performance on mathematical problem solving.
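
The idea behind Challenge 1 can be pictured as follows. This is a minimal, hypothetical sketch (not the authors' code): keep a verified-correct reasoning prefix up to a chosen step, then let the model continue at a higher sampling temperature so an error is likely to start there. The names `generate_fn` and `answer_fn` are assumed stand-ins for the model's sampler and an answer checker.

```python
from typing import Callable, Optional

def make_dispreferred_pair(
    problem: str,
    correct_steps: list[str],
    error_step: int,
    generate_fn: Callable[[str, float], str],  # (prompt, temperature) -> continuation
    answer_fn: Callable[[str], str],           # full solution text -> final answer
    high_temp: float = 1.1,
) -> Optional[dict]:
    """Build one (chosen, rejected) pair whose rejected solution starts
    going wrong at `error_step`, following the idea described above."""
    prefix = correct_steps[:error_step]                    # steps known to be correct
    prompt = problem + "\n" + "\n".join(prefix)
    continuation = generate_fn(prompt, high_temp)          # higher temperature encourages divergence
    rejected = "\n".join(prefix + [continuation])
    chosen = "\n".join(correct_steps)
    # Keep the pair only if the continuation really ends in a wrong answer,
    # so the step where the error begins is known without manual annotation.
    if answer_fn(rejected) != answer_fn(chosen):
        return {"prompt": problem, "chosen": chosen, "rejected": rejected}
    return None
```

Pairs built this way then serve as the preferred/dispreferred sides of step-controlled DPO training.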

Implementation and Deployment

The SCDPO method proposed in the paper is applied to three different SFT models: one existing SFT model and two models fine-tuned by the authors. The results show that SCDPO consistently outperforms naive DPO across all three SFT models. Moreover, the authors fine-tuned an InternLM2-20B model with SCDPO, reaching 88.5% on GSM8K and 58.1% on MATH, competitive with other open-source large language models.
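
As a rough illustration of how such (chosen, rejected) pairs enter training, here is the generic DPO loss in plain PyTorch. This is the standard objective, not the authors' training code; SCDPO's contribution lies in how the rejected samples are constructed, not in the loss itself.

```python
import torch
import torch.nn.functional as F

def dpo_loss(
    policy_chosen_logps: torch.Tensor,    # summed token log-probs of chosen solutions under the policy
    policy_rejected_logps: torch.Tensor,  # same, for rejected (error-containing) solutions
    ref_chosen_logps: torch.Tensor,       # same quantities under the frozen reference model
    ref_rejected_logps: torch.Tensor,
    beta: float = 0.1,
) -> torch.Tensor:
    """Standard DPO objective: raise the policy's log-ratio on the chosen
    solution relative to its log-ratio on the rejected one."""
    chosen_logratios = policy_chosen_logps - ref_chosen_logps
    rejected_logratios = policy_rejected_logps - ref_rejected_logps
    logits = beta * (chosen_logratios - rejected_logratios)
    return -F.logsigmoid(logits).mean()
```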

Summary

The paper presents SCDPO, a new optimization method for mathematical reasoning. By generating training samples whose errors begin at specific steps, it provides step-level supervision and significantly improves LLM performance on mathematical problem solving, demonstrating the method's potential.