Background

  • Background Current large language models (LLMs) show strong performance across a variety of tasks but still struggle with complex, multi-step mathematical reasoning. Existing studies have approached this through pre-training, fine-tuning, prompting strategies, and verification. Among these, verification has emerged as crucial: reranking candidate answers improves the consistency and accuracy of LLM outputs. Verification models generally fall into two categories, the outcome reward model (ORM), which scores only the final answer, and the process reward model (PRM), which gives more precise feedback by evaluating the reasoning path step by step.

  • Existing Work Although existing verification models have improved LLM performance, they typically depend on manual annotation for process supervision, which is costly and demands annotator expertise, limiting the practical development of PRMs.

Core Contributions

  • Introduced an automatic process annotation framework
    • Challenge 1: Reducing dependency on manual annotations The automatic process annotation framework proposed in the paper collects step-wise supervision data autonomously, scoring each step according to how strongly it is associated with reaching the correct answer.

    • Challenge 2: Enhancing accuracy in mathematical reasoning tasks The score of a given problem-solving step is determined by the correctness of multiple subsequent reasoning paths decoded from that step, so only steps from which more correct answers can be derived receive higher correctness scores; this supervision signal in turn improves LLM accuracy on mathematical reasoning tasks (a minimal sketch of the scoring procedure follows this list).
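The following is a minimal sketch of the step-scoring idea described above: each step of a candidate solution is scored by the fraction of sampled continuations from that step that reach the gold answer. The helpers `sample_completions` and `extract_answer` are hypothetical placeholders, not functions from the paper's code.

```python
# Sketch of automatic step-wise annotation via decoded continuations.
# Assumptions (hypothetical helpers, not from the paper's codebase):
#   sample_completions(prefix, n) -> list of n sampled continuations of a partial solution
#   extract_answer(text)          -> final answer string parsed from a full solution

def annotate_steps(question, solution_steps, gold_answer,
                   sample_completions, extract_answer, n_samples=8):
    """Assign each reasoning step a correctness score in [0, 1].

    The score of step i is the fraction of continuations decoded from the
    prefix ending at step i that reach the gold answer (a "soft" estimate);
    a "hard" variant would instead use 1.0 if any continuation succeeds.
    """
    scores = []
    prefix = question
    for step in solution_steps:
        prefix = prefix + "\n" + step
        completions = sample_completions(prefix, n_samples)
        n_correct = sum(extract_answer(c) == gold_answer for c in completions)
        scores.append(n_correct / n_samples)
    return scores
```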

Implementation and Deployment

The proposed framework, MATH-SHEPHERD, was trained on automatically generated process supervision data and evaluated on two widely used mathematical benchmarks, GSM8K and MATH. It is compatible with various LLMs: paired with MATH-SHEPHERD, DeepSeek 67B achieves 93.3% accuracy on GSM8K and 48.1% on a representative subset of MATH, outperforming an early version of GPT-4 and other existing verification models and making it a top contender among open-source models.
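As a rough illustration of how such a trained PRM is used as a verifier, the sketch below reranks candidate solutions by their minimum per-step score, a common aggregation choice that treats a solution as only as reliable as its weakest step. The helper `prm_step_scores` is a hypothetical stand-in for the trained process reward model, not the paper's actual interface.

```python
# Sketch of verification-style reranking with a trained process reward model (PRM).
# Assumption (hypothetical helper): prm_step_scores(question, solution) returns a
# list of per-step scores in [0, 1] produced by the PRM.

def rerank_with_prm(question, candidate_solutions, prm_step_scores):
    """Return candidate solutions sorted best-first by their minimum step score."""
    def solution_score(solution):
        scores = prm_step_scores(question, solution)
        return min(scores) if scores else 0.0

    return sorted(candidate_solutions, key=solution_score, reverse=True)
```

Alternatives such as the product or mean of step scores are also possible aggregations; the best-ranked candidate is then taken as the model's final answer.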

Summary

MATH-SHEPHERD addresses the cost of human annotation by training on automatically generated process supervision data, thereby improving LLM accuracy on complex mathematical problems and opening new avenues for the advancement and practical application of LLMs.