2401.00757.md

Background

  • Background: The paper is motivated by the breakthroughs large language models (LLMs) have achieved in tasks such as writing assistance, code generation, and machine translation. Its focus is the reasoning capacity demonstrated by advanced LLMs such as ChatGPT, which remains hard to evaluate because current assessments emphasize answer accuracy rather than the underlying reasoning process.

  • Existing Work: Previous attempts to evaluate LLMs' reasoning abilities fall short due to issues such as data leakage and limited scope.

Core Contributions

  • Introduced an automatic method named LogicAsker
    • Challenge 1: Assessing and improving LLMs' logical reasoning capabilities. LogicAsker provides a comprehensive evaluation and improvement method grounded in propositional and predicate logic, and it uncovers the logical rules that LLMs have not learned well (a minimal sketch of such rule-based case generation follows this list).

    • Challenge 2: Identifying logical reasoning failures in different LLMs. Test cases generated by LogicAsker reveal logical reasoning failures in various LLMs at rates of 25% to 94%. These test cases can also be turned into demonstration examples for in-context learning, which effectively strengthens the logical reasoning abilities of LLMs.
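
To make the rule-based idea concrete, here is a minimal, hypothetical sketch of how a propositional inference rule (modus ponens) could be instantiated with concrete atoms and rendered as a natural-language yes/no test case. The function and atom names are illustrative assumptions, not LogicAsker's actual code.

```python
import random

# Hypothetical sketch in the spirit of LogicAsker's rule-based case generation:
# instantiate an inference rule (modus ponens) with concrete atoms and render
# it as a natural-language question with a known expected answer.
ATOMS = ["it rains", "the ground is wet", "the match is cancelled"]

def make_modus_ponens_case() -> dict:
    p, q = random.sample(ATOMS, 2)          # pick two distinct atomic statements
    premise = f"If {p}, then {q}. {p.capitalize()}."
    question = f"{premise} Does it follow that {q}?"
    return {"rule": "modus ponens", "question": question, "expected": "yes"}

if __name__ == "__main__":
    case = make_modus_ponens_case()
    print(case["question"])                  # e.g. "If it rains, then the ground is wet. It rains. Does it follow that the ground is wet?"
    print("expected answer:", case["expected"])
```

Because the expected answer is determined by the logic rule itself, a model's response can be graded automatically, which is what makes large-scale failure finding possible.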

Implementation and Deployment

LogicAsker was evaluated on six widely deployed LLMs (GPT-3, ChatGPT, GPT-4, Bard, Vicuna, and Guanaco), uncovering logical reasoning failures at rates ranging from 25% to 94%. Using LogicAsker's test cases to construct demonstration examples for in-context learning further improved logical reasoning, for example yielding a 10% improvement for GPT-4.
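
The two downstream uses described above (measuring a failure rate and recycling failed cases as in-context demonstrations) could look roughly like the following sketch. `query_llm` is a stand-in assumption for whatever model client is actually used; none of these names come from the paper.

```python
from typing import Callable

# Hypothetical sketch: (1) compute a model's failure rate on generated cases,
# (2) turn cases into few-shot demonstrations for in-context learning.

def failure_rate(cases: list[dict], query_llm: Callable[[str], str]) -> float:
    # A case counts as a failure when the model's answer disagrees with the
    # rule-determined expected answer.
    wrong = [c for c in cases
             if query_llm(c["question"]).strip().lower() != c["expected"]]
    return len(wrong) / len(cases)

def build_icl_prompt(demonstrations: list[dict], new_question: str) -> str:
    # Prepend solved cases as question/answer shots before the new question.
    shots = "\n\n".join(f"Q: {d['question']}\nA: {d['expected']}"
                        for d in demonstrations)
    return f"{shots}\n\nQ: {new_question}\nA:"
```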

Summary

This work proposes LogicAsker, which addresses the challenge of evaluating and improving the logical reasoning abilities of LLMs: rule-based problem generation provides a comprehensive assessment, and the resulting test cases drive effective enhancement via in-context learning.