2312.0435.md

Background

  • Background Scientists in artificial intelligence and causal inference have long aspired to build a machine capable of conducting sound causal reasoning and answering causal questions at scale. Meanwhile, recent advances in large language models (LLMs) have brought about a paradigm shift in NLP and artificial intelligence.

  • Existing Work Despite significant advances in their capabilities, LLMs remain limited in formal causal reasoning and in accurately answering causal questions, particularly where amortized causal reasoning is expected to fail, such as in counterintuitive or nonsensical causal scenarios.

Core Contributions

  • Introducing a method to test formal causal reasoning in LLMs
    • Challenge 1: How to systematically test and evaluate LLMs on formal causal reasoning? This challenge is addressed by introducing the CLADDER dataset, which anchors causal questions posed in natural language to symbolic questions whose ground-truth answers are derived by a causal inference engine following Judea Pearl's approach to causal reasoning. The dataset consists of over 10,000 causal questions spanning the three rungs of the Ladder of Causation: associational, interventional, and counterfactual (see the first sketch after this list).

    • Challenge 2: How to guide LLMs to conduct sound causal reasoning and solve complex causal questions? The paper introduces CAUSALCOT, a chain-of-thought prompting strategy inspired by the causal inference engine, which prompts the LLM to extract the causal graph, the causal query, and the available "data" from the question before deriving the answer. Experiments show that CAUSALCOT improves LLMs' performance on the CLADDER dataset (see the second sketch after this list).

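To make Challenge 1 concrete, here is a minimal sketch of what a CLADDER-style item might look like: a natural-language causal question paired with a symbolic query whose ground truth is computed from the probabilities stated in the question. The graph, field names, and numbers are illustrative assumptions, not the paper's actual data format.

```python
# Hypothetical CLADDER-style item: the ground truth comes from a causal computation,
# here a backdoor adjustment on a confounded graph Z -> X, Z -> Y, X -> Y.
from dataclasses import dataclass

@dataclass
class CausalItem:
    question: str   # natural-language question posed to the LLM
    graph: dict     # parents of each node, e.g. {"X": ["Z"], "Y": ["X", "Z"]}
    rung: int       # 1 = associational, 2 = interventional, 3 = counterfactual
    query: str      # symbolic form of the estimand

item = CausalItem(
    question="Does taking the drug (X) increase recovery (Y), accounting for age (Z)?",
    graph={"Z": [], "X": ["Z"], "Y": ["X", "Z"]},
    rung=2,
    query="E[Y | do(X=1)] - E[Y | do(X=0)]",
)

# Probabilities "available" in the question text (made-up numbers for illustration).
p_z1 = 0.4                      # P(Z=1)
p_y1_given = {                  # P(Y=1 | X=x, Z=z)
    (0, 0): 0.2, (0, 1): 0.5,
    (1, 0): 0.6, (1, 1): 0.8,
}

def ate_backdoor(p_z1, p_y1_given):
    """Ground-truth ATE via backdoor adjustment over Z:
    sum_z P(z) * [P(Y=1 | X=1, z) - P(Y=1 | X=0, z)]."""
    ate = 0.0
    for z, p_z in ((0, 1 - p_z1), (1, p_z1)):
        ate += p_z * (p_y1_given[(1, z)] - p_y1_given[(0, z)])
    return ate

print(f"{item.query} = {ate_backdoor(p_z1, p_y1_given):.2f}")  # -> 0.36
```
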
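For Challenge 2, the following is a rough sketch of a CAUSALCOT-style prompt, following the steps described above: extract the graph, the query, and the given "data" before computing the answer. The exact step wording and the ask_llm() helper are assumptions for illustration; the paper's actual prompt may differ.

```python
# Single-prompt chain-of-thought template mirroring the causal inference engine's steps.
CAUSALCOT_TEMPLATE = """Question: {question}

Guidance: Address the question by following the steps below:
Step 1: Extract the causal graph implied by the question (list the edges).
Step 2: Determine the query type (associational, interventional, or counterfactual)
        and write the corresponding symbolic estimand.
Step 3: Extract all numerical data (probabilities) given in the question.
Step 4: Derive the estimand using the rules of causal inference, plug in the data,
        and report the final answer (yes/no).
"""

def causal_cot(question: str, ask_llm) -> str:
    """Run the prompting strategy; ask_llm is any text-in/text-out LLM call."""
    return ask_llm(CAUSALCOT_TEMPLATE.format(question=question))
```
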
Implementation and Deployment

Across extensive experiments on various LLMs, the CAUSALCOT strategy achieved an overall accuracy of 66.64% on the CLADDER dataset, 2.36 percentage points higher than vanilla GPT-4. The study also found that providing task-related examples for in-context learning improved the models' generalization to new types of queries. These findings highlight the limitations of LLMs in formal causal reasoning and point to important directions for future research on improving LLMs.
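
As a minimal sketch, accuracy figures like those above could be computed by comparing each model's yes/no answer against the dataset's ground truth. The record format and the answer-normalisation heuristic below are assumptions, not the paper's actual evaluation harness.

```python
# Naive accuracy computation over (model_answer, ground_truth) pairs with yes/no labels.
def normalise(answer: str) -> str:
    """Map a free-form model response to 'yes'/'no' (very crude heuristic)."""
    return "yes" if "yes" in answer.lower() else "no"

def accuracy(records):
    """records: iterable of (model_answer, ground_truth) pairs."""
    records = list(records)
    correct = sum(normalise(a) == g for a, g in records)
    return correct / len(records)

# Example with made-up predictions:
print(accuracy([("Yes, the effect is positive.", "yes"), ("No.", "no"), ("Yes.", "no")]))  # ~0.67
```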

Summary

This research introduces the CLADDER dataset and the CAUSALCOT chain-of-thought prompting strategy to test and analyze the formal causal reasoning abilities of large language models (LLMs), highlighting their limitations and suggesting directions for future research.