Skip to content

Latest commit

 

History

History
20 lines (15 loc) · 2.73 KB

2405.1887.md

File metadata and controls

20 lines (15 loc) · 2.73 KB

Background

  • Background The paper discusses the extent to which Large Language Models (LLMs) have developed higher-order Theory of Mind (ToM), a human ability to recursively reason about multiple mental and emotional states (e.g., "I think that you believe that she knows"). ToM is central to human social intelligence, enabling the prediction and influence of behavior.

  • Existing Work Existing literature has primarily focused on 2nd-order ToM in LLMs, concerned with the handling of one mental state. However, as LLMs are increasingly deployed in multi-party social interaction contexts requiring higher-order ToM reasoning, the current work does not address this challenge.

Core Contributions

  • Introducing a new benchmark
    • Challenge 1: Higher-order theory of mind reasoning The performance of current LLMs on higher-order theory of mind reasoning tasks is unclear when compared to an adult human benchmark. To address this, the paper introduces MoToMQA (Multi-Order Theory of Mind Q&A) as a new benchmark based on psychological tests designed to assess adults' higher-order ToM abilities, answering true/false questions about characters in short-form stories. This benchmark provides a more accurate comparison of LLMs' performance with humans on higher-order ToM tasks.

    • Challenge 2: Comparison with adult human benchmark Comparing the performance of LLMs to a children's benchmark instead of adults may not be accurate, as LLMs' main interaction partners will be adults, and human higher-order ToM capacities continue to develop into early adulthood. The researchers calibrate LLM results against a newly-collected adult human benchmark in this paper, using log probabilities (logprobs) as a measure of the LLMs' preferred responses to enhance the robustness of the data.

Implementation and Deployment

The experiments in the paper show that GPT-4 and Flan-PaLM achieve at-human or near-human performance on ToM tasks, respectively. GPT-4 even exceeds adult performance on 6th order inferences. This suggests an interplay between model size and fine-tuning for realizing ToM abilities, and that the best-performing LLMs have developed a generalized capacity for ToM. These findings have significant implications for user-facing LLM applications, as higher-order ToM plays a key role in a wide range of cooperative and competitive human behaviors.

Summary

The study showcases the performance of LLMs on higher-order Theory of Mind (ToM) tasks, specifically demonstrating that models like GPT-4 can achieve adult-level performance on some tasks. The introduction of a new benchmark based on an adult human benchmark helps to reveal and understand the potential and limitations of LLMs in complex social interactions.