2402.13249.md

Background

  • Background This paper asks whether recent progress on faithfulness in single-document news summarization, driven largely by research on evaluating factual consistency (hallucinations), carries over to other summarization domains, specifically topic-focused dialogue summarization.
  • Existing Work Prior research has concentrated on news summarization. Dialogue summarization poses additional challenges: dialogues are informal and colloquial, and their interactive nature requires models to recontextualize information spread across the conversation.

Core Contributions

  • Proposed an assessment benchmark
    • Challenge 1: Factual consistency assessment of topic-focused dialogue summarization The researchers propose TOFUEVAL, an evaluation benchmark consisting of 100 dialogues with 15 LLM-generated summaries each, accompanied by detailed human annotations on factual consistency, relevance, and completeness. The study found a substantial number of factual errors in dialogue summaries generated by LLMs of varying sizes.
    • Challenge 2: LLM capabilities as factual consistency evaluators The paper examines how well LLMs serve as factual consistency evaluators and finds that, on binary factual consistency prediction, they fall short of specialized state-of-the-art (SOTA) evaluation methods. The paper also notes that no prompts had yet been optimized to reduce error rates on specific error types.
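
To make the binary prediction setting in Challenge 2 concrete, here is a minimal sketch of how an evaluator's consistent/inconsistent predictions could be scored against human labels. The `Judgment` schema, the `balanced_accuracy` helper, and the toy data are hypothetical, and the choice of balanced accuracy as the metric is an assumption made for illustration; this summary does not state which metric the paper uses.

```python
from dataclasses import dataclass


@dataclass
class Judgment:
    """One summary with a human label and an evaluator's prediction (hypothetical schema)."""
    summary_id: str
    human_consistent: bool      # human annotation
    predicted_consistent: bool  # evaluator output (an LLM or a specialized metric)


def balanced_accuracy(judgments):
    """Mean of the recall on consistent summaries and the recall on inconsistent ones."""
    pos = [j for j in judgments if j.human_consistent]
    neg = [j for j in judgments if not j.human_consistent]
    if not pos or not neg:
        raise ValueError("need at least one example of each class")
    recall_pos = sum(j.predicted_consistent for j in pos) / len(pos)
    recall_neg = sum(not j.predicted_consistent for j in neg) / len(neg)
    return (recall_pos + recall_neg) / 2


# Toy data, invented for illustration only (not taken from the benchmark).
toy = [
    Judgment("s1", True, True),
    Judgment("s2", True, True),
    Judgment("s3", True, False),
    Judgment("s4", False, True),
    Judgment("s5", False, False),
]
score = balanced_accuracy(toy)
print(f"balanced accuracy: {score:.3f}")
```

Averaging the two per-class recalls keeps an evaluator from scoring well simply by labeling every summary as consistent, which matters when one class dominates the data.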

Implementation and Deployment

Through experiments and human annotation, the study found significant factual consistency errors in LLM-generated summaries, especially at the summary level. On binary factual consistency prediction, even some of the latest models, including GPT-4, did not perform well. The research therefore establishes a factuality evaluation benchmark built on LLM-generated summaries and analyzes the performance of various models and evaluation methods, covering both non-LLM SOTA factuality metrics and LLMs used as evaluators.
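
One way to see why error rates can look higher at the summary level is to aggregate per-sentence labels into a per-summary label. The sketch below assumes that a summary counts as inconsistent if any one of its sentences is inconsistent; that aggregation rule, the function names, and the toy labels are all assumptions for illustration, not details taken from the paper.

```python
def summary_is_consistent(sentence_labels):
    """Assumed rule: a summary is consistent only if every sentence is consistent."""
    return all(sentence_labels)


def error_rates(summaries):
    """Return (sentence-level error rate, summary-level error rate).

    `summaries` is a list of summaries, each a list of booleans where
    True means the sentence was judged factually consistent.
    """
    sentences = [label for summary in summaries for label in summary]
    sentence_rate = sum(not label for label in sentences) / len(sentences)
    summary_rate = sum(not summary_is_consistent(s) for s in summaries) / len(summaries)
    return sentence_rate, summary_rate


# Toy labels, invented for illustration only.
toy_summaries = [
    [True, True, True],
    [True, False, True],
    [True, True, False],
    [True, True, True],
]
sentence_rate, summary_rate = error_rates(toy_summaries)
print(f"sentence-level: {sentence_rate:.2f}, summary-level: {summary_rate:.2f}")
```

Under this rule a single erroneous sentence marks the whole summary as inconsistent, so the summary-level rate is never lower than the sentence-level rate.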

Summary

The paper introduces TOFUEVAL, a new benchmark for evaluating the factual consistency of LLM-generated topic-focused dialogue summaries. The study uncovered extensive factual errors in dialogue summaries generated by LLMs of varying sizes.