Context Does Matter: Implications for Crowdsourced Evaluation Labels in Task-Oriented Dialogue Systems, NAACL'24 Findings

This repository contains a subset of the ReDial dataset with additional annotations on relevance and usefulness, collected to analyze how different context sizes and types influence the consistency of human evaluation labels.

Overview

The study underscores the significant role that varying context plays in the reliability and consistency of crowdsourced evaluations. We present a comprehensive analysis using datasets that include multiple context sizes and types, offering insights into how evaluators perceive and judge dialogue quality based on the given context.

In addition, we leverage an LLM to generate user information needs and dialogue summaries.

Repository Structure

  • Annotation interfaces/ - Interfaces and tools used for collecting evaluation data.
  • Data/ - Datasets segmented into relevance and usefulness assessments, reflecting different context settings.
    • relevance/ - Contains datasets specifically assessing the relevance of responses within given contexts.
    • usefulness/ - Contains datasets evaluating the usefulness of responses within contexts.

Data Description

Relevance Assessments

Relevance assesses whether the system's response is pertinent to the user's original request, factoring in the dialogue context. Files in the Data/relevance/ directory correspond to different experimental conditions (a loading sketch follows the list):

  • C_0.json: Evaluates the relevance of responses with no preceding context.
  • C_3.json: Analyzes the relevance of responses with partial context, including three prior user-system interactions.
  • C_7.json: Assesses the relevance of responses with full context, including seven prior user-system exchanges.
  • C_0-llm.json: Examines the relevance of responses using a user information need generated by an LLM instead of a traditional dialogue context.
  • C_0-heu.json: Tests the relevance of responses against a heuristically generated user information need, replacing traditional dialogue context.
  • C_0-sum.json: Evaluates the relevance of responses within the context of a dialogue summary, replacing traditional dialogue context.
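
The snippet below is a minimal loading sketch, not code shipped with this repository: it assumes the Data/relevance/ layout described above and inspects each file's top-level structure rather than assuming a particular record schema.

```python
import json
from pathlib import Path

# Adjust to the location of your local checkout.
RELEVANCE_DIR = Path("Data/relevance")
CONDITIONS = ["C_0", "C_3", "C_7", "C_0-llm", "C_0-heu", "C_0-sum"]

for condition in CONDITIONS:
    with (RELEVANCE_DIR / f"{condition}.json").open(encoding="utf-8") as f:
        records = json.load(f)
    # Report the number of entries and the top-level keys of the first one,
    # so the actual schema can be confirmed before running any analysis.
    first = records[0] if isinstance(records, list) else records
    print(f"{condition}: {len(records)} entries, keys: {sorted(first)}")
```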

Usefulness Assessments

Usefulness examines how beneficial the system's response is, considering the user's articulated information need. This evaluation uniquely includes the user's next utterance as feedback to assess the utility of the response. The Data/usefulness/ directory mirrors the structure of Data/relevance/ but focuses on the practical utility of responses (a comparison sketch follows the list):

  • C_0.json: Evaluates the usefulness of responses with no preceding context.
  • C_3.json: Analyzes the usefulness of responses with partial context, including three prior user-system interactions.
  • C_7.json: Assesses the usefulness of responses with full context, including seven prior user-system exchanges.
  • C_0-llm.json: Examines the usefulness of responses using a user information need generated by a Large Language Model (LLM) instead of the traditional dialogue context.
  • C_0-heu.json: Tests the usefulness of responses against a heuristically generated user information need, replacing traditional dialogue context.
  • C_0-sum.json: Evaluates the usefulness of responses within the context of a dialogue summary, replacing traditional dialogue context.
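
As a hedged comparison sketch (again, not part of the repository), the snippet below averages the usefulness rating per context condition; the field name "usefulness" is an assumption about the record schema and should be adapted after inspecting the files with the loading sketch above.

```python
import json
from pathlib import Path
from statistics import mean

USEFULNESS_DIR = Path("Data/usefulness")  # adjust to your local checkout
CONDITIONS = ["C_0", "C_3", "C_7", "C_0-llm", "C_0-heu", "C_0-sum"]

for condition in CONDITIONS:
    with (USEFULNESS_DIR / f"{condition}.json").open(encoding="utf-8") as f:
        records = json.load(f)
    # "usefulness" is a hypothetical field name for the numeric rating;
    # records without it are skipped.
    ratings = [r["usefulness"] for r in records if "usefulness" in r]
    if ratings:
        print(f"{condition}: mean usefulness {mean(ratings):.2f} over {len(ratings)} labels")
    else:
        print(f"{condition}: no 'usefulness' field found; check the schema")
```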

Evaluation Metrics

The datasets in this repository include a range of metrics, divided into Main Metrics and Annotator-Based Metrics. These metrics provide a foundation for evaluating system responses in task-oriented dialogue systems and for understanding the impact of context on crowd workers' evaluations.

Main Metrics

  • Relevance: Numerical rating that measures the relevance of the system's response to the user's initial query.
  • Usefulness: Evaluation of how effectively the system's response satisfies the user's informational needs.
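
Since the paper's focus is label consistency, a natural first analysis is inter-annotator agreement per context condition. The sketch below computes simple pairwise percent agreement on the relevance labels; the field names "item_id" and "relevance" are hypothetical, and the grouping assumes one record per annotator-item pair.

```python
import json
from collections import defaultdict
from itertools import combinations
from pathlib import Path

def pairwise_agreement(labels_by_item):
    """Fraction of annotator pairs, per item, that assigned identical labels."""
    agree = total = 0
    for labels in labels_by_item.values():
        for a, b in combinations(labels, 2):
            agree += int(a == b)
            total += 1
    return agree / total if total else float("nan")

for condition in ["C_0", "C_3", "C_7", "C_0-llm", "C_0-heu", "C_0-sum"]:
    with (Path("Data/relevance") / f"{condition}.json").open(encoding="utf-8") as f:
        records = json.load(f)
    labels_by_item = defaultdict(list)
    for r in records:
        # "item_id" and "relevance" are hypothetical field names standing in
        # for the dialogue identifier and an individual annotator's label.
        labels_by_item[r["item_id"]].append(r["relevance"])
    print(f"{condition}: pairwise agreement {pairwise_agreement(labels_by_item):.3f}")
```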

Annotator-Based Metrics

These metrics assess additional aspects of the evaluation process, focusing on the annotators' experience and decision-making influenced by the context provided. These metrics were not used in our NAACL 2024 paper but offer valuable data for further research into how contextual information impacts annotator perceptions and evaluation behaviors.

  • Confidence: Reflects the assessor's confidence in their judgments of relevance or usefulness.

  • Task Duration: Time taken by the assessor to complete their evaluation, recorded in seconds.

  • Ease of Evaluation: Assessor's perceived difficulty in evaluating the dialogue, which can vary with the amount and type of context provided.

  • User Preference: Indicates whether the assessor believes the user's preferences were considered in the system's response.

  • Context Usefulness: Assessor's rating of the utility of the provided context in understanding and making evaluations.

  • Feedback: Qualitative feedback provided by the assessor regarding their evaluation experience.
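
As one example of how these metrics might be used, the hedged sketch below correlates assessor confidence with task duration within a single condition; the field names "confidence" and "task_duration" are assumptions about the record schema, and statistics.correlation requires Python 3.10+.

```python
import json
from pathlib import Path
from statistics import correlation  # Pearson's r, Python 3.10+

with (Path("Data/relevance") / "C_7.json").open(encoding="utf-8") as f:
    records = json.load(f)

# "confidence" and "task_duration" are hypothetical field names; keep only
# records that carry both so the two lists stay aligned.
pairs = [(r["confidence"], r["task_duration"])
         for r in records if "confidence" in r and "task_duration" in r]
if len(pairs) >= 2:
    confidence, duration = zip(*pairs)
    print(f"Pearson r(confidence, task duration) = {correlation(confidence, duration):.3f}")
else:
    print("Fields not found; inspect the schema first.")
```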

Research Directions for Annotator-Based Metrics

These additional annotator-based metrics open several avenues for future research, providing insights into the effects of contextual information on human evaluation processes:

  • Annotator Decision-Making: How does the amount and type of context influence the speed and confidence of annotators' decisions?
  • Context Dependency: Which types of dialogue systems (e.g., task-oriented vs. open-domain) exhibit stronger dependencies on context for effective evaluation?

Researchers are encouraged to use this dataset to pursue these and other questions related to the dynamics of context in dialogue system evaluation.

License

This project is licensed under the MIT License.
