Background

  • Background: Large Language Models (LLMs) have become fundamental in a wide range of applications thanks to the extensive knowledge they acquire through pre-training and fine-tuning. However, they are prone to generating factual and commonsense errors, raising concerns in critical areas such as healthcare, journalism, and education, where such errors could mislead users.

  • Existing Work: Current methods for triggering errors in LLMs have shortcomings that need to be addressed: (1) High Costs: existing benchmarks rely heavily on question formulation and human annotation, which is effort-intensive; (2) Data Contamination: LLM evaluations suffer from data leakage, since the internet-sourced corpora the models are trained on may include publicly available evaluation data; (3) Limited Coverage: previous approaches often focus on a narrow set of relations or are constrained by limited question syntax; (4) Different Testbed: most QA testing frameworks target closed-domain QA models, which does not reflect how LLMs are typically used.

Core Contributions

  • Introduced an automated testing framework called FactChecker
    • Challenge 1: Inefficient error detection. FactChecker addresses this challenge by using an automatically constructed knowledge graph and a rule-based question generation process to locate factual inaccuracies in LLMs automatically.

    • Challenge 2: Data pollution and limited coverage. FactChecker builds its knowledge graph from fact triplets retrieved from a large-scale knowledge database and generates questions about these triplets, which reduces the risk of data leakage and diversifies the test questions so that factual errors in LLMs are revealed more reliably (a minimal sketch of this triplet-to-question step follows this list).
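
To make the triplet-to-question step concrete, here is a minimal Python sketch of rule-based question generation from a (subject, relation, object) fact triplet. The `Triplet` class, the template strings, and the example facts are illustrative assumptions, not the paper's actual grammar rules, which cover many more relation types and multi-hop compositions.

```python
# Hypothetical sketch: turning a fact triplet into Yes-No, WH, and
# Multiple-Choice test questions with an expected answer for each.
from dataclasses import dataclass
from typing import List, Optional, Tuple
import random

@dataclass
class Triplet:
    subject: str    # e.g. "France"
    relation: str   # e.g. "capital of" (a Wikidata-style relation label)
    obj: str        # e.g. "Paris"

def make_wh(t: Triplet) -> Tuple[str, str]:
    """WH question whose expected answer is the triplet's object."""
    return f"What is the {t.relation} {t.subject}?", t.obj

def make_yes_no(t: Triplet, distractor: Optional[str] = None) -> Tuple[str, str]:
    """Yes-No question; swapping in a wrong object flips the expected answer."""
    obj = distractor if distractor is not None else t.obj
    expected = "No" if distractor is not None else "Yes"
    return f"Is {obj} the {t.relation} {t.subject}?", expected

def make_mc(t: Triplet, distractors: List[str]) -> Tuple[str, str]:
    """Multiple-choice question with the true object hidden among distractors."""
    options = distractors + [t.obj]
    random.shuffle(options)
    lettered = "  ".join(f"({chr(65 + i)}) {o}" for i, o in enumerate(options))
    return f"What is the {t.relation} {t.subject}? {lettered}", t.obj

if __name__ == "__main__":
    t = Triplet("France", "capital of", "Paris")
    print(make_wh(t))                         # ('What is the capital of France?', 'Paris')
    print(make_yes_no(t, distractor="Lyon"))  # ('Is Lyon the capital of France?', 'No')
    print(make_mc(t, ["Lyon", "Marseille", "Nice"]))
```

Because every question is derived from a triplet rather than copied from an existing benchmark, the expected answer is known by construction, which is what allows the checking step to run without human annotation.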

Implementation and Deployment

FactChecker starts by building a structured knowledge graph from fact triplets in databases such as Wikidata. It then automatically generates a variety of questions, covering Yes-No, Multiple-Choice (MC), and WH forms and addressing single-hop and multi-hop relations across different topics and entities. By comparing LLMs' responses with the expected answers derived from the knowledge graph, the framework identifies potential factual errors. Extensive tests on six well-known LLMs showed that FactChecker can trigger factual errors in up to 45% of the questions, depending on the model. The experiments also showed that FactChecker's test cases can improve LLMs' factual accuracy through in-context learning and fine-tuning (e.g., llama-2-13b-chat's accuracy increased from 35.3% to 68.5%).
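The checking step can be sketched as a simple loop that queries the model under test and compares its reply with the expected answer from the knowledge graph. In the sketch below, `query_llm` is a placeholder for whatever model API is being tested, and the substring match is a deliberate simplification rather than the paper's actual answer-extraction procedure.

```python
# Hypothetical sketch: measure the factual error rate of a model on
# (question, expected answer) pairs generated from the knowledge graph.
from typing import Callable, List, Tuple

def evaluate(test_cases: List[Tuple[str, str]],
             query_llm: Callable[[str], str]) -> float:
    """Return the fraction of generated questions the model answers incorrectly."""
    errors = 0
    for question, expected in test_cases:
        reply = query_llm(question)
        # Crude check: the expected answer string should appear in the reply.
        if expected.lower() not in reply.lower():
            errors += 1
    return errors / max(len(test_cases), 1)

# Usage with a stub model that always answers "Paris".
if __name__ == "__main__":
    cases = [
        ("What is the capital of France?", "Paris"),
        ("What is the capital of Japan?", "Tokyo"),
    ]
    rate = evaluate(cases, query_llm=lambda q: "The answer is Paris.")
    print(f"factual error rate: {rate:.0%}")   # 50%
```

Questions the model gets wrong can then be kept as test cases for in-context learning or fine-tuning, which is how the reported accuracy improvements were obtained.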

Summary

FactChecker, the framework introduced in this paper, provides a new automated way to test large language models for factual inaccuracies; by constructing a knowledge graph and generating test questions from it, it has been shown both to uncover factual errors in these models and to help reduce them.