2403.18349.md

Background

  • Background LLMs often generate erroneous outputs known as hallucinations because they cannot discern which questions lie beyond their knowledge scope. Current research has focused on improving answer correctness but has largely ignored the importance of rejection mechanisms.
  • Existing Work To handle questions that fall outside their knowledge scope, many studies have attempted to augment the knowledge of LLMs through curated training data or retrieval-augmented generation (RAG) at inference time. Nonetheless, even the most powerful models, such as GPT-4, remain prone to hallucinations, suggesting that the issue stems from a model's misalignment with its own knowledge boundary.

Core Contributions

  • Challenge 1: Evaluating the honesty of LLMs is complex Assessing a model's ability to discriminate between what it knows and does not know can be viewed as a classification problem. However, accuracy can differ significantly across models and even across prompts, and traditional classification metrics are unsuitable because there are no fixed ground-truth labels and the label distribution is dynamic (a question one model can answer may lie outside another model's knowledge); see the illustrative sketch after this list.
  • Challenge 2: Existing alignment frameworks struggle to effectively enhance the honesty of models The reward models in existing alignment frameworks such as RLHF are trained primarily on model-agnostic preference data, making it difficult to improve honesty across different models using such generic alignment data.
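
The paper defines new evaluation metrics rather than relying on standard classification accuracy; those exact metrics are not reproduced in this note. As a rough illustration only, one could score answer/refuse behaviour by crediting correct answers and justified refusals while penalizing everything else, as in the Python sketch below (the `Prediction` fields and the scoring rule are assumptions, not the authors' definition).

```python
from dataclasses import dataclass


@dataclass
class Prediction:
    answered: bool     # False if the model refused to answer
    correct: bool      # only meaningful when answered is True
    in_boundary: bool  # whether the question lies within the model's knowledge


def reliability_score(preds: list[Prediction]) -> float:
    """Illustrative reliability score (not the paper's exact metric):
    credit correct answers and refusals on unanswerable questions."""
    good = sum(
        1 for p in preds
        if (p.answered and p.correct) or (not p.answered and not p.in_boundary)
    )
    return good / len(preds) if preds else 0.0
```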

Implementation and Deployment

To tackle the above challenges, the researchers introduce the Reinforcement Learning from Knowledge Feedback (RLKF) framework. RLKF uses knowledge feedback to dynamically determine the model's knowledge boundary and trains a reliable reward model that encourages the model to reject questions beyond that boundary (a minimal sketch of this idea follows below). Experiments on an in-domain arithmetic dataset and the out-of-domain GSM8K dataset show that RLKF significantly improves the reliability of the baseline models on mathematical questions.
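
The authors' training code is not included in this summary; the snippet below is a minimal Python sketch of the kind of knowledge-feedback signal an RLKF-style setup could use. The reward values, the consistency threshold, and the helper names are illustrative assumptions, not the paper's implementation.

```python
def knowledge_feedback_reward(in_boundary: bool,
                              refused: bool,
                              correct: bool) -> float:
    """Illustrative reward shaping (values are assumptions):
    answer correctly inside the boundary -> best, refuse outside -> good,
    hallucinate -> worst, over-refuse a known question -> small penalty."""
    if refused:
        return 0.5 if not in_boundary else -0.2
    return 1.0 if correct else -1.0


def estimate_in_boundary(sampled_answers: list[str],
                         reference: str,
                         threshold: float = 0.7) -> bool:
    """Heuristically place a question inside the knowledge boundary if most
    sampled answers agree with the reference (threshold is an assumption)."""
    if not sampled_answers:
        return False
    agreement = sum(a.strip() == reference.strip() for a in sampled_answers)
    return agreement / len(sampled_answers) >= threshold
```

As a usage note, during reward-data construction one would sample the policy model several times per question, call `estimate_in_boundary` to label the question, and then use `knowledge_feedback_reward` to rank "answer" versus "refuse" responses for reward-model training.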

Summary

By introducing the RLKF framework and defining new evaluation metrics, this paper addresses LLM hallucinations and improves model honesty and reliability, pointing toward a method for building more trustworthy AI systems.