2406.12624.md

File metadata and controls

20 lines (15 loc) · 2.82 KB

Background

  • Background The paper addresses the alignment and vulnerabilities of LLMs when functioning as judges. The LLM-as-a-judge paradigm offers a promising solution to the scalability problems associated with human evaluation and is becoming increasingly popular for assessing LLMs. However, questions remain about the strengths and weaknesses of this approach, as well as any biases it may have.

  • Existing Work Past research has primarily focused on automated methods for LLM evaluation, using benchmarks such as MMLU, TruthfulQA, and GSM8K, or leaderboards such as Chatbot Arena and the Open LLM Leaderboard. However, these methods still typically require human involvement in scoring, which is costly and hard to scale. Moreover, multiple-choice question (MCQ) benchmarks limit the range of abilities that can be evaluated and increasingly diverge from how LLMs are used in practice.

Core Contributions

  • Comprehensive assessment of the performance and biases of LLMs serving as judges
    • Challenge 1: How to assess the accuracy of LLM judges under conditions of high human agreement (up to 96%). The researchers used TriviaQA as a benchmark for assessing the objective knowledge reasoning of LLMs and compared the judges' verdicts with human annotations. They found that using Cohen's Kappa as the alignment metric, rather than simple percent agreement, better differentiates judges: judges with similarly high percent agreement can still assign vastly different scores to the exam-taker models (a toy illustration follows this list).

    • Challenge 2: Evaluating the effects of instruction length and leniency bias on judge models. The study found that, even in straightforward setups, only the best models are suitable as judges: GPT-4 Turbo and Llama-3 70B, while showing high agreement with humans, still had alignment rates below inter-human agreement. Additional error analyses and studies provide further insights for the future use of LLMs as judges.
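
The point in Challenge 1 about percent agreement versus Cohen's Kappa can be made concrete with a minimal sketch. The verdicts below are toy, hand-made labels (not the paper's data), and the sketch assumes scikit-learn's `cohen_kappa_score`: two judges reach identical percent agreement with a human annotator on a skewed label distribution, yet chance-corrected agreement separates the judge that tracks the human from the always-lenient one.

```python
from sklearn.metrics import cohen_kappa_score

def percent_agreement(a, b):
    """Fraction of items on which two annotators give the same verdict."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

# 1 = answer judged correct, 0 = answer judged incorrect.
# The class distribution is skewed, as on a benchmark where most answers are correct.
human   = [1, 1, 1, 1, 1, 1, 1, 1, 1, 0]
judge_a = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]  # one disagreement, but catches the incorrect answer
judge_b = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]  # lenient judge that marks everything correct

for name, judge in [("judge_a", judge_a), ("judge_b", judge_b)]:
    print(name,
          f"percent agreement = {percent_agreement(human, judge):.2f},",
          f"Cohen's kappa = {cohen_kappa_score(human, judge):.2f}")

# Both judges reach the same percent agreement (0.90), yet kappa separates them,
# because it corrects for the agreement expected by chance on a skewed label distribution.
```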

Implementation and Deployment

The study evaluated nine judge models and nine exam-taker models, including base and instruction-tuned variants. The results showed that while Llama-3 70B and GPT-4 Turbo align excellently with human judges, they were surpassed in ranking the exam-taker models by JudgeLM-7B and by lexical matching methods such as Contains, even though those judges have up to 34 percentage points lower human alignment. Comparing the judge models' average percent agreement and Cohen's Kappa with human judges showed that most judge models align poorly with humans; only Llama-3 70B's and GPT-4 Turbo's Cohen's Kappa coefficients indicate excellent alignment.
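
A hypothetical simulation (none of the judge names, model names, or numbers below come from the paper) illustrates how this ranking result can arise: a judge whose per-item verdicts agree with humans far less often can still score and rank a set of exam-taker models in the same order, as long as its errors are spread roughly evenly across models.

```python
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
n_items = 5000  # toy number of benchmark questions

# Hypothetical exam-taker models with different true accuracies.
true_accuracy = {"model_small": 0.45, "model_medium": 0.60, "model_large": 0.75}
# Human verdicts: True if the exam-taker's answer is correct, else False.
human = {m: rng.random(n_items) < acc for m, acc in true_accuracy.items()}

def simulate_judge(human_verdicts, flip_rate):
    """A judge that copies the human verdict but flips it with probability flip_rate."""
    flips = rng.random(len(human_verdicts)) < flip_rate
    return np.where(flips, ~human_verdicts, human_verdicts)

high_agreement = {m: simulate_judge(v, flip_rate=0.05) for m, v in human.items()}  # high per-item alignment
low_agreement  = {m: simulate_judge(v, flip_rate=0.30) for m, v in human.items()}  # much lower alignment

for name, judge in [("high_agreement", high_agreement), ("low_agreement", low_agreement)]:
    agreement = np.mean([np.mean(judge[m] == human[m]) for m in human])
    human_scores = [human[m].mean() for m in human]   # score = fraction judged correct
    judge_scores = [judge[m].mean() for m in human]
    rho, _ = spearmanr(human_scores, judge_scores)
    print(f"{name}: per-item human agreement = {agreement:.2f}, "
          f"rank correlation of exam-taker scores = {rho:.2f}")

# Both simulated judges recover the same ranking of the exam-taker models (rank correlation 1.0),
# even though their per-item alignment with humans differs by roughly 25 points.
```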

Summary

This study provides useful guidance for the future use of LLMs as judges by evaluating the alignment and vulnerabilities of LLMs acting in that role. Key findings are that only a few top models are fit to act as judges, and that Cohen's Kappa is a better alignment metric than percent agreement for distinguishing judges.