Background

  • Background Language models (LMs) now generate an increasingly diverse range of text and perform increasingly complex tasks, making quality evaluation challenging. The prevailing paradigm for scalable, low-cost evaluation uses LMs to assess the text generated by other LMs. However, existing open-source evaluator models have significant shortcomings: their scores often diverge substantially from human-assigned scores, and they lack the flexibility to operate in both direct assessment and pairwise ranking modes.

  • Existing Work Existing open evaluator LMs do not produce scoring decisions that correlate well with human judgments or closely mimic proprietary LMs like GPT-4, while relying on proprietary LMs instead raises transparency, controllability, and affordability concerns. Moreover, open evaluators are inflexible: they are typically trained to perform only direct assessment or only pairwise ranking, and only against general public preferences such as helpfulness and harmlessness, limiting their usefulness in real-world scenarios.

Core Contributions

  • Introduced PROMETHEUS 2
    • Challenge 1: Evaluation flexibility and accuracy Existing evaluator models usually cannot operate in both direct assessment and pairwise ranking formats, and they fail to evaluate effectively against criteria beyond generic qualities like helpfulness and harmlessness. PROMETHEUS 2 handles both formats and assesses more accurately against custom criteria (see the prompt-format sketch after this list).

    • Challenge 2: Performance gap with proprietary models Open evaluator models generally lag behind proprietary models like GPT-4. Trained with a weight-merging methodology, PROMETHEUS 2 achieves strong correlation and agreement with human judges in both direct assessment and pairwise ranking, as well as consistency with proprietary LM evaluations (a minimal merging sketch follows this list).

    • Additional Contributions Introduced the PREFERENCE COLLECTION, a pairwise ranking feedback dataset that includes 1,000 custom evaluation criteria.
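
As a concrete illustration of the two formats, below is a minimal Python sketch of what the two evaluation prompt layouts might look like. The template wording and field names are simplified assumptions, not the paper's exact prompts.

```python
# Minimal sketch of the two evaluation formats PROMETHEUS 2 supports.
# Template wording and field names are simplified assumptions, not the
# paper's exact prompts.

DIRECT_ASSESSMENT_TEMPLATE = """\
###Instruction: {instruction}
###Response: {response}
###Score Rubric: {rubric}
Write feedback, then give a score from 1 to 5."""

PAIRWISE_RANKING_TEMPLATE = """\
###Instruction: {instruction}
###Response A: {response_a}
###Response B: {response_b}
###Score Rubric: {rubric}
Write feedback, then choose the better response: A or B."""

def build_eval_prompt(mode: str, **fields: str) -> str:
    """Render an evaluation prompt in either supported format."""
    if mode == "direct":
        return DIRECT_ASSESSMENT_TEMPLATE.format(**fields)
    if mode == "pairwise":
        return PAIRWISE_RANKING_TEMPLATE.format(**fields)
    raise ValueError(f"unknown mode: {mode!r}")

# A custom rubric beyond generic helpfulness/harmlessness:
prompt = build_eval_prompt(
    "pairwise",
    instruction="Explain HTTP caching to a junior developer.",
    response_a="...",
    response_b="...",
    rubric="Is the explanation technically accurate and beginner-friendly?",
)
```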
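
Weight merging itself is straightforward to sketch. The snippet below linearly interpolates two PyTorch state dicts, assuming the paper's setup of one evaluator trained for direct assessment and another for pairwise ranking over an identical architecture; the function name and the 0.5 coefficient are illustrative choices, and the paper explores merging coefficients beyond this.

```python
import torch

def merge_weights(state_dict_a: dict, state_dict_b: dict,
                  alpha: float = 0.5) -> dict:
    """Linear weight merging: merged = alpha * A + (1 - alpha) * B.

    Assumes both evaluators share the same architecture, so their
    state dicts have identical keys and tensor shapes.
    """
    merged = {}
    for name, param_a in state_dict_a.items():
        param_b = state_dict_b[name]
        merged[name] = alpha * param_a + (1.0 - alpha) * param_b
    return merged

# Usage sketch: fold a direct-assessment evaluator and a pairwise-ranking
# evaluator into one unified model.
#   merged_sd = merge_weights(direct_model.state_dict(),
#                             pairwise_model.state_dict(), alpha=0.5)
#   unified_model.load_state_dict(merged_sd)
```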

Implementation and Deployment

PROMETHEUS 2 demonstrated the highest correlation and agreement with human evaluators and proprietary LM judges among open evaluator LMs, across four direct assessment benchmarks (e.g., Vicuna Bench, MT Bench) and four pairwise ranking benchmarks (e.g., HHH Alignment, MT Bench Human Judgment). Notably, its Pearson correlation with reference scores exceeded other baselines by 0.2 units across all datasets (a small sketch of this correlation measure follows), and it showed the highest agreement with human evaluations on the pairwise ranking benchmarks, cutting the performance gap with GPT-4 in half.
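
To make the reported metric concrete, the sketch below computes a Pearson correlation between an evaluator's scores and reference scores using SciPy. The score values are made-up examples, not benchmark data.

```python
from scipy.stats import pearsonr

# Made-up example scores; real benchmarks pair an evaluator's 1-5 ratings
# with reference ratings from human annotators or GPT-4.
evaluator_scores = [4, 2, 5, 3, 1, 4, 3, 5]
reference_scores = [5, 2, 4, 3, 1, 4, 2, 5]

r, p_value = pearsonr(evaluator_scores, reference_scores)
print(f"Pearson r = {r:.3f} (p = {p_value:.3g})")
```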

Summary

PROMETHEUS 2 is an open evaluator LM that operates in both direct assessment and pairwise ranking formats while correlating closely with human judgments and proprietary LM evaluations on custom criteria. Thanks to its weight-merging training, it outperforms other open evaluators and even some proprietary models.