Background

Background The paper investigates how Large Language Models (LLMs) resolve conflicts when facing competing pressures, focusing on Llama-2-chat models and the 'forbidden fact' task.
Existing Work Existing work has focused on describing a circuit with a distinct task, leaving the issue of how models arbitrate between conflicting goals unattended.

Core Contributions

Introduced a 'forbidden fact' task
- Challenge 1: Assessing model behavior under competing objectives By constructing a dataset and filtering for facts that Llama-2 can correctly answer, the study measures the impactful behavior of forbidding the correct response.
- Challenge 2: Decomposing Llama-2 to understand its suppression mechanism Llama-2 is broken down into 1000+ components, which are then ranked by importance via first-order patching.

Implementation and Deployment

The 'forbidden fact dataset' constructed for the study was used to test the model. The results show that on average, forbidding the correct answer decreases the odds of the correct response by over 1000 times compared to forbidding an incorrect one. Furthermore, by decomposing and ranking the 1057 components of Llama-2, the study finds that around 35 components are sufficient to implement the full suppression behavior. These components are quite heterogeneous, many of which operate on faulty heuristics. The authors also discovered that one of these heuristics could be exploited through a manually designed adversarial attack, dubbed "The California Attack." These findings emphasize some of the roadblocks in the way of successfully interpreting advanced ML systems.

Summary

The paper provides insights into how the Llama-2-chat model handles competing objectives through the study of its behavior in the 'forbidden fact' task, introducing novel analytical methods in the process.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

2312.08793.md

2312.08793.md

Background

Core Contributions

Implementation and Deployment

Summary

Files

2312.08793.md

Latest commit

History

2312.08793.md

File metadata and controls

Background

Core Contributions

Implementation and Deployment

Summary