-
Background The paper investigates how Large Language Models (LLMs) resolve conflicts when facing competing pressures, focusing on Llama-2-chat models and the 'forbidden fact' task.
-
Existing Work Existing work has focused on describing a circuit with a distinct task, leaving the issue of how models arbitrate between conflicting goals unattended.
- Introduced a 'forbidden fact' task
-
Challenge 1: Assessing model behavior under competing objectives By constructing a dataset and filtering for facts that Llama-2 can correctly answer, the study measures the impactful behavior of forbidding the correct response.
-
Challenge 2: Decomposing Llama-2 to understand its suppression mechanism Llama-2 is broken down into 1000+ components, which are then ranked by importance via first-order patching.
-
The 'forbidden fact dataset' constructed for the study was used to test the model. The results show that on average, forbidding the correct answer decreases the odds of the correct response by over 1000 times compared to forbidding an incorrect one. Furthermore, by decomposing and ranking the 1057 components of Llama-2, the study finds that around 35 components are sufficient to implement the full suppression behavior. These components are quite heterogeneous, many of which operate on faulty heuristics. The authors also discovered that one of these heuristics could be exploited through a manually designed adversarial attack, dubbed "The California Attack." These findings emphasize some of the roadblocks in the way of successfully interpreting advanced ML systems.
The paper provides insights into how the Llama-2-chat model handles competing objectives through the study of its behavior in the 'forbidden fact' task, introducing novel analytical methods in the process.