Forbidden Facts: An Investigation of Competing Objectives in Llama-2
Saved in:
Main authors:
Format: Article
Language: English
Subjects:
Online access: Order full text
Abstract: LLMs often face competing pressures (for example, helpfulness vs. harmlessness). To understand how models resolve such conflicts, we study Llama-2-chat models on the forbidden fact task. Specifically, we instruct Llama-2 to truthfully complete a factual recall statement while forbidding it from saying the correct answer. This often makes the model give incorrect answers. We decompose Llama-2 into 1000+ components and rank each one with respect to how useful it is for forbidding the correct answer. We find that, in aggregate, around 35 components are enough to reliably implement the full suppression behavior. However, these components are fairly heterogeneous and many operate using faulty heuristics. We discover that one of these heuristics can be exploited via a manually designed adversarial attack, which we call The California Attack. Our results highlight some roadblocks standing in the way of being able to successfully interpret advanced ML systems. Project website available at https://forbiddenfacts.github.io.
DOI: 10.48550/arxiv.2312.08793
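
As an illustration of the forbidden fact task described in the abstract, below is a minimal sketch of how one might prompt a Llama-2 chat model to complete a factual statement and compare its most likely answer with and without a forbidding instruction. It assumes the Hugging Face transformers library, the meta-llama/Llama-2-7b-chat-hf checkpoint, and example prompts chosen here for illustration; it is not the authors' exact setup or evaluation pipeline.

```python
# A minimal sketch of the forbidden fact task described in the abstract.
# Assumptions (not taken from the paper): the Hugging Face `transformers`
# library, the meta-llama/Llama-2-7b-chat-hf checkpoint, and example prompts
# invented here for illustration.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-chat-hf"  # assumed chat checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16, device_map="auto"
)

def top_next_token(system_prompt: str, statement: str) -> str:
    """Return the model's single most likely next token after the statement."""
    # Llama-2 chat prompt format: [INST] <<SYS>> ... <</SYS>> ... [/INST]
    prompt = f"[INST] <<SYS>>\n{system_prompt}\n<</SYS>>\n\n{statement} [/INST] "
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        logits = model(**inputs).logits[0, -1]  # logits for the next token
    return tokenizer.decode(logits.argmax().item())

statement = "The Eiffel Tower is in the city of"

# Baseline: the model is simply asked to complete the fact (expected: "Paris").
print(top_next_token("Complete the sentence truthfully with one word.", statement))

# Forbidden fact variant: the correct answer is forbidden, which per the
# abstract often makes the model give an incorrect completion instead.
print(top_next_token(
    "Complete the sentence truthfully with one word. "
    "You are not allowed to say the word 'Paris'.",
    statement,
))
```

Comparing the two completions token by token, rather than sampling full generations, keeps the comparison deterministic and mirrors the abstract's framing of suppression as the model being pushed away from the correct answer.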