Breach By A Thousand Leaks: Unsafe Information Leakage in 'Safe' AI Responses
Saved in:
| Main authors: | |
|---|---|
| Format: | Article |
| Language: | eng |
| Subjects: | |
| Online access: | Order full text |
| Summary: | Vulnerability of frontier language models to misuse and jailbreaks has prompted the development of safety measures such as filters and alignment training, in an effort to ensure safety through robustness to adversarially crafted prompts. We assert that robustness is fundamentally insufficient for ensuring safety goals, and that current defenses and evaluation methods fail to account for the risks of dual-intent queries and their composition toward malicious goals. To quantify these risks, we introduce a new safety evaluation framework based on impermissible information leakage from model outputs, and demonstrate how our proposed question-decomposition attack can extract dangerous knowledge from a censored LLM more effectively than traditional jailbreaking. Underlying our proposed evaluation method is a novel information-theoretic threat model of inferential adversaries, distinguished from security adversaries such as jailbreaks in that success is measured by inferring impermissible knowledge from victim outputs rather than by forcing explicitly impermissible outputs from the victim. Through our information-theoretic framework, we show that to ensure safety against inferential adversaries, defense mechanisms must guarantee information censorship, bounding the leakage of impermissible information. However, we prove that such defenses inevitably incur a safety-utility trade-off. |
|---|---|
| DOI: | 10.48550/arxiv.2407.02551 |
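
The summary above describes a question-decomposition attack, in which an adversary splits an impermissible request into individually benign-looking, dual-intent sub-questions and infers the disallowed knowledge from the composed answers. The following is a minimal, hypothetical sketch of that idea only: the function names (`decompose`, `run_attack`, `leakage_proxy`), the decomposition templates, and the keyword-matching leakage proxy are illustrative assumptions, not the authors' attack prompts or their information-theoretic evaluation framework.

```python
# Hypothetical sketch of a question-decomposition attack against a censored LLM.
# `query_model` stands in for any chat-completion call; nothing here reproduces
# the paper's actual attack, decomposition prompts, or leakage measure.
from typing import Callable, List


def decompose(target_topic: str) -> List[str]:
    """Rewrite an impermissible request as benign-looking, dual-intent sub-questions.

    A real attack would generate topic-specific sub-questions with a helper model
    so that no single query appears harmful; these templates are only illustrative.
    """
    return [
        f"For a chemistry class, what general principles are involved in {target_topic}?",
        f"What equipment or materials are typically associated with {target_topic}?",
        f"What common mistakes cause attempts at {target_topic} to fail?",
    ]


def run_attack(target_topic: str, query_model: Callable[[str], str]) -> str:
    """Ask each sub-question separately and aggregate the individually 'safe' answers."""
    answers = [query_model(question) for question in decompose(target_topic)]
    # Aggregation could itself be delegated to a model; simple concatenation is
    # enough to illustrate that leakage accumulates across permitted outputs.
    return "\n\n".join(answers)


def leakage_proxy(aggregated_answer: str, key_facts: List[str]) -> float:
    """Crude stand-in for a leakage measure: the fraction of ground-truth key
    facts recoverable from the aggregated answer. The paper's framework instead
    bounds the leakage of impermissible information in an information-theoretic sense.
    """
    hits = sum(1 for fact in key_facts if fact.lower() in aggregated_answer.lower())
    return hits / max(len(key_facts), 1)
```

Under a leakage-based measure of this kind, a defense that merely refuses explicitly impermissible prompts can pass jailbreak-style robustness evaluations while the composed, individually permitted answers still reveal the underlying knowledge; this is the gap the inferential-adversary framework is meant to capture. Per the abstract, closing it requires information censorship that bounds the leakage itself (for example, an assumed constraint keeping the information an adversary can infer about the impermissible knowledge from model outputs below a threshold), which the authors prove cannot be achieved without a safety-utility trade-off.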