"Not Aligned" is Not "Malicious": Being Careful about Hallucinations of Large Language Models' Jailbreak
"Jailbreak" is a major safety concern of Large Language Models (LLMs), which occurs when malicious prompts lead LLMs to produce harmful outputs, raising issues about the reliability and safety of LLMs. Therefore, an effective evaluation of jailbreaks is very crucial to develop its mitigati...
Gespeichert in:
Hauptverfasser: | , , , , , |
---|---|
Format: | Artikel |
Sprache: | eng |
Schlagworte: | |
Online-Zugang: | Volltext bestellen |
Tags: |
Tag hinzufügen
Keine Tags, Fügen Sie den ersten Tag hinzu!
|
Zusammenfassung: | "Jailbreak" is a major safety concern of Large Language Models (LLMs), which
occurs when malicious prompts lead LLMs to produce harmful outputs, raising
issues about the reliability and safety of LLMs. Therefore, an effective
evaluation of jailbreaks is very crucial to develop its mitigation strategies.
However, our research reveals that many jailbreaks identified by current
evaluations may actually be hallucinations-erroneous outputs that are mistaken
for genuine safety breaches. This finding suggests that some perceived
vulnerabilities might not represent actual threats, indicating a need for more
precise red teaming benchmarks. To address this problem, we propose the
$\textbf{B}$enchmark for reli$\textbf{AB}$ilit$\textbf{Y}$ and
jail$\textbf{B}$reak ha$\textbf{L}$l$\textbf{U}$cination $\textbf{E}$valuation
(BabyBLUE). BabyBLUE introduces a specialized validation framework including
various evaluators to enhance existing jailbreak benchmarks, ensuring outputs
are useful malicious instructions. Additionally, BabyBLUE presents a new
dataset as an augmentation to the existing red teaming benchmarks, specifically
addressing hallucinations in jailbreaks, aiming to evaluate the true potential
of jailbroken LLM outputs to cause harm to human society. |
---|---|
DOI: | 10.48550/arxiv.2406.11668 |