BadChain: Backdoor Chain-of-Thought Prompting for Large Language Models
Large language models (LLMs) are shown to benefit from chain-of-thought (COT) prompting, particularly when tackling tasks that require systematic reasoning processes. On the other hand, COT prompting also poses new vulnerabilities in the form of backdoor attacks, wherein the model will output uninte...
Gespeichert in:
Hauptverfasser: | , , , , , |
---|---|
Format: | Artikel |
Sprache: | eng |
Schlagworte: | |
Online-Zugang: | Volltext bestellen |
Tags: |
Tag hinzufügen
Keine Tags, Fügen Sie den ersten Tag hinzu!
|
Zusammenfassung: | Large language models (LLMs) are shown to benefit from chain-of-thought (COT)
prompting, particularly when tackling tasks that require systematic reasoning
processes. On the other hand, COT prompting also poses new vulnerabilities in
the form of backdoor attacks, wherein the model will output unintended
malicious content under specific backdoor-triggered conditions during
inference. Traditional methods for launching backdoor attacks involve either
contaminating the training dataset with backdoored instances or directly
manipulating the model parameters during deployment. However, these approaches
are not practical for commercial LLMs that typically operate via API access. In
this paper, we propose BadChain, the first backdoor attack against LLMs
employing COT prompting, which does not require access to the training dataset
or model parameters and imposes low computational overhead. BadChain leverages
the inherent reasoning capabilities of LLMs by inserting a backdoor reasoning
step into the sequence of reasoning steps of the model output, thereby altering
the final response when a backdoor trigger exists in the query prompt.
Empirically, we show the effectiveness of BadChain for two COT strategies
across four LLMs (Llama2, GPT-3.5, PaLM2, and GPT-4) and six complex benchmark
tasks encompassing arithmetic, commonsense, and symbolic reasoning. Moreover,
we show that LLMs endowed with stronger reasoning capabilities exhibit higher
susceptibility to BadChain, exemplified by a high average attack success rate
of 97.0% across the six benchmark tasks on GPT-4. Finally, we propose two
defenses based on shuffling and demonstrate their overall ineffectiveness
against BadChain. Therefore, BadChain remains a severe threat to LLMs,
underscoring the urgency for the development of robust and effective future
defenses. |
---|---|
DOI: | 10.48550/arxiv.2401.12242 |