Two Heads are Better than One: Nested PoE for Robust Defense Against Multi-Backdoors
Saved in:
Main Author(s): | , , |
---|---|
Format: | Article |
Language: | eng |
Subjects: | |
Online Access: | Order full text |
Abstract: | Data poisoning backdoor attacks can cause undesirable behaviors in large language models (LLMs), and defending against them is of increasing importance. Existing defense mechanisms often assume that only one type of trigger is adopted by the attacker, while defending against multiple simultaneous and independent trigger types necessitates general defense frameworks and is relatively unexplored. In this paper, we propose the Nested Product of Experts (NPoE) defense framework, which uses a mixture of experts (MoE) as a trigger-only ensemble within the PoE defense framework to simultaneously defend against multiple trigger types. During NPoE training, the main model is trained in an ensemble with a mixture of smaller expert models that learn the features of backdoor triggers. At inference time, only the main model is used. Experimental results on sentiment analysis, hate speech detection, and question classification tasks demonstrate that NPoE effectively defends against a variety of triggers, both separately and in trigger mixtures. Owing to the versatility of the MoE structure, the framework can be further extended to defend against other attack settings. |
---|---|
DOI: | 10.48550/arxiv.2404.02356 |
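
The abstract describes the training scheme only at a high level. The sketch below is a minimal, illustrative PyTorch rendering of one plausible setup: a gated mixture of small trigger-only experts combined with the main model in log space (the product of experts), with only the main model's logits used at inference. All names (`NPoEEnsemble`, `gate`, `training_step`) and the gating choice are assumptions for illustration, not the paper's actual implementation.

```python
# Hypothetical sketch of PoE training with a mixture of trigger-only experts.
# Assumes main_model and each expert map an input batch x to class logits.
import torch
import torch.nn as nn
import torch.nn.functional as F

class NPoEEnsemble(nn.Module):
    def __init__(self, main_model: nn.Module, experts: list[nn.Module], num_classes: int):
        super().__init__()
        self.main_model = main_model           # full-capacity model kept at inference
        self.experts = nn.ModuleList(experts)  # small models meant to absorb trigger features
        # Hypothetical gate: mixture weights over experts, conditioned on main-model logits.
        self.gate = nn.Linear(num_classes, len(experts))

    def forward(self, x):
        main_logits = self.main_model(x)                                   # (B, C)
        expert_logits = torch.stack([e(x) for e in self.experts], dim=1)   # (B, E, C)
        # Mixture of experts: softmax-weighted combination of expert predictions.
        weights = F.softmax(self.gate(main_logits.detach()), dim=-1)       # (B, E)
        mixed_expert = (weights.unsqueeze(-1) * expert_logits).sum(dim=1)  # (B, C)
        # Product of experts in log space: summing log-probabilities multiplies the
        # two distributions, letting the experts "explain away" trigger shortcuts.
        poe_logits = F.log_softmax(main_logits, dim=-1) + F.log_softmax(mixed_expert, dim=-1)
        return poe_logits, main_logits

def training_step(model: NPoEEnsemble, x, y):
    # Backpropagate the task loss through the product during training; at inference
    # time, use main_logits alone, as the abstract describes.
    poe_logits, _ = model(x)
    return F.cross_entropy(poe_logits, y)
```

The design intent under these assumptions: because the experts are low-capacity, they latch onto shallow trigger features, and the log-space product reduces the gradient pressure on the main model to learn those same shortcuts, so discarding the experts at inference leaves a model less reliant on backdoor triggers.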