Cross-modality Information Check for Detecting Jailbreaking in Multimodal Large Language Models
Saved in:
Main authors: , , ,
Format: Article
Language: English
Subjects:
Online access: Order full text
Abstract: Multimodal Large Language Models (MLLMs) extend the capacity of LLMs to understand multimodal information comprehensively, achieving remarkable performance in many vision-centric tasks. Nevertheless, recent studies have shown that these models are susceptible to jailbreak attacks, an exploitative technique in which malicious users break the safety alignment of the target model and elicit misleading and harmful answers. This threat stems both from the inherent vulnerabilities of LLMs and from the enlarged attack surface introduced by visual input. To strengthen MLLMs against jailbreak attacks, researchers have developed various defense techniques. However, these methods either require modifications to the model's internal structure or demand significant computational resources during the inference phase. Multimodal information is a double-edged sword: while it increases the risk of attacks, it also provides additional data that can enhance safeguards. Inspired by this, we propose the Cross-modality Information DEtectoR (CIDER), a plug-and-play jailbreaking detector designed to identify maliciously perturbed image inputs by utilizing the cross-modal similarity between harmful queries and adversarial images. CIDER is independent of the target MLLM and incurs low computational cost. Extensive experimental results demonstrate the effectiveness and efficiency of CIDER, as well as its transferability to both white-box and black-box MLLMs.
DOI: 10.48550/arxiv.2407.21659
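The abstract describes CIDER as a plug-and-play check that scores the cross-modal similarity between a text query and its accompanying image before the input reaches the target MLLM. The sketch below only illustrates that similarity computation with an off-the-shelf CLIP encoder from Hugging Face Transformers; the checkpoint name, the `screen_input` helper, and the threshold value are assumptions for illustration, not the detector or decision rule specified in the paper.

```python
# Minimal sketch of a cross-modal similarity check between a text query and
# an image, in the spirit of what the CIDER abstract describes. The CLIP
# checkpoint, screen_input, and THRESHOLD are illustrative assumptions,
# not the paper's actual method.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

MODEL_NAME = "openai/clip-vit-base-patch32"  # assumed off-the-shelf encoder
model = CLIPModel.from_pretrained(MODEL_NAME).eval()
processor = CLIPProcessor.from_pretrained(MODEL_NAME)

def cross_modal_similarity(query: str, image: Image.Image) -> float:
    """Cosine similarity between the text-query and image embeddings."""
    inputs = processor(text=[query], images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    text_emb = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    image_emb = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    return (text_emb @ image_emb.T).item()

THRESHOLD = 0.30  # hypothetical cutoff; CIDER's actual criterion is in the paper

def screen_input(query: str, image: Image.Image) -> bool:
    """Return True if the image should be flagged before it reaches the MLLM."""
    return cross_modal_similarity(query, image) > THRESHOLD
```

In a deployment along the lines the abstract sketches, such a check would run in front of the target MLLM and reject or sanitize flagged image inputs without modifying the model's internals, which is consistent with the paper's claim of independence from the target MLLM.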