Boosting Transferability in Vision-Language Attacks via Diversification along the Intersection Region of Adversarial Trajectory
Main authors:
Format: Article
Language: English
Subjects:
Online access: Order full text
Abstract: Vision-language pre-training (VLP) models exhibit remarkable capabilities in comprehending both images and text, yet they remain susceptible to multimodal adversarial examples (AEs). Strengthening attacks and uncovering vulnerabilities, especially common issues in VLP models (e.g., highly transferable AEs), can advance reliable and practical VLP models. A recent work (the set-level guidance attack) shows that augmenting image-text pairs to increase AE diversity along the optimization path significantly enhances the transferability of adversarial examples. However, this approach predominantly emphasizes diversity around the online adversarial examples (i.e., the AEs produced during optimization), which risks overfitting to the victim model and limits transferability. In this study, we posit that diversity of adversarial examples around both the clean input and the online AEs is pivotal for enhancing transferability across VLP models. Consequently, we propose diversification along the intersection region of the adversarial trajectory to expand the diversity of AEs. To fully leverage the interaction between modalities, we introduce text-guided adversarial example selection during optimization. Furthermore, to mitigate potential overfitting, we direct the adversarial text to deviate from the last intersection region along the optimization path, rather than the adversarial images as in existing methods. Extensive experiments affirm the effectiveness of our method in improving transferability across various VLP models and downstream vision-and-language tasks.
DOI: 10.48550/arxiv.2403.12445
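
To make the abstract's core idea more concrete, below is a minimal sketch (not the authors' released code) of one way the trajectory diversification and text-guided selection could look in PyTorch. It samples candidate AEs from the region spanned by the clean image and the adversarial examples along the optimization trajectory, then selects a candidate by its image-text similarity under a surrogate image encoder. The function names, the placeholder encoder, and the selection criterion are illustrative assumptions, not the paper's exact algorithm.

```python
# A minimal sketch, assuming a PyTorch surrogate image encoder. It illustrates
# one possible reading of the abstract: sample candidate AEs from the region
# spanned by the clean image and the AEs along the optimization trajectory,
# then keep the candidate a text-guided criterion prefers. Names such as
# `sample_intersection_region`, `text_guided_select`, and `image_encoder`
# are illustrative, not the authors' released API.
import torch
import torch.nn.functional as F


def sample_intersection_region(x_clean, x_prev_adv, x_curr_adv, n_samples=8):
    """Draw random convex combinations of the clean input and on-trajectory AEs."""
    candidates = []
    for _ in range(n_samples):
        w = torch.rand(3)
        w = w / w.sum()  # convex weights over {clean, previous AE, current AE}
        candidates.append(w[0] * x_clean + w[1] * x_prev_adv + w[2] * x_curr_adv)
    return torch.stack(candidates)  # (n_samples, C, H, W)


def text_guided_select(candidates, text_embed, image_encoder):
    """Pick the candidate whose embedding is least aligned with the paired text.

    Minimizing image-text similarity is one plausible untargeted objective for
    attacking image-text matching; the paper's exact selection rule may differ.
    """
    with torch.no_grad():
        img_embeds = F.normalize(image_encoder(candidates), dim=-1)  # (N, D)
        txt_embed = F.normalize(text_embed, dim=-1)                  # (D,)
        sims = img_embeds @ txt_embed                                # cosine similarities
    return candidates[sims.argmin()]


if __name__ == "__main__":
    # Toy usage with random tensors and a placeholder "encoder".
    x = torch.rand(3, 224, 224)
    x_prev = (x + 0.01 * torch.randn_like(x)).clamp(0, 1)
    x_curr = (x + 0.02 * torch.randn_like(x)).clamp(0, 1)
    encoder = lambda imgs: imgs.flatten(1)[:, :8]  # stand-in for a VLP image encoder
    cands = sample_intersection_region(x, x_prev, x_curr)
    picked = text_guided_select(cands, torch.rand(8), encoder)
    print(picked.shape)  # torch.Size([3, 224, 224])
```

In a full attack loop, the selected candidate would replace (or regularize) the current adversarial image before the next gradient step, and the adversarial text would be updated to move away from the most recent intersection region, as the abstract describes.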