Dissecting Adversarial Robustness of Multimodal LM Agents
Format: Article
Language: English
Abstract: As language models (LMs) are used to build autonomous agents in real environments, ensuring their adversarial robustness becomes a critical challenge. Unlike chatbots, agents are compound systems with multiple components, which existing LM safety evaluations do not adequately address. To bridge this gap, we manually create 200 targeted adversarial tasks and evaluation functions in a realistic threat model on top of VisualWebArena, a real environment for web-based agents. To systematically examine the robustness of various multimodal agents, we propose the Agent Robustness Evaluation (ARE) framework. ARE views the agent as a graph showing the flow of intermediate outputs between components and decomposes robustness as the flow of adversarial information on the graph. First, we find that we can successfully break a range of the latest agents that use black-box frontier LLMs, including those that perform reflection and tree search. With imperceptible perturbations to a single product image (less than 5% of total web page pixels), an attacker can hijack these agents to execute targeted adversarial goals with success rates of up to 67%. We also use ARE to rigorously evaluate how robustness changes as new components are added. We find that new components that typically improve benign performance can open up new vulnerabilities and harm robustness: an attacker can compromise the evaluator used by the reflexion agent and the value function of the tree-search agent, increasing the relative attack success rate by 15% and 20%, respectively. Our data and code for attacks, defenses, and evaluation are available at https://github.com/ChenWu98/agent-attack
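To make the "agent as a graph" view concrete, the sketch below is a minimal, assumption-laden illustration of the idea described in the abstract: components are nodes, intermediate outputs are edges, and robustness questions become reachability questions about where adversarial information can flow. The component names ("captioner", "policy", "evaluator") and the API are hypothetical and are not taken from the released code at the repository above.

```python
# Illustrative sketch only (not the paper's implementation): model an agent as a
# directed graph of components and trace how adversarial information injected at
# one node (e.g., a perturbed product image) can propagate to the final action.
from collections import defaultdict, deque


class AgentGraph:
    def __init__(self):
        # component -> list of downstream components that consume its output
        self.edges = defaultdict(list)

    def add_flow(self, src, dst):
        """Record that intermediate outputs of `src` are consumed by `dst`."""
        self.edges[src].append(dst)

    def reachable_from(self, compromised):
        """Return all components that can receive information from `compromised`."""
        seen, queue = {compromised}, deque([compromised])
        while queue:
            node = queue.popleft()
            for nxt in self.edges[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        return seen


if __name__ == "__main__":
    g = AgentGraph()
    # A reflexion-style agent: a captioner describes the (possibly perturbed) image,
    # the policy LM proposes an action, and an evaluator critiques and feeds back.
    g.add_flow("image", "captioner")
    g.add_flow("captioner", "policy")
    g.add_flow("policy", "evaluator")
    g.add_flow("evaluator", "policy")  # the feedback loop adds another attack path
    g.add_flow("policy", "action")
    # Adversarial pixels can reach the final action through every listed component.
    print(g.reachable_from("image"))
```

In this toy model, adding a new component (here, the evaluator loop) enlarges the set of nodes reachable from the perturbed input, which mirrors the abstract's observation that components added to improve benign performance can also open new attack paths.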
DOI: 10.48550/arxiv.2406.12814