MMInA: Benchmarking Multihop Multimodal Internet Agents
Main authors: | , , , |
---|---|
Format: | Article |
Language: | eng |
Subjects: | |
Online access: | Order full text |
Summary: | Autonomous embodied agents live on an Internet of multimedia websites. Can
they hop around multimodal websites to complete complex user tasks? Existing
benchmarks fail to assess them in a realistic, evolving environment for their
embodiment across websites. To answer this question, we present MMInA, a
multihop and multimodal benchmark to evaluate the embodied agents for
compositional Internet tasks, with several appealing properties: 1) Evolving
real-world multimodal websites. Our benchmark uniquely operates on evolving
real-world websites, ensuring a high degree of realism and applicability to
natural user tasks. Our data includes 1,050 human-written tasks covering
various domains such as shopping and travel, with each task requiring the agent
to autonomously extract multimodal information from web pages as observations;
2) Multihop web browsing. Our dataset features naturally compositional tasks
that require information from or actions on multiple websites to solve, to
assess long-range reasoning capabilities on web tasks; 3) Holistic evaluation.
We propose a novel protocol for evaluating an agent's progress in completing
multihop tasks. We experiment with both standalone (multimodal) language models
and heuristic-based web agents. Extensive experiments demonstrate that while
long-chain multihop web tasks are easy for humans, they remain challenging for
state-of-the-art web agents. We identify that agents are more likely to fail on
the early hops when solving tasks of more hops, which results in lower task
success rates. To address this issue, we propose a simple memory augmentation
approach that replays past action trajectories for reflection. Our method
significantly improved both the single-hop and multihop web browsing abilities
of agents. See our code and data at https://mmina.cliangyu.com |
---|---|
DOI: | 10.48550/arxiv.2404.09992 |