Multimodal Web Navigation with Instruction-Finetuned Foundation Models
Saved in:

Main Authors:
Format: Article
Language: eng
Subjects:
Online Access: Order full text
Summary: The progress of autonomous web navigation has been hindered by the dependence on billions of exploratory interactions via online reinforcement learning, and by domain-specific model designs that make it difficult to leverage generalization from rich out-of-domain data. In this work, we study data-driven offline training for web agents with vision-language foundation models. We propose an instruction-following multimodal agent, WebGUM, that observes both webpage screenshots and HTML pages and outputs web navigation actions, such as click and type. WebGUM is trained by jointly finetuning an instruction-finetuned language model and a vision encoder with temporal and local perception on a large corpus of demonstrations. We empirically demonstrate that this recipe improves the agent's grounded multimodal perception, HTML comprehension, and multi-step reasoning, outperforming prior work by a significant margin. On MiniWoB, we improve over the previous best offline methods by more than 45.8%, even outperforming the online-finetuned SoTA, humans, and a GPT-4-based agent. On the WebShop benchmark, our 3-billion-parameter model achieves superior performance to the existing SoTA, PaLM-540B. Furthermore, WebGUM exhibits strong positive transfer to real-world planning tasks on Mind2Web. We also collect 347K high-quality demonstrations using our trained models, 38 times larger than prior work, and make them available to promote future research in this direction.
DOI: 10.48550/arxiv.2305.11854
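
The summary describes the recipe only at a high level: a vision encoder and an instruction-finetuned language model are jointly finetuned on offline demonstrations to map webpage screenshots plus HTML to textual actions such as click and type. The sketch below is a minimal, self-contained PyTorch illustration of that kind of architecture and its behavioral-cloning update; the toy module sizes, tokenization, and action format are assumptions for illustration, not the authors' implementation.

```python
# Illustrative sketch only: screenshot patches + HTML tokens are fused and a
# decoder is trained by cross-entropy (behavioral cloning) to emit action text
# such as "click link=...". All sizes and the action encoding are assumptions.
import torch
import torch.nn as nn

class ToyVisionEncoder(nn.Module):
    """ViT-like encoder: splits a screenshot into patches and embeds them."""
    def __init__(self, patch=16, dim=256):
        super().__init__()
        self.proj = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)

    def forward(self, images):                       # (B, 3, H, W)
        x = self.proj(images)                        # (B, dim, H/16, W/16)
        return x.flatten(2).transpose(1, 2)          # (B, n_patches, dim)

class ToyWebAgent(nn.Module):
    """Fuses screenshot patches with HTML token embeddings, decodes action tokens."""
    def __init__(self, vocab=1000, dim=256):
        super().__init__()
        self.vision = ToyVisionEncoder(dim=dim)
        self.html_embed = nn.Embedding(vocab, dim)
        self.action_embed = nn.Embedding(vocab, dim)
        layer = nn.TransformerDecoderLayer(dim, nhead=4, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)
        self.lm_head = nn.Linear(dim, vocab)

    def forward(self, images, html_ids, action_ids):
        # Multimodal context: image patch embeddings concatenated with HTML tokens.
        memory = torch.cat([self.vision(images), self.html_embed(html_ids)], dim=1)
        tgt = self.action_embed(action_ids)
        T = tgt.size(1)
        causal = torch.triu(                          # standard causal mask
            torch.full((T, T), float("-inf"), device=tgt.device), diagonal=1)
        h = self.decoder(tgt, memory, tgt_mask=causal)
        return self.lm_head(h)                        # (B, T, vocab)

# One behavioral-cloning step on a (fake) batch of offline demonstrations.
model = ToyWebAgent()
images = torch.randn(2, 3, 224, 224)                 # webpage screenshots
html_ids = torch.randint(0, 1000, (2, 64))           # tokenized HTML
action_ids = torch.randint(0, 1000, (2, 8))          # tokenized target action
logits = model(images, html_ids, action_ids[:, :-1]) # teacher forcing
loss = nn.functional.cross_entropy(
    logits.reshape(-1, logits.size(-1)), action_ids[:, 1:].reshape(-1))
loss.backward()
```

Framing the action space as free-form text, as in this sketch, is what lets an instruction-finetuned language model be reused directly for web navigation; the offline cross-entropy objective stands in for the billions of online interactions that prior reinforcement-learning approaches required.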