NaVid: Video-based VLM Plans the Next Step for Vision-and-Language Navigation
Saved in:
Main authors:
Format: Article
Language: English
Subjects:
Online access: Order full text
Abstract: Vision-and-language navigation (VLN) stands as a key research problem of Embodied AI, aiming at enabling agents to navigate in unseen environments following linguistic instructions. In this field, generalization is a long-standing challenge, whether to out-of-distribution scenes or from simulation to the real world (Sim2Real). In this paper, we propose NaVid, a video-based large vision-language model (VLM), to mitigate this generalization gap. NaVid makes the first endeavor to showcase the capability of VLMs to achieve state-of-the-art navigation performance without any maps, odometers, or depth inputs. Given a human instruction, NaVid requires only an on-the-fly video stream from a monocular RGB camera mounted on the robot to output the next-step action. Our formulation mimics how humans navigate and naturally sidesteps the problems introduced by odometer noise and the Sim2Real gaps that arise from map or depth inputs. Moreover, our video-based approach can effectively encode the robot's historical observations as spatio-temporal context for decision making and instruction following. We train NaVid on 510k navigation samples collected from continuous environments, including action-planning and instruction-reasoning samples, along with 763k large-scale web data samples. Extensive experiments show that NaVid achieves state-of-the-art performance in both simulated environments and the real world, demonstrating superior cross-dataset and Sim2Real transfer. We thus believe our proposed VLM approach plans the next step not only for navigation agents but also for this research field.
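The abstract describes a map-free, odometry-free pipeline: at each step, the accumulated monocular RGB video and the instruction are fed to a video-based VLM, which returns the next discrete action. The following Python sketch is a hypothetical illustration of that loop only; the class names, the action vocabulary, and the `predict` interface are assumptions for readability, not NaVid's actual implementation.

```python
"""Minimal, hypothetical sketch of the navigation loop the abstract describes:
a monocular RGB video stream plus a language instruction goes into a
video-based VLM, which outputs one next-step action. Not the authors' API."""

from dataclasses import dataclass, field
from typing import Any, List

ACTIONS = ["FORWARD", "TURN_LEFT", "TURN_RIGHT", "STOP"]  # assumed action space


class StubVideoVLM:
    """Stand-in for a video-based VLM; a real agent would query a model here."""

    def predict(self, video: List[Any], text: str) -> str:
        # Placeholder policy: walk forward a few frames, then stop.
        return "FORWARD" if len(video) < 5 else "STOP"


@dataclass
class NavigationAgent:
    model: StubVideoVLM
    instruction: str
    history: List[Any] = field(default_factory=list)  # spatio-temporal context

    def step(self, rgb_frame: Any) -> str:
        # Keep the full observation history: the paper encodes all past RGB
        # frames instead of relying on maps, odometry, or depth inputs.
        self.history.append(rgb_frame)
        action = self.model.predict(video=self.history, text=self.instruction)
        assert action in ACTIONS
        return action


if __name__ == "__main__":
    agent = NavigationAgent(StubVideoVLM(), "go to the kitchen and stop")
    for frame in range(6):  # frames would come from the robot's RGB camera
        print(agent.step(frame))
```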
DOI: 10.48550/arxiv.2402.15852