Look Before You Leap: Unveiling the Power of GPT-4V in Robotic Vision-Language Planning
Format: Article
Language: English
Online access: Order full text
Abstract: In this study, we are interested in imbuing robots with the capability of
physically-grounded task planning. Recent advancements have shown that large
language models (LLMs) possess extensive knowledge useful in robotic tasks,
especially in reasoning and planning. However, LLMs are constrained by their
lack of world grounding and dependence on external affordance models to
perceive environmental information, which cannot jointly reason with LLMs. We
argue that a task planner should be an inherently grounded, unified multimodal
system. To this end, we introduce Robotic Vision-Language Planning (ViLa), a
novel approach for long-horizon robotic planning that leverages vision-language
models (VLMs) to generate a sequence of actionable steps. ViLa directly
integrates perceptual data into its reasoning and planning process, enabling a
profound understanding of commonsense knowledge in the visual world, including
spatial layouts and object attributes. It also supports flexible multimodal
goal specification and naturally incorporates visual feedback. Our extensive
evaluation, conducted in both real-robot and simulated environments,
demonstrates ViLa's superiority over existing LLM-based planners, highlighting
its effectiveness in a wide array of open-world manipulation tasks.
DOI: 10.48550/arxiv.2311.17842
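The abstract describes a closed-loop procedure: a vision-language model repeatedly receives the current camera image, the task goal, and the steps executed so far, proposes the next actionable step, and the robot executes it before re-observing the scene. The sketch below illustrates that loop in Python under stated assumptions; it is not the authors' implementation, and `capture_image`, `query_vlm`, and `execute_primitive` are hypothetical placeholders for a camera interface, a GPT-4V-style model call, and a low-level skill layer.

```python
# Minimal sketch of a closed-loop, VLM-driven planning loop in the spirit of
# the abstract above. Every function here is a hypothetical placeholder, not
# the authors' code and not a real library API.

def capture_image():
    """Placeholder: return the current RGB observation from the robot's camera."""
    raise NotImplementedError("connect your camera here")

def query_vlm(image, prompt):
    """Placeholder: send the image and prompt to a GPT-4V-style VLM, return its text reply."""
    raise NotImplementedError("call your VLM API here")

def execute_primitive(step):
    """Placeholder: dispatch a natural-language step to a low-level skill (pick, place, ...)."""
    raise NotImplementedError("hook up your motion primitives here")

def plan_and_act(goal, max_steps=20):
    """Closed-loop planning: the VLM sees the latest image at every step,
    so spatial layout and object attributes ground each decision."""
    history = []
    for _ in range(max_steps):
        image = capture_image()
        prompt = (
            "You are a robot task planner.\n"
            f"Goal: {goal}\n"
            f"Steps already executed: {history}\n"
            "Given the image, reply with the single next actionable step, "
            "or 'done' if the goal is already satisfied."
        )
        step = query_vlm(image=image, prompt=prompt).strip()
        if step.lower() == "done":
            break
        execute_primitive(step)
        history.append(step)
    return history
```

Re-querying the model with a fresh image after every executed step is what the abstract calls incorporating visual feedback: failed or partially completed actions show up in the next observation and the plan adapts accordingly.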