CoPa: General Robotic Manipulation through Spatial Constraints of Parts with Foundation Models
Format: Article
Language: English
Abstract: Foundation models pre-trained on web-scale data are shown to encapsulate
extensive world knowledge beneficial for robotic manipulation in the form of
task planning. However, the actual physical implementation of these plans often
relies on task-specific learning methods, which require significant data
collection and struggle with generalizability. In this work, we introduce
Robotic Manipulation through Spatial Constraints of Parts (CoPa), a novel
framework that leverages the common sense knowledge embedded within foundation
models to generate a sequence of 6-DoF end-effector poses for open-world
robotic manipulation. Specifically, we decompose the manipulation process into
two phases: task-oriented grasping and task-aware motion planning. In the
task-oriented grasping phase, we employ foundation vision-language models
(VLMs) to select the object's grasping part through a novel coarse-to-fine
grounding mechanism. During the task-aware motion planning phase, VLMs are
utilized again to identify the spatial geometry constraints of task-relevant
object parts, which are then used to derive post-grasp poses. We also
demonstrate how CoPa can be seamlessly integrated with existing robotic
planning algorithms to accomplish complex, long-horizon tasks. Our
comprehensive real-world experiments show that CoPa possesses a fine-grained
physical understanding of scenes and can handle open-set instructions and
objects with minimal prompt engineering and without additional training.
Project page: https://copa-2024.github.io/
DOI: 10.48550/arxiv.2403.08248
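
To make the two-phase decomposition described in the abstract concrete, here is a minimal sketch in Python: phase 1 selects a grasp part (where CoPa would query a foundation VLM with coarse-to-fine grounding), and phase 2 derives spatial constraints of task-relevant parts and turns them into post-grasp poses. All names (`select_grasp_part`, `identify_spatial_constraints`, `solve_post_grasp_poses`, `Pose6DoF`) and the hard-coded return values are hypothetical placeholders for illustration, not the authors' code or API.

```python
# A minimal sketch of the two-phase pipeline the abstract describes.
# Every function below is a stub standing in for a VLM query or a solver.
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Pose6DoF:
    """A 6-DoF end-effector pose: position (x, y, z) and orientation (roll, pitch, yaw)."""
    position: Tuple[float, float, float]
    orientation: Tuple[float, float, float]


def select_grasp_part(image, instruction: str) -> str:
    """Phase 1 (task-oriented grasping): a VLM would pick the part to grasp
    via coarse-to-fine grounding. Stubbed with a fixed answer."""
    return "kettle handle"


def identify_spatial_constraints(image, instruction: str, grasp_part: str) -> List[str]:
    """Phase 2 (task-aware motion planning): a VLM would state geometric
    constraints between task-relevant object parts. Stubbed with fixed answers."""
    return ["kettle spout above mug opening", "kettle tilted toward the mug"]


def solve_post_grasp_poses(constraints: List[str]) -> List[Pose6DoF]:
    """Turn the constraints into post-grasp poses (the real system would run
    an optimization/planning step here). Stubbed with one placeholder pose."""
    return [Pose6DoF(position=(0.45, 0.10, 0.30), orientation=(0.0, 0.6, 0.0))]


def copa_pipeline(image, instruction: str) -> List[Pose6DoF]:
    """End-to-end sketch: a grasp pose followed by constraint-derived poses."""
    grasp_part = select_grasp_part(image, instruction)           # phase 1
    grasp_pose = Pose6DoF((0.40, 0.05, 0.15), (0.0, 1.57, 0.0))  # placeholder grasp on that part
    constraints = identify_spatial_constraints(image, instruction, grasp_part)
    return [grasp_pose] + solve_post_grasp_poses(constraints)    # phase 2


if __name__ == "__main__":
    for pose in copa_pipeline(image=None, instruction="pour water from the kettle into the mug"):
        print(pose)
```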