SAGE: Bridging Semantic and Actionable Parts for GEneralizable Manipulation of Articulated Objects
Format: Article
Language: English
Online Access: Order full text
Abstract:
To interact with daily-life articulated objects of diverse structures and functionalities, understanding the object parts plays a central role in both user instruction comprehension and task execution. However, the possible discordance between the semantic meaning and physical functionality of the parts poses a challenge for designing a general system. To address this problem, we propose SAGE, a novel framework that bridges semantic and actionable parts of articulated objects to achieve generalizable manipulation under natural language instructions. More concretely, given an articulated object, we first observe all the semantic parts on it, conditioned on which an instruction interpreter proposes possible action programs that concretize the natural language instruction. Then, a part-grounding module maps the semantic parts into so-called Generalizable Actionable Parts (GAParts), which inherently carry information about part motion. End-effector trajectories are predicted on the GAParts, which, together with the action program, form an executable policy. Additionally, an interactive feedback module is incorporated to respond to failures, which closes the loop and increases the robustness of the overall framework. Key to the success of our framework is the joint proposal and knowledge fusion between a large vision-language model (VLM) and a small domain-specific model for both context comprehension and part perception, with the former providing general intuitions and the latter serving as expert facts. Both simulation and real-robot experiments show the effectiveness of our framework in handling a large variety of articulated objects with diverse language-instructed goals.
DOI: 10.48550/arxiv.2312.01307
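The abstract describes a multi-stage pipeline (semantic part perception, instruction interpretation into action programs, grounding to GAParts, trajectory prediction, execution with failure feedback). Below is a minimal, hypothetical Python sketch of that flow. Every name in it (run_sage, GAPart, perceiver.detect_parts, and so on) is an illustrative assumption, not the authors' released code or API; the components are passed in as placeholders to show how the stages connect.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class GAPart:
    """Hypothetical stand-in for a Generalizable Actionable Part:
    geometry plus the motion information it inherently carries."""
    name: str
    motion_type: str                      # e.g. "revolute" or "prismatic" (assumed labels)
    pose: List[float] = field(default_factory=list)


def run_sage(observation, instruction, perceiver, interpreter,
             grounder, trajectory_predictor, executor, max_retries=3):
    """One interaction episode, closing the loop via failure feedback.
    All injected components are placeholders for the modules named in the abstract."""
    # 1. Observe all semantic parts on the articulated object
    #    (joint proposal of a large VLM and a small domain-specific model).
    semantic_parts = perceiver.detect_parts(observation)

    # 2. Interpret the natural-language instruction into an action program,
    #    conditioned on the observed semantic parts.
    action_program = interpreter.propose_program(instruction, semantic_parts)

    result = None
    for _ in range(max_retries):
        # 3. Ground semantic parts into GAParts, which carry part-motion information.
        gaparts: List[GAPart] = grounder.ground(semantic_parts, action_program)

        # 4. Predict end-effector trajectories on the GAParts; together with the
        #    action program they form an executable policy.
        trajectories = [trajectory_predictor.predict(part) for part in gaparts]

        # 5. Execute; on failure, interactive feedback updates the observation
        #    and the loop retries, which is what "closes the loop" refers to.
        result = executor.execute(action_program, trajectories)
        if result.success:
            return result
        observation = result.new_observation
        semantic_parts = perceiver.detect_parts(observation)

    return result
```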