Vid2Robot: End-to-end Video-conditioned Policy Learning with Cross-Attention Transformers
Format: Article
Language: English
Online access: Order full text
Abstract: Large-scale multi-task robotic manipulation systems often rely on text to specify the task. In this work, we explore whether a robot can learn by observing humans. To do so, the robot must understand a person's intent and perform the inferred task despite differences in embodiments and environments. We introduce Vid2Robot, an end-to-end video-conditioned policy that takes human videos demonstrating manipulation tasks as input and produces robot actions. Our model is trained on a large dataset of prompt-video/robot-trajectory pairs to learn unified representations of human and robot actions from videos. Vid2Robot uses cross-attention transformer layers between video features and the current robot state to produce actions and perform the same task as shown in the video. We use auxiliary contrastive losses to align the prompt and robot video representations for better policies. We evaluate Vid2Robot on real-world robots and observe over 20% improvement over BC-Z when using human prompt videos. Further, we show a cross-object motion transfer ability that enables video-conditioned policies to transfer a motion observed on one object in the prompt video to another object in the robot's own environment. Videos available at https://vid2robot.github.io
DOI: 10.48550/arxiv.2403.12943
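
To make the two mechanisms named in the abstract concrete (cross-attention between prompt-video features and the current robot state, plus an auxiliary contrastive loss aligning prompt and robot video representations), here is a minimal PyTorch-style sketch. It is not the authors' implementation: the encoders, dimensions, number of layers, and the discretized action head are all illustrative assumptions.

```python
# Sketch only (not the Vid2Robot codebase): robot-state tokens cross-attend to
# prompt-video tokens, and an InfoNCE-style loss aligns prompt/robot video
# embeddings of the same task. Module names and sizes are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class Vid2RobotSketch(nn.Module):
    def __init__(self, dim=512, n_heads=8, n_layers=4,
                 n_action_bins=256, action_dims=7):
        super().__init__()
        # Placeholder projections standing in for learned video/state encoders.
        self.prompt_proj = nn.Linear(dim, dim)   # human prompt-video tokens
        self.state_proj = nn.Linear(dim, dim)    # current robot observation tokens
        self.cross_attn = nn.ModuleList([
            nn.MultiheadAttention(dim, n_heads, batch_first=True)
            for _ in range(n_layers)
        ])
        # Discretized action head (one classifier per action dimension);
        # a common choice for manipulation policies, assumed here.
        self.action_head = nn.Linear(dim, n_action_bins * action_dims)
        self.action_dims, self.n_action_bins = action_dims, n_action_bins

    def forward(self, prompt_tokens, state_tokens):
        # prompt_tokens: (B, T_p, dim); state_tokens: (B, T_s, dim)
        prompt = self.prompt_proj(prompt_tokens)
        query = self.state_proj(state_tokens)
        for attn in self.cross_attn:
            # Robot-state queries attend to prompt-video keys/values.
            attended, _ = attn(query=query, key=prompt, value=prompt)
            query = query + attended  # residual connection
        pooled = query.mean(dim=1)
        logits = self.action_head(pooled)
        return logits.view(-1, self.action_dims, self.n_action_bins)


def video_alignment_loss(prompt_emb, robot_emb, temperature=0.07):
    """Auxiliary contrastive loss: matching prompt/robot video embeddings
    (same task) lie on the diagonal of the similarity matrix."""
    prompt_emb = F.normalize(prompt_emb, dim=-1)
    robot_emb = F.normalize(robot_emb, dim=-1)
    logits = prompt_emb @ robot_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```

In training, the imitation loss on the predicted action logits would be combined with `video_alignment_loss` computed on pooled prompt-video and robot-video embeddings; the relative weighting of the two terms is not specified in this record.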