Scaling Robot Policy Learning via Zero-Shot Labeling with Foundation Models
Format: Article
Language: English
Abstract: A central challenge towards developing robots that can relate human language
to their perception and actions is the scarcity of natural language annotations
in diverse robot datasets. Moreover, robot policies that follow natural
language instructions are typically trained on either templated language or
expensive human-labeled instructions, hindering their scalability. To this end,
we introduce NILS: Natural language Instruction Labeling for Scalability. NILS
automatically labels uncurated, long-horizon robot data at scale in a zero-shot
manner without any human intervention. NILS combines pretrained vision-language
foundation models in order to detect objects in a scene, detect object-centric
changes, segment tasks from large datasets of unlabelled interaction data and
ultimately label behavior datasets. Evaluations on BridgeV2, Fractal, and a
kitchen play dataset show that NILS can autonomously annotate diverse robot
demonstrations of unlabeled and unstructured datasets while alleviating several
shortcomings of crowdsourced human annotations, such as low data quality and
diversity. We use NILS to label over 115k trajectories obtained from over 430
hours of robot data. We open-source our auto-labeling code and generated
annotations on our website: http://robottasklabeling.github.io.
DOI: 10.48550/arxiv.2410.17772
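The abstract describes NILS as a composition of pretrained vision-language foundation models that detect scene objects, detect object-centric changes, segment long-horizon data into tasks, and generate instructions. Below is a minimal sketch of what such a pipeline could look like, based only on the abstract; every function, class, and parameter name here is a hypothetical placeholder, not the authors' implementation or any real model API.

```python
# Hypothetical sketch of a NILS-style zero-shot labeling pipeline.
# The caller supplies two foundation-model wrappers: an open-vocabulary
# object detector and a language model that describes scene changes.
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class Frame:
    """One camera frame from an uncurated, long-horizon robot trajectory."""
    image: bytes
    timestamp: float


@dataclass
class LabeledSegment:
    """A task segment with its generated natural-language instruction."""
    start: int
    end: int
    instruction: str


def label_trajectory(
    frames: Sequence[Frame],
    detect_objects: Callable[[Frame], set[str]],          # assumed: open-vocabulary detector
    describe_change: Callable[[set[str], set[str]], str],  # assumed: LLM prompted with scene deltas
) -> list[LabeledSegment]:
    """Split a trajectory wherever the detected object state changes, then
    describe each change as a natural-language instruction."""
    segments: list[LabeledSegment] = []
    start = 0
    prev_objects = detect_objects(frames[0])
    for i in range(1, len(frames)):
        cur_objects = detect_objects(frames[i])
        if cur_objects != prev_objects:  # object-centric change -> task boundary
            segments.append(
                LabeledSegment(start=start, end=i,
                               instruction=describe_change(prev_objects, cur_objects))
            )
            start = i
            prev_objects = cur_objects
    return segments
```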