Diffusion Transformer Policy
Main authors:
Format: Article
Language: English
Subjects:
Online access: Order full text
Abstract: Recent large vision-language-action models pretrained on diverse robot datasets have demonstrated the potential to generalize to new environments with only a small amount of in-domain data. However, those approaches usually predict discretized or continuous actions with a small action head, which limits their ability to handle diverse action spaces. In contrast, we model continuous actions with a large multi-modal diffusion transformer, dubbed Diffusion Transformer Policy, in which action chunks are denoised directly by a large transformer model rather than a small action head. By leveraging the scaling capability of transformers, the proposed approach can effectively model continuous end-effector actions across large, diverse robot datasets and achieve better generalization performance. Extensive experiments demonstrate that Diffusion Transformer Policy, pretrained on diverse robot data, generalizes to different embodiments, including simulation environments such as Maniskill2 and Calvin as well as a real-world Franka arm. Specifically, without bells and whistles, the proposed approach achieves state-of-the-art performance with only a single third-person-view camera stream in the Calvin novel-task setting (ABC->D), improving the average number of tasks completed in a row (out of 5) to 3.6, and the pretraining stage improves the success sequence length on Calvin by over 1.2. The code will be made publicly available.
DOI: 10.48550/arxiv.2410.15959
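For illustration, the core idea described in the abstract (denoising a chunk of continuous end-effector actions with a full transformer, conditioned on the observation, instead of predicting actions through a small action head) can be sketched as below. This is a minimal sketch, not the authors' released implementation: the class and function names, the module sizes, the DDPM-style noise schedule, and the single-token conditioning scheme are all illustrative assumptions.

```python
# Minimal illustrative sketch of a diffusion-transformer policy (assumed design,
# not the paper's code): a transformer denoises a chunk of continuous actions,
# conditioned on an observation feature and the diffusion timestep.
import torch
import torch.nn as nn


class DiffusionTransformerPolicySketch(nn.Module):
    def __init__(self, action_dim=7, chunk_len=16, obs_dim=512,
                 d_model=256, n_layers=6, n_heads=8, n_steps=100):
        super().__init__()
        self.n_steps = n_steps
        self.action_in = nn.Linear(action_dim, d_model)      # embed noisy action chunk
        self.obs_in = nn.Linear(obs_dim, d_model)             # embed observation features
        self.t_embed = nn.Embedding(n_steps, d_model)          # diffusion timestep embedding
        self.pos = nn.Parameter(torch.zeros(1, chunk_len, d_model))
        layer = nn.TransformerEncoderLayer(d_model, n_heads, 4 * d_model,
                                           batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, n_layers)  # the "large" denoiser
        self.out = nn.Linear(d_model, action_dim)              # predict the noise

    def forward(self, noisy_actions, obs_feat, t):
        # noisy_actions: (B, chunk_len, action_dim), obs_feat: (B, obs_dim), t: (B,)
        a = self.action_in(noisy_actions) + self.pos
        cond = (self.obs_in(obs_feat) + self.t_embed(t)).unsqueeze(1)
        tokens = torch.cat([cond, a], dim=1)                   # prepend one conditioning token
        return self.out(self.backbone(tokens)[:, 1:])          # noise estimate per action step


def training_step(model, actions, obs_feat):
    # One DDPM-style training step under a simple linear beta schedule
    # (an assumption; the paper may use a different scheduler).
    B = actions.shape[0]
    betas = torch.linspace(1e-4, 2e-2, model.n_steps)
    alphas_bar = torch.cumprod(1.0 - betas, dim=0)
    t = torch.randint(0, model.n_steps, (B,))
    noise = torch.randn_like(actions)
    ab = alphas_bar[t].view(B, 1, 1)
    noisy = ab.sqrt() * actions + (1 - ab).sqrt() * noise      # forward diffusion of the chunk
    pred = model(noisy, obs_feat, t)
    return nn.functional.mse_loss(pred, noise)                 # epsilon-prediction objective
```

At inference time such a model would start from Gaussian noise and iteratively denoise the action chunk given the current observation; the specific sampler and camera/language conditioning used by the paper are not reproduced here.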