Predictive Inverse Dynamics Models are Scalable Learners for Robotic Manipulation

Bibliographic Details
Published in: arXiv.org 2024-12
Main authors: Yang, Tian; Yang, Sizhe; Zeng, Jia; Wang, Ping; Lin, Dahua; Dong, Hao; Pang, Jiangmiao
Format: Article
Language: eng
Subjects: Benchmarks; Cloning; Datasets; Inverse dynamics; Robotics; Vision
Online access: Full text
container_title arXiv.org
creator Yang, Tian
Yang, Sizhe
Zeng, Jia
Wang, Ping
Lin, Dahua
Dong, Hao
Pang, Jiangmiao
description Current efforts to learn scalable policies in robotic manipulation primarily fall into two categories: one focuses on "action," which involves behavior cloning from extensive collections of robotic data, while the other emphasizes "vision," enhancing model generalization by pre-training representations or generative models, also referred to as world models, using large-scale visual datasets. This paper presents an end-to-end paradigm that predicts actions using inverse dynamics models conditioned on the robot's forecasted visual states, named Predictive Inverse Dynamics Models (PIDM). By closing the loop between vision and action, the end-to-end PIDM can be a better scalable action learner. In practice, we use Transformers to process both visual states and actions, naming the model Seer. It is initially pre-trained on large-scale robotic datasets, such as DROID, and can be adapted to real-world scenarios with a small amount of fine-tuning data. Thanks to large-scale, end-to-end training and the synergy between vision and action, Seer significantly outperforms previous methods across both simulation and real-world experiments. It achieves improvements of 13% on the LIBERO-LONG benchmark, 21% on CALVIN ABC-D, and 43% in real-world tasks. Notably, Seer sets a new state of the art on the CALVIN ABC-D benchmark, achieving an average length of 4.28, and exhibits superior generalization to novel objects, lighting conditions, and environments under high-intensity disturbances in real-world scenarios. Code and models are publicly available at https://github.com/OpenRobotLab/Seer/.
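The description above characterizes PIDM as closing the loop between vision and action: a foresight component forecasts a future visual state, and an inverse dynamics model predicts the action that connects the current state to that forecast. The snippet below is a minimal PyTorch sketch of that idea only; the module names, dimensions, pooling, action dimensionality, and single-step setup are illustrative assumptions and do not reproduce the authors' Seer architecture (see the linked repository for the actual implementation).

```python
# Minimal sketch of the PIDM idea: forecast a future visual state, then
# predict the action linking the current state to that forecast.
# All names and sizes are illustrative assumptions, not the Seer model.
import torch
import torch.nn as nn


class ForesightModule(nn.Module):
    """Forecasts an embedding of a future visual state from the current one."""

    def __init__(self, dim: int = 256):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, visual_tokens: torch.Tensor) -> torch.Tensor:
        # visual_tokens: (batch, num_tokens, dim) features from a vision backbone
        return self.encoder(visual_tokens)


class InverseDynamicsHead(nn.Module):
    """Maps (current state, forecasted state) to a low-level action."""

    def __init__(self, dim: int = 256, action_dim: int = 7):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, action_dim)
        )

    def forward(self, current: torch.Tensor, forecast: torch.Tensor) -> torch.Tensor:
        # Pool over tokens, then condition on both current and predicted states.
        pooled = torch.cat([current.mean(dim=1), forecast.mean(dim=1)], dim=-1)
        return self.mlp(pooled)


if __name__ == "__main__":
    batch, tokens, dim = 2, 16, 256
    current_state = torch.randn(batch, tokens, dim)  # e.g. patch features

    foresight = ForesightModule(dim)
    idm = InverseDynamicsHead(dim, action_dim=7)  # assumed 6-DoF pose + gripper

    predicted_future = foresight(current_state)   # forecasted visual state
    action = idm(current_state, predicted_future) # action closing the loop
    print(action.shape)  # torch.Size([2, 7])
```

Per the abstract, the actual model is trained end-to-end, pre-trained on large robotic datasets such as DROID, and adapted with a small amount of fine-tuning data; the sketch only illustrates the forward pass of the vision-conditioned action prediction.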
format Article
fulltext fulltext
identifier EISSN: 2331-8422
ispartof arXiv.org, 2024-12
issn 2331-8422
language eng
recordid cdi_proquest_journals_3147567376
source Free E-Journals
subjects Benchmarks
Cloning
Datasets
Inverse dynamics
Robotics
Vision
title Predictive Inverse Dynamics Models are Scalable Learners for Robotic Manipulation