Predictive Inverse Dynamics Models are Scalable Learners for Robotic Manipulation

Bibliographic Details
Published in: arXiv.org 2024-12
Main authors: Yang, Tian; Yang, Sizhe; Zeng, Jia; Wang, Ping; Lin, Dahua; Dong, Hao; Pang, Jiangmiao
Format: Article
Language: eng
Subjects: Benchmarks; Cloning; Datasets; Inverse dynamics; Robotics; Vision
Online access: Full text
container_title arXiv.org
creator Yang, Tian
Yang, Sizhe
Zeng, Jia
Wang, Ping
Lin, Dahua
Dong, Hao
Pang, Jiangmiao
description Current efforts to learn scalable policies in robotic manipulation primarily fall into two categories: one focuses on "action," which involves behavior cloning from extensive collections of robotic data, while the other emphasizes "vision," enhancing model generalization by pre-training representations or generative models, also referred to as world models, using large-scale visual datasets. This paper presents an end-to-end paradigm that predicts actions using inverse dynamics models conditioned on the robot's forecasted visual states, named Predictive Inverse Dynamics Models (PIDM). By closing the loop between vision and action, the end-to-end PIDM can be a better scalable action learner. In practice, we use Transformers to process both visual states and actions, naming the model Seer. It is initially pre-trained on large-scale robotic datasets, such as DROID, and can be adapted to real-world scenarios with a small amount of fine-tuning data. Thanks to large-scale, end-to-end training and the synergy between vision and action, Seer significantly outperforms previous methods across both simulation and real-world experiments. It achieves improvements of 13% on the LIBERO-LONG benchmark, 21% on CALVIN ABC-D, and 43% in real-world tasks. Notably, Seer sets a new state of the art on the CALVIN ABC-D benchmark, achieving an average length of 4.28, and exhibits superior generalization to novel objects, lighting conditions, and environments under high-intensity disturbances in real-world scenarios. Code and models are publicly available at https://github.com/OpenRobotLab/Seer/.
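The description above characterizes PIDM as closing the loop between vision and action: a foresight component forecasts a future visual state, and an inverse dynamics model predicts the action that connects the current state to that forecast. The snippet below is a minimal PyTorch sketch of that idea only; the module names, dimensions, pooling, action dimensionality, and single-step setup are illustrative assumptions and do not reproduce the authors' Seer architecture (see the linked repository for the actual implementation).

```python
# Minimal sketch of the PIDM idea: forecast a future visual state, then
# predict the action linking the current state to that forecast.
# All names and sizes are illustrative assumptions, not the Seer model.
import torch
import torch.nn as nn


class ForesightModule(nn.Module):
    """Forecasts an embedding of a future visual state from the current one."""

    def __init__(self, dim: int = 256):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, visual_tokens: torch.Tensor) -> torch.Tensor:
        # visual_tokens: (batch, num_tokens, dim) features from a vision backbone
        return self.encoder(visual_tokens)


class InverseDynamicsHead(nn.Module):
    """Maps (current state, forecasted state) to a low-level action."""

    def __init__(self, dim: int = 256, action_dim: int = 7):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, action_dim)
        )

    def forward(self, current: torch.Tensor, forecast: torch.Tensor) -> torch.Tensor:
        # Pool over tokens, then condition on both current and predicted states.
        pooled = torch.cat([current.mean(dim=1), forecast.mean(dim=1)], dim=-1)
        return self.mlp(pooled)


if __name__ == "__main__":
    batch, tokens, dim = 2, 16, 256
    current_state = torch.randn(batch, tokens, dim)  # e.g. patch features

    foresight = ForesightModule(dim)
    idm = InverseDynamicsHead(dim, action_dim=7)  # assumed 6-DoF pose + gripper

    predicted_future = foresight(current_state)   # forecasted visual state
    action = idm(current_state, predicted_future) # action closing the loop
    print(action.shape)  # torch.Size([2, 7])
```

Per the abstract, the actual model is trained end-to-end, pre-trained on large robotic datasets such as DROID, and adapted with a small amount of fine-tuning data; the sketch only illustrates the forward pass of the vision-conditioned action prediction.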
format Article
fulltext fulltext
identifier EISSN: 2331-8422
ispartof arXiv.org, 2024-12
issn 2331-8422
language eng
recordid cdi_proquest_journals_3147567376
source Free E-Journals
subjects Benchmarks
Cloning
Datasets
Inverse dynamics
Robotics
Vision
title Predictive Inverse Dynamics Models are Scalable Learners for Robotic Manipulation