Predictive Inverse Dynamics Models are Scalable Learners for Robotic Manipulation
Published in: | arXiv.org 2024-12 |
---|---|
Main authors: | Yang, Tian; Yang, Sizhe; Zeng, Jia; Wang, Ping; Lin, Dahua; Dong, Hao; Pang, Jiangmiao |
Format: | Article |
Language: | English |
Keywords: | Benchmarks; Cloning; Datasets; Inverse dynamics; Robotics; Vision |
Online access: | Full text |
container_title | arXiv.org |
---|---|
creator | Yang, Tian; Yang, Sizhe; Zeng, Jia; Wang, Ping; Lin, Dahua; Dong, Hao; Pang, Jiangmiao |
description | Current efforts to learn scalable policies in robotic manipulation primarily fall into two categories: one focuses on "action," which involves behavior cloning from extensive collections of robotic data, while the other emphasizes "vision," enhancing model generalization by pre-training representations or generative models, also referred to as world models, using large-scale visual datasets. This paper presents an end-to-end paradigm that predicts actions using inverse dynamics models conditioned on the robot's forecasted visual states, named Predictive Inverse Dynamics Models (PIDM). By closing the loop between vision and action, the end-to-end PIDM can be a better scalable action learner. In practice, we use Transformers to process both visual states and actions, naming the model Seer. It is initially pre-trained on large-scale robotic datasets, such as DROID, and can be adapted to real-world scenarios with a small amount of fine-tuning data. Thanks to large-scale, end-to-end training and the synergy between vision and action, Seer significantly outperforms previous methods across both simulation and real-world experiments. It achieves improvements of 13% on the LIBERO-LONG benchmark, 21% on CALVIN ABC-D, and 43% in real-world tasks. Notably, Seer sets a new state of the art on the CALVIN ABC-D benchmark, achieving an average length of 4.28, and exhibits superior generalization to novel objects, lighting conditions, and environments under high-intensity disturbances in real-world scenarios. Code and models are publicly available at https://github.com/OpenRobotLab/Seer/. |
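To make the paradigm concrete, the sketch below illustrates the training loop implied by the abstract: a forecasting module predicts the robot's future visual state, and an inverse dynamics head predicts the action conditioned on the current and forecasted states, with both objectives optimized end-to-end. This is a minimal illustration only, not the released Seer code; the module names, dimensions, use of MLPs over precomputed visual embeddings (in place of the paper's Transformer over image tokens and actions), and the loss weighting are all assumptions.

```python
# Minimal sketch of a Predictive Inverse Dynamics Model (PIDM), assuming
# precomputed visual embeddings; dimensions and architecture are illustrative.
import torch
import torch.nn as nn


class PIDM(nn.Module):
    def __init__(self, obs_dim=512, act_dim=7, hidden_dim=256):
        super().__init__()
        # Forecasting module: predicts the embedding of the future visual state.
        self.forecaster = nn.Sequential(
            nn.Linear(obs_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, obs_dim),
        )
        # Inverse dynamics head: predicts the action that connects the current
        # state to the forecasted state, closing the loop between vision and action.
        self.inverse_dynamics = nn.Sequential(
            nn.Linear(2 * obs_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, act_dim),
        )

    def forward(self, obs_embed):
        pred_next = self.forecaster(obs_embed)  # forecasted visual state
        action = self.inverse_dynamics(torch.cat([obs_embed, pred_next], dim=-1))
        return pred_next, action


def training_step(model, obs_embed, next_obs_embed, expert_action, vis_weight=1.0):
    # End-to-end objective: supervise the forecasted state (visual loss) and the
    # predicted action (behavior-cloning loss) on demonstration data.
    pred_next, action = model(obs_embed)
    vis_loss = nn.functional.mse_loss(pred_next, next_obs_embed)
    act_loss = nn.functional.mse_loss(action, expert_action)
    return act_loss + vis_weight * vis_loss


# Example usage with random tensors standing in for embeddings and actions.
model = PIDM()
loss = training_step(model, torch.randn(8, 512), torch.randn(8, 512), torch.randn(8, 7))
loss.backward()
```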
format | Article |
fulltext | fulltext |
identifier | EISSN: 2331-8422 |
ispartof | arXiv.org, 2024-12 |
issn | 2331-8422 |
language | eng |
recordid | cdi_proquest_journals_3147567376 |
source | Free E-Journals |
subjects | Benchmarks; Cloning; Datasets; Inverse dynamics; Robotics; Vision |
title | Predictive Inverse Dynamics Models are Scalable Learners for Robotic Manipulation |
url | https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-01T19%3A50%3A50IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest&rft_val_fmt=info:ofi/fmt:kev:mtx:book&rft.genre=document&rft.atitle=Predictive%20Inverse%20Dynamics%20Models%20are%20Scalable%20Learners%20for%20Robotic%20Manipulation&rft.jtitle=arXiv.org&rft.au=Yang,%20Tian&rft.date=2024-12-19&rft.eissn=2331-8422&rft_id=info:doi/&rft_dat=%3Cproquest%3E3147567376%3C/proquest%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=3147567376&rft_id=info:pmid/&rfr_iscdi=true |