ManipGPT: Is Affordance Segmentation by Large Vision Models Enough for Articulated Object Manipulation?

Visual actionable affordance has emerged as a transformative approach in robotics, focusing on perceiving interaction areas prior to manipulation. Traditional methods rely on pixel sampling to identify successful interaction samples or on point-cloud processing for affordance mapping. However, these approaches are computationally intensive and struggle to adapt to diverse and dynamic environments. This paper introduces ManipGPT, a framework designed to predict optimal interaction areas for articulated objects using a large pre-trained vision transformer (ViT). We created a dataset of 9.9k simulated and real images to bridge the sim-to-real gap and enhance real-world applicability. By fine-tuning the vision transformer on this small dataset, we significantly improved part-level affordance segmentation, adapting the model's in-context segmentation capabilities to robot manipulation scenarios. This enables effective manipulation across simulated and real-world environments by generating part-level affordance masks paired with an impedance adaptation policy, effectively eliminating the need for complex datasets or perception systems.
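
The pipeline the abstract describes (a fine-tuned ViT produces a part-level affordance mask, a contact point is chosen from that mask, and an impedance-style controller carries out the motion) can be pictured with the minimal sketch below. Everything in it is an illustrative assumption: predict_affordance_mask is a stand-in for the paper's fine-tuned vision transformer, and the impedance law and gains are a generic textbook formulation, not ManipGPT's actual adaptation policy.

```python
# Hypothetical sketch of an affordance-mask-driven manipulation step.
# The mask predictor, contact-point rule, and impedance gains are assumptions
# made for illustration; they do not reproduce ManipGPT's implementation.
import numpy as np

def predict_affordance_mask(rgb: np.ndarray) -> np.ndarray:
    """Stand-in for the fine-tuned ViT: per-pixel affordance scores in [0, 1]."""
    h, w, _ = rgb.shape
    ys, xs = np.mgrid[0:h, 0:w]
    # Placeholder prediction: a Gaussian blob centered on the image.
    return np.exp(-((ys - h / 2) ** 2 + (xs - w / 2) ** 2) / (0.02 * h * w))

def select_contact_pixel(mask: np.ndarray) -> tuple[int, int]:
    """Choose the pixel with the highest affordance score as the interaction point."""
    return np.unravel_index(int(np.argmax(mask)), mask.shape)

def impedance_force(x_des, x, x_dot, stiffness=300.0, damping=30.0) -> np.ndarray:
    """Generic Cartesian impedance law: F = K (x_des - x) - D x_dot."""
    return stiffness * (np.asarray(x_des) - np.asarray(x)) - damping * np.asarray(x_dot)

rgb = np.zeros((480, 640, 3), dtype=np.uint8)  # dummy camera frame
row, col = select_contact_pixel(predict_affordance_mask(rgb))
force = impedance_force(x_des=[0.05, 0.0, 0.0], x=[0.0, 0.0, 0.0], x_dot=[0.0, 0.0, 0.0])
print(f"contact pixel: ({row}, {col}); commanded Cartesian force: {force}")
```

In practice the stand-in predictor would be replaced by the fine-tuned segmentation model, and the stiffness and damping would be adapted online rather than fixed.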

Bibliographic Details
Published in: arXiv.org, 2024-12
Main authors: Kim, Taewhan; Bae, Hojin; Li, Zeming; Li, Xiaoqi; Ponomarenko, Iaroslav; Wu, Ruihai; Dong, Hao
Format: Article
Language: English
Subjects: Datasets; Image manipulation; Robotics; Vision
EISSN: 2331-8422
Publisher: Cornell University Library, arXiv.org (Ithaca)
Rights: Published under the Creative Commons Attribution 4.0 License (http://creativecommons.org/licenses/by/4.0/)
Online access: Full text