ManipGPT: Is Affordance Segmentation by Large Vision Models Enough for Articulated Object Manipulation?

Visual actionable affordance has emerged as a transformative approach in robotics, focusing on perceiving interaction areas prior to manipulation. Traditional methods rely on pixel sampling to identify successful interaction samples or on point-cloud processing for affordance mapping. However, these approaches are computationally intensive and struggle to adapt to diverse and dynamic environments. This paper introduces ManipGPT, a framework designed to predict optimal interaction areas for articulated objects using a large pre-trained vision transformer (ViT). We created a dataset of 9.9k simulated and real images to bridge the sim-to-real gap and enhance real-world applicability. By fine-tuning the vision transformer on this small dataset, we significantly improved part-level affordance segmentation, adapting the model's in-context segmentation capabilities to robot manipulation scenarios. This enables effective manipulation across simulated and real-world environments by generating part-level affordance masks paired with an impedance adaptation policy, effectively eliminating the need for complex datasets or perception systems.
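
The pipeline the abstract describes (a fine-tuned ViT produces a part-level affordance mask, a contact point is chosen from that mask, and an impedance-style controller carries out the motion) can be pictured with the minimal sketch below. Everything in it is an illustrative assumption: predict_affordance_mask is a stand-in for the paper's fine-tuned vision transformer, and the impedance law and gains are a generic textbook formulation, not ManipGPT's actual adaptation policy.

```python
# Hypothetical sketch of an affordance-mask-driven manipulation step.
# The mask predictor, contact-point rule, and impedance gains are assumptions
# made for illustration; they do not reproduce ManipGPT's implementation.
import numpy as np

def predict_affordance_mask(rgb: np.ndarray) -> np.ndarray:
    """Stand-in for the fine-tuned ViT: per-pixel affordance scores in [0, 1]."""
    h, w, _ = rgb.shape
    ys, xs = np.mgrid[0:h, 0:w]
    # Placeholder prediction: a Gaussian blob centered on the image.
    return np.exp(-((ys - h / 2) ** 2 + (xs - w / 2) ** 2) / (0.02 * h * w))

def select_contact_pixel(mask: np.ndarray) -> tuple[int, int]:
    """Choose the pixel with the highest affordance score as the interaction point."""
    return np.unravel_index(int(np.argmax(mask)), mask.shape)

def impedance_force(x_des, x, x_dot, stiffness=300.0, damping=30.0) -> np.ndarray:
    """Generic Cartesian impedance law: F = K (x_des - x) - D x_dot."""
    return stiffness * (np.asarray(x_des) - np.asarray(x)) - damping * np.asarray(x_dot)

rgb = np.zeros((480, 640, 3), dtype=np.uint8)  # dummy camera frame
row, col = select_contact_pixel(predict_affordance_mask(rgb))
force = impedance_force(x_des=[0.05, 0.0, 0.0], x=[0.0, 0.0, 0.0], x_dot=[0.0, 0.0, 0.0])
print(f"contact pixel: ({row}, {col}); commanded Cartesian force: {force}")
```

In practice the stand-in predictor would be replaced by the fine-tuned segmentation model, and the stiffness and damping would be adapted online rather than fixed.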

Bibliographic Details
Published in: arXiv.org, 2024-12
Main authors: Kim, Taewhan; Bae, Hojin; Li, Zeming; Li, Xiaoqi; Ponomarenko, Iaroslav; Wu, Ruihai; Dong, Hao
Format: Article
Language: English
Subjects: Datasets; Image manipulation; Robotics; Vision
EISSN: 2331-8422
Publisher: Cornell University Library, arXiv.org (Ithaca)
Rights: Published under the Creative Commons Attribution 4.0 License (http://creativecommons.org/licenses/by/4.0/)
Online access: Full text