ManipGPT: Is Affordance Segmentation by Large Vision Models Enough for Articulated Object Manipulation?
Visual actionable affordance has emerged as a transformative approach in robotics, focusing on perceiving interaction areas prior to manipulation. Traditional methods rely on pixel sampling to identify successful interaction samples or processing pointclouds for affordance mapping. However, these ap...
Published in: | arXiv.org 2024-12 |
---|---|
Main authors: | Kim, Taewhan; Bae, Hojin; Li, Zeming; Li, Xiaoqi; Ponomarenko, Iaroslav; Wu, Ruihai; Dong, Hao |
Format: | Article |
Language: | eng |
Subjects: | Datasets; Image manipulation; Robotics; Vision |
Online access: | Full text |
container_end_page | |
---|---|
container_issue | |
container_start_page | |
container_title | arXiv.org |
container_volume | |
creator | Kim, Taewhan ; Bae, Hojin ; Li, Zeming ; Li, Xiaoqi ; Ponomarenko, Iaroslav ; Wu, Ruihai ; Dong, Hao |
description | Visual actionable affordance has emerged as a transformative approach in robotics, focusing on perceiving interaction areas prior to manipulation. Traditional methods rely on pixel sampling to identify successful interaction samples or processing pointclouds for affordance mapping. However, these approaches are computationally intensive and struggle to adapt to diverse and dynamic environments. This paper introduces ManipGPT, a framework designed to predict optimal interaction areas for articulated objects using a large pre-trained vision transformer (ViT). We created a dataset of 9.9k simulated and real images to bridge the sim-to-real gap and enhance real-world applicability. By fine-tuning the vision transformer on this small dataset, we significantly improved part-level affordance segmentation, adapting the model's in-context segmentation capabilities to robot manipulation scenarios. This enables effective manipulation across simulated and real-world environments by generating part-level affordance masks, paired with an impedance adaptation policy, sufficiently eliminating the need for complex datasets or perception systems. |
format | Article |
fullrecord | <record><control><sourceid>proquest</sourceid><recordid>TN_cdi_proquest_journals_3145273022</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><sourcerecordid>3145273022</sourcerecordid><originalsourceid>FETCH-proquest_journals_31452730223</originalsourceid><addsrcrecordid>eNqNjMsKgkAYhYcgSMp3-KG1oDNa0SYk7AJFQdJWJv21EZuxmXHR22fRA7Q6nMt3BsShjAXeIqR0RFxjat_36WxOo4g5pDpyKdrtOV3C3kBclkoXXOYIF6weKC23Qkm4veDAdYVwFebjj6rAxkAiVVfdoWcg1lbkXcMtFnC61Zhb-D5_op5YTciw5I1B96djMt0k6XrntVo9OzQ2q1WnZV9lLAgjOmc-pey_1RtzTUab</addsrcrecordid><sourcetype>Aggregation Database</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>3145273022</pqid></control><display><type>article</type><title>ManipGPT: Is Affordance Segmentation by Large Vision Models Enough for Articulated Object Manipulation?</title><source>Free E- Journals</source><creator>Kim, Taewhan ; Bae, Hojin ; Li, Zeming ; Li, Xiaoqi ; Ponomarenko, Iaroslav ; Wu, Ruihai ; Dong, Hao</creator><creatorcontrib>Kim, Taewhan ; Bae, Hojin ; Li, Zeming ; Li, Xiaoqi ; Ponomarenko, Iaroslav ; Wu, Ruihai ; Dong, Hao</creatorcontrib><description>Visual actionable affordance has emerged as a transformative approach in robotics, focusing on perceiving interaction areas prior to manipulation. Traditional methods rely on pixel sampling to identify successful interaction samples or processing pointclouds for affordance mapping. However, these approaches are computationally intensive and struggle to adapt to diverse and dynamic environments. This paper introduces ManipGPT, a framework designed to predict optimal interaction areas for articulated objects using a large pre-trained vision transformer (ViT). We created a dataset of 9.9k simulated and real images to bridge the sim-to-real gap and enhance real-world applicability. 
By fine-tuning the vision transformer on this small dataset, we significantly improved part-level affordance segmentation, adapting the model's in-context segmentation capabilities to robot manipulation scenarios. This enables effective manipulation across simulated and real-world environments by generating part-level affordance masks, paired with an impedance adaptation policy, sufficiently eliminating the need for complex datasets or perception systems.</description><identifier>EISSN: 2331-8422</identifier><language>eng</language><publisher>Ithaca: Cornell University Library, arXiv.org</publisher><subject>Datasets ; Image manipulation ; Robotics ; Vision</subject><ispartof>arXiv.org, 2024-12</ispartof><rights>2024. This work is published under http://creativecommons.org/licenses/by/4.0/ (the “License”). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.</rights><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><link.rule.ids>778,782</link.rule.ids></links><search><creatorcontrib>Kim, Taewhan</creatorcontrib><creatorcontrib>Bae, Hojin</creatorcontrib><creatorcontrib>Li, Zeming</creatorcontrib><creatorcontrib>Li, Xiaoqi</creatorcontrib><creatorcontrib>Ponomarenko, Iaroslav</creatorcontrib><creatorcontrib>Wu, Ruihai</creatorcontrib><creatorcontrib>Dong, Hao</creatorcontrib><title>ManipGPT: Is Affordance Segmentation by Large Vision Models Enough for Articulated Object Manipulation?</title><title>arXiv.org</title><description>Visual actionable affordance has emerged as a transformative approach in robotics, focusing on perceiving interaction areas prior to manipulation. Traditional methods rely on pixel sampling to identify successful interaction samples or processing pointclouds for affordance mapping. 
However, these approaches are computationally intensive and struggle to adapt to diverse and dynamic environments. This paper introduces ManipGPT, a framework designed to predict optimal interaction areas for articulated objects using a large pre-trained vision transformer (ViT). We created a dataset of 9.9k simulated and real images to bridge the sim-to-real gap and enhance real-world applicability. By fine-tuning the vision transformer on this small dataset, we significantly improved part-level affordance segmentation, adapting the model's in-context segmentation capabilities to robot manipulation scenarios. This enables effective manipulation across simulated and real-world environments by generating part-level affordance masks, paired with an impedance adaptation policy, sufficiently eliminating the need for complex datasets or perception systems.</description><subject>Datasets</subject><subject>Image manipulation</subject><subject>Robotics</subject><subject>Vision</subject><issn>2331-8422</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2024</creationdate><recordtype>article</recordtype><sourceid>ABUWG</sourceid><sourceid>AFKRA</sourceid><sourceid>AZQEC</sourceid><sourceid>BENPR</sourceid><sourceid>CCPQU</sourceid><sourceid>DWQXO</sourceid><recordid>eNqNjMsKgkAYhYcgSMp3-KG1oDNa0SYk7AJFQdJWJv21EZuxmXHR22fRA7Q6nMt3BsShjAXeIqR0RFxjat_36WxOo4g5pDpyKdrtOV3C3kBclkoXXOYIF6weKC23Qkm4veDAdYVwFebjj6rAxkAiVVfdoWcg1lbkXcMtFnC61Zhb-D5_op5YTciw5I1B96djMt0k6XrntVo9OzQ2q1WnZV9lLAgjOmc-pey_1RtzTUab</recordid><startdate>20241218</startdate><enddate>20241218</enddate><creator>Kim, Taewhan</creator><creator>Bae, Hojin</creator><creator>Li, Zeming</creator><creator>Li, Xiaoqi</creator><creator>Ponomarenko, Iaroslav</creator><creator>Wu, Ruihai</creator><creator>Dong, Hao</creator><general>Cornell University Library, 
arXiv.org</general><scope>8FE</scope><scope>8FG</scope><scope>ABJCF</scope><scope>ABUWG</scope><scope>AFKRA</scope><scope>AZQEC</scope><scope>BENPR</scope><scope>BGLVJ</scope><scope>CCPQU</scope><scope>DWQXO</scope><scope>HCIFZ</scope><scope>L6V</scope><scope>M7S</scope><scope>PIMPY</scope><scope>PQEST</scope><scope>PQQKQ</scope><scope>PQUKI</scope><scope>PRINS</scope><scope>PTHSS</scope></search><sort><creationdate>20241218</creationdate><title>ManipGPT: Is Affordance Segmentation by Large Vision Models Enough for Articulated Object Manipulation?</title><author>Kim, Taewhan ; Bae, Hojin ; Li, Zeming ; Li, Xiaoqi ; Ponomarenko, Iaroslav ; Wu, Ruihai ; Dong, Hao</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-proquest_journals_31452730223</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2024</creationdate><topic>Datasets</topic><topic>Image manipulation</topic><topic>Robotics</topic><topic>Vision</topic><toplevel>online_resources</toplevel><creatorcontrib>Kim, Taewhan</creatorcontrib><creatorcontrib>Bae, Hojin</creatorcontrib><creatorcontrib>Li, Zeming</creatorcontrib><creatorcontrib>Li, Xiaoqi</creatorcontrib><creatorcontrib>Ponomarenko, Iaroslav</creatorcontrib><creatorcontrib>Wu, Ruihai</creatorcontrib><creatorcontrib>Dong, Hao</creatorcontrib><collection>ProQuest SciTech Collection</collection><collection>ProQuest Technology Collection</collection><collection>Materials Science & Engineering Collection</collection><collection>ProQuest Central (Alumni Edition)</collection><collection>ProQuest Central UK/Ireland</collection><collection>ProQuest Central Essentials</collection><collection>ProQuest Central</collection><collection>Technology Collection</collection><collection>ProQuest One Community College</collection><collection>ProQuest Central Korea</collection><collection>SciTech Premium Collection</collection><collection>ProQuest Engineering 
Collection</collection><collection>Engineering Database</collection><collection>Publicly Available Content Database</collection><collection>ProQuest One Academic Eastern Edition (DO NOT USE)</collection><collection>ProQuest One Academic</collection><collection>ProQuest One Academic UKI Edition</collection><collection>ProQuest Central China</collection><collection>Engineering Collection</collection></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>Kim, Taewhan</au><au>Bae, Hojin</au><au>Li, Zeming</au><au>Li, Xiaoqi</au><au>Ponomarenko, Iaroslav</au><au>Wu, Ruihai</au><au>Dong, Hao</au><format>book</format><genre>document</genre><ristype>GEN</ristype><atitle>ManipGPT: Is Affordance Segmentation by Large Vision Models Enough for Articulated Object Manipulation?</atitle><jtitle>arXiv.org</jtitle><date>2024-12-18</date><risdate>2024</risdate><eissn>2331-8422</eissn><abstract>Visual actionable affordance has emerged as a transformative approach in robotics, focusing on perceiving interaction areas prior to manipulation. Traditional methods rely on pixel sampling to identify successful interaction samples or processing pointclouds for affordance mapping. However, these approaches are computationally intensive and struggle to adapt to diverse and dynamic environments. This paper introduces ManipGPT, a framework designed to predict optimal interaction areas for articulated objects using a large pre-trained vision transformer (ViT). We created a dataset of 9.9k simulated and real images to bridge the sim-to-real gap and enhance real-world applicability. By fine-tuning the vision transformer on this small dataset, we significantly improved part-level affordance segmentation, adapting the model's in-context segmentation capabilities to robot manipulation scenarios. 
This enables effective manipulation across simulated and real-world environments by generating part-level affordance masks, paired with an impedance adaptation policy, sufficiently eliminating the need for complex datasets or perception systems.</abstract><cop>Ithaca</cop><pub>Cornell University Library, arXiv.org</pub><oa>free_for_read</oa></addata></record> |
fulltext | fulltext |
identifier | EISSN: 2331-8422 |
ispartof | arXiv.org, 2024-12 |
issn | 2331-8422 |
language | eng |
recordid | cdi_proquest_journals_3145273022 |
source | Free E- Journals |
subjects | Datasets ; Image manipulation ; Robotics ; Vision |
title | ManipGPT: Is Affordance Segmentation by Large Vision Models Enough for Articulated Object Manipulation? |
url | https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-15T09%3A38%3A11IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest&rft_val_fmt=info:ofi/fmt:kev:mtx:book&rft.genre=document&rft.atitle=ManipGPT:%20Is%20Affordance%20Segmentation%20by%20Large%20Vision%20Models%20Enough%20for%20Articulated%20Object%20Manipulation?&rft.jtitle=arXiv.org&rft.au=Kim,%20Taewhan&rft.date=2024-12-18&rft.eissn=2331-8422&rft_id=info:doi/&rft_dat=%3Cproquest%3E3145273022%3C/proquest%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=3145273022&rft_id=info:pmid/&rfr_iscdi=true |