SAGE: Bridging Semantic and Actionable Parts for GEneralizable Manipulation of Articulated Objects

To interact with daily-life articulated objects of diverse structures and functionalities, understanding the object parts plays a central role in both user instruction comprehension and task execution. However, the possible discordance between the semantic meaning and physics functionalities of the parts poses a challenge for designing a general system. To address this problem, we propose SAGE, a novel framework that bridges semantic and actionable parts of articulated objects to achieve generalizable manipulation under natural language instructions. More concretely, given an articulated object, we first observe all the semantic parts on it, conditioned on which an instruction interpreter proposes possible action programs that concretize the natural language instruction. Then, a part-grounding module maps the semantic parts into so-called Generalizable Actionable Parts (GAParts), which inherently carry information about part motion. End-effector trajectories are predicted on the GAParts, which, together with the action program, form an executable policy. Additionally, an interactive feedback module is incorporated to respond to failures, which closes the loop and increases the robustness of the overall framework. Key to the success of our framework is the joint proposal and knowledge fusion between a large vision-language model (VLM) and a small domain-specific model for both context comprehension and part perception, with the former providing general intuitions and the latter serving as expert facts. Both simulation and real-robot experiments show our effectiveness in handling a large variety of articulated objects with diverse language-instructed goals.
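
What follows is a minimal, hypothetical sketch of the pipeline the abstract describes: semantic-part observation, instruction interpretation into an action program, grounding to GAParts, and execution with a feedback loop. All module names, signatures, and return values (observe_semantic_parts, interpret_instruction, ground_to_gaparts, execute_with_feedback) are placeholders chosen for illustration; this only shows the data flow and is not the authors' implementation.

```python
# Hypothetical sketch of the pipeline described in the abstract above.
# Every name and return value here is an illustrative placeholder,
# NOT the authors' code or API.

from dataclasses import dataclass
from typing import List


@dataclass
class GAPart:
    """A Generalizable Actionable Part: a part hypothesis annotated with motion info."""
    name: str        # e.g. "handle", "door"
    joint_type: str  # e.g. "revolute", "prismatic", "fixed"


def observe_semantic_parts(observation) -> List[str]:
    # Stand-in for the joint VLM + domain-specific part-perception step.
    return ["door", "handle"]


def interpret_instruction(instruction: str, parts: List[str]) -> List[str]:
    # Stand-in for the instruction interpreter that emits an action program.
    return ["grasp handle", "pull door"]


def ground_to_gaparts(parts: List[str]) -> List[GAPart]:
    # Stand-in for the part-grounding module mapping semantic parts to GAParts.
    return [GAPart("handle", "fixed"), GAPart("door", "revolute")]


def execute_with_feedback(program: List[str], gaparts: List[GAPart]) -> bool:
    # Stand-in for trajectory prediction on GAParts, execution, and the
    # interactive feedback loop that reacts to failures.
    for step in program:
        print("executing:", step)
    return True


if __name__ == "__main__":
    observation = None  # stand-in for an RGB-D observation of the object
    parts = observe_semantic_parts(observation)
    program = interpret_instruction("open the door", parts)
    gaparts = ground_to_gaparts(parts)
    print("success:", execute_with_feedback(program, gaparts))
```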

Bibliographic Details
Main Authors: Geng, Haoran; Wei, Songlin; Deng, Congyue; Shen, Bokui; Wang, He; Guibas, Leonidas
Format: Article
Language: eng
Subjects: Computer Science - Computer Vision and Pattern Recognition; Computer Science - Robotics
Online Access: Order full text
creator Geng, Haoran; Wei, Songlin; Deng, Congyue; Shen, Bokui; Wang, He; Guibas, Leonidas
description To interact with daily-life articulated objects of diverse structures and functionalities, understanding the object parts plays a central role in both user instruction comprehension and task execution. However, the possible discordance between the semantic meaning and physics functionalities of the parts poses a challenge for designing a general system. To address this problem, we propose SAGE, a novel framework that bridges semantic and actionable parts of articulated objects to achieve generalizable manipulation under natural language instructions. More concretely, given an articulated object, we first observe all the semantic parts on it, conditioned on which an instruction interpreter proposes possible action programs that concretize the natural language instruction. Then, a part-grounding module maps the semantic parts into so-called Generalizable Actionable Parts (GAParts), which inherently carry information about part motion. End-effector trajectories are predicted on the GAParts, which, together with the action program, form an executable policy. Additionally, an interactive feedback module is incorporated to respond to failures, which closes the loop and increases the robustness of the overall framework. Key to the success of our framework is the joint proposal and knowledge fusion between a large vision-language model (VLM) and a small domain-specific model for both context comprehension and part perception, with the former providing general intuitions and the latter serving as expert facts. Both simulation and real-robot experiments show our effectiveness in handling a large variety of articulated objects with diverse language-instructed goals.
doi_str_mv 10.48550/arxiv.2312.01307
format Article
creationdate 2023-12-03
rights http://arxiv.org/licenses/nonexclusive-distrib/1.0
fulltext fulltext_linktorsrc
identifier DOI: 10.48550/arxiv.2312.01307
language eng
recordid cdi_arxiv_primary_2312_01307
source arXiv.org
subjects Computer Science - Computer Vision and Pattern Recognition; Computer Science - Robotics
title SAGE: Bridging Semantic and Actionable Parts for GEneralizable Manipulation of Articulated Objects
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-17T19%3A43%3A32IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-arxiv_GOX&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=SAGE:%20Bridging%20Semantic%20and%20Actionable%20Parts%20for%20GEneralizable%20Manipulation%20of%20Articulated%20Objects&rft.au=Geng,%20Haoran&rft.date=2023-12-03&rft_id=info:doi/10.48550/arxiv.2312.01307&rft_dat=%3Carxiv_GOX%3E2312_01307%3C/arxiv_GOX%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_id=info:pmid/&rfr_iscdi=true