Scene-Aware Activity Program Generation with Language Guidance


Bibliographic Details
Published in: ACM Transactions on Graphics, 2023-12, Vol. 42 (6), p. 1-16, Article 252
Authors: Su, Zejia; Fan, Qingnan; Chen, Xuelin; Van Kaick, Oliver; Huang, Hui; Hu, Ruizhen
Format: Article
Language: English
Subjects: Artificial intelligence; Computer graphics; Computing methodologies
Online access: Full text
Description: We address the problem of scene-aware activity program generation, which requires decomposing a given activity task into instructions that can be sequentially performed within a target scene to complete the activity. While existing methods have shown the ability to generate rational or executable programs, generating programs with both high rationality and executability remains a challenge. Hence, we propose a novel method whose key idea is to explicitly combine the language rationality of a powerful language model with dynamic perception of the target scene where instructions are executed, to generate programs with high rationality and executability. Our method iteratively generates instructions for the activity program. Specifically, a two-branch feature encoder operates on a language-based and a graph-based representation of the current generation progress to extract language features and scene graph features, respectively. These features are then used by a predictor to generate the next instruction in the program. Subsequently, another module performs the predicted action and updates the scene for perception in the next iteration. Extensive evaluations are conducted on the VirtualHome-Env dataset, showing the advantages of our method over previous work. Key algorithmic designs are validated through ablation studies, and results on other types of inputs are also presented to show the generalizability of our method.
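The pipeline described above (encode the generation history and the current scene state, predict the next instruction, execute it, update the scene, then repeat) can be sketched as a simple control loop. The Python sketch below illustrates only that loop structure, under stated assumptions: the function names, the dictionary-based scene-graph format, and the bracketed instruction syntax (loosely modeled on VirtualHome-style programs) are hypothetical stand-ins, and the learned two-branch encoder and predictor are replaced with toy placeholders. This is not the authors' implementation.

```python
from typing import Dict, List, Tuple

# Toy scene graph: object -> {relation: related object}. Hypothetical format.
SceneGraph = Dict[str, Dict[str, str]]


def encode_language(history: str) -> List[str]:
    """Stand-in for the language branch: just tokenizes the history.
    The paper extracts learned language features from a language model."""
    return history.split()


def encode_scene(scene: SceneGraph) -> List[Tuple[str, str, str]]:
    """Stand-in for the graph branch: flattens the scene graph into triples.
    The paper extracts learned scene graph features instead."""
    return [(s, r, o) for s, rels in scene.items() for r, o in rels.items()]


def predict_next(lang_feat: List[str],
                 scene_feat: List[Tuple[str, str, str]],
                 step: int) -> str:
    """Stand-in predictor: replays a canned instruction sequence.
    The paper's predictor fuses both feature streams to pick the instruction."""
    canned = ["[WALK] <kitchen>", "[GRAB] <cup>", "[PUTBACK] <cup> <table>", "<END>"]
    return canned[min(step, len(canned) - 1)]


def execute(scene: SceneGraph, instruction: str) -> SceneGraph:
    """Stand-in executor: a real system applies the action's effects so the
    next iteration perceives the updated scene; here it is a no-op."""
    return scene


def generate_program(task: str, scene: SceneGraph, max_steps: int = 20) -> List[str]:
    """Iteratively generate an activity program for the target scene."""
    program: List[str] = []
    history = task  # language-based representation of generation progress
    for step in range(max_steps):
        lang_feat = encode_language(history)   # language branch
        scene_feat = encode_scene(scene)       # scene-graph branch
        instruction = predict_next(lang_feat, scene_feat, step)
        if instruction == "<END>":
            break
        program.append(instruction)
        history += " " + instruction
        scene = execute(scene, instruction)    # scene update for next iteration
    return program


if __name__ == "__main__":
    scene: SceneGraph = {"agent": {"in": "livingroom"}, "cup": {"on": "counter"}}
    print(generate_program("drink water", scene))
    # -> ['[WALK] <kitchen>', '[GRAB] <cup>', '[PUTBACK] <cup> <table>']
```

Executing each predicted action before the next prediction is what the description credits for executability: the predictor always conditions on the scene as it actually is after the earlier instructions, rather than on the initial state alone.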
DOI: 10.1145/3618338
ISSN: 0730-0301
EISSN: 1557-7368
Publisher: ACM, New York, NY, USA
Source: ACM Digital Library Complete