Scene-Aware Activity Program Generation with Language Guidance


Bibliographic Details
Published in: ACM Transactions on Graphics, 2023-12, Vol. 42 (6), p. 1-16, Article 252
Authors: Su, Zejia; Fan, Qingnan; Chen, Xuelin; Van Kaick, Oliver; Huang, Hui; Hu, Ruizhen
Format: Article
Language: English
Subjects: Artificial intelligence; Computer graphics; Computing methodologies
Online access: Full text
Description: We address the problem of scene-aware activity program generation, which requires decomposing a given activity task into instructions that can be sequentially performed within a target scene to complete the activity. While existing methods have shown the ability to generate rational or executable programs, generating programs with both high rationality and executability remains a challenge. Hence, we propose a novel method whose key idea is to explicitly combine the language rationality of a powerful language model with dynamic perception of the target scene where instructions are executed, to generate programs with high rationality and executability. Our method iteratively generates instructions for the activity program. Specifically, a two-branch feature encoder operates on a language-based and a graph-based representation of the current generation progress to extract language features and scene graph features, respectively. These features are then used by a predictor to generate the next instruction in the program. Subsequently, another module performs the predicted action and updates the scene for perception in the next iteration. Extensive evaluations are conducted on the VirtualHome-Env dataset, showing the advantages of our method over previous work. Key algorithmic designs are validated through ablation studies, and results on other types of inputs are also presented to show the generalizability of our method.
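The pipeline described above (encode the generation history and the current scene state, predict the next instruction, execute it, update the scene, then repeat) can be sketched as a simple control loop. The Python sketch below illustrates only that loop structure, under stated assumptions: the function names, the dictionary-based scene-graph format, and the bracketed instruction syntax (loosely modeled on VirtualHome-style programs) are hypothetical stand-ins, and the learned two-branch encoder and predictor are replaced with toy placeholders. This is not the authors' implementation.

```python
from typing import Dict, List, Tuple

# Toy scene graph: object -> {relation: related object}. Hypothetical format.
SceneGraph = Dict[str, Dict[str, str]]


def encode_language(history: str) -> List[str]:
    """Stand-in for the language branch: just tokenizes the history.
    The paper extracts learned language features from a language model."""
    return history.split()


def encode_scene(scene: SceneGraph) -> List[Tuple[str, str, str]]:
    """Stand-in for the graph branch: flattens the scene graph into triples.
    The paper extracts learned scene graph features instead."""
    return [(s, r, o) for s, rels in scene.items() for r, o in rels.items()]


def predict_next(lang_feat: List[str],
                 scene_feat: List[Tuple[str, str, str]],
                 step: int) -> str:
    """Stand-in predictor: replays a canned instruction sequence.
    The paper's predictor fuses both feature streams to pick the instruction."""
    canned = ["[WALK] <kitchen>", "[GRAB] <cup>", "[PUTBACK] <cup> <table>", "<END>"]
    return canned[min(step, len(canned) - 1)]


def execute(scene: SceneGraph, instruction: str) -> SceneGraph:
    """Stand-in executor: a real system applies the action's effects so the
    next iteration perceives the updated scene; here it is a no-op."""
    return scene


def generate_program(task: str, scene: SceneGraph, max_steps: int = 20) -> List[str]:
    """Iteratively generate an activity program for the target scene."""
    program: List[str] = []
    history = task  # language-based representation of generation progress
    for step in range(max_steps):
        lang_feat = encode_language(history)   # language branch
        scene_feat = encode_scene(scene)       # scene-graph branch
        instruction = predict_next(lang_feat, scene_feat, step)
        if instruction == "<END>":
            break
        program.append(instruction)
        history += " " + instruction
        scene = execute(scene, instruction)    # scene update for next iteration
    return program


if __name__ == "__main__":
    scene: SceneGraph = {"agent": {"in": "livingroom"}, "cup": {"on": "counter"}}
    print(generate_program("drink water", scene))
    # -> ['[WALK] <kitchen>', '[GRAB] <cup>', '[PUTBACK] <cup> <table>']
```

Executing each predicted action before the next prediction is what the description credits for executability: the predictor always conditions on the scene as it actually is after the earlier instructions, rather than on the initial state alone.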
DOI: 10.1145/3618338
ISSN: 0730-0301
EISSN: 1557-7368
Publisher: ACM, New York, NY, USA
Source: ACM Digital Library Complete