Multimodal Sequential Generative Models for Semi-Supervised Language Instruction Following


Bibliographic details
Published in: arXiv.org, 2022-12
Main authors: Akuzawa, Kei; Iwasawa, Yusuke; Matsuo, Yutaka
Format: Article
Language: English
Subjects:
Online access: Full text
container_title arXiv.org
creator Akuzawa, Kei
Iwasawa, Yusuke
Matsuo, Yutaka
description Agents that can follow language instructions are expected to be useful in a variety of situations such as navigation. However, training neural-network-based agents requires numerous paired trajectories and language instructions. This paper proposes using multimodal generative models for semi-supervised learning in instruction-following tasks. The models learn a shared representation of the paired data and enable semi-supervised learning by reconstructing unpaired data through that representation. Key challenges in applying the models to sequence-to-sequence tasks such as instruction following are learning a shared representation of variable-length multimodal data and incorporating attention mechanisms. To address these problems, this paper proposes a novel network architecture that absorbs the difference in the sequence lengths of the multimodal data. In addition, to further improve performance, this paper shows how to combine the generative-model-based approach with an existing semi-supervised method, the speaker-follower model, and proposes a regularization term that improves inference using unpaired trajectories. Experiments in the BabyAI and Room-to-Room (R2R) environments show that the proposed method improves instruction-following performance by leveraging unpaired data, and improves the performance of the speaker-follower model by 2% to 4% in R2R.
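The shared-representation idea in the abstract can be sketched in miniature. Everything below is illustrative, not the paper's actual model: `mean_pool` stands in for the encoder that absorbs length differences between modalities, `decode` for the decoder, and the squared-error terms for the variational objective; none of these names come from the paper.

```python
def mean_pool(seq):
    """Collapse a variable-length sequence of feature vectors into one
    fixed-size vector, absorbing the length mismatch between an
    instruction (a word sequence) and a trajectory (a state sequence)."""
    dim = len(seq[0])
    return [sum(step[i] for step in seq) / len(seq) for i in range(dim)]

def decode(z, length):
    """Toy decoder: broadcast the shared latent back to any target length."""
    return [list(z) for _ in range(length)]

def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def batch_loss(batch):
    """Batch items are (instruction, trajectory) pairs; either side may be
    None for unpaired data.  Paired items align the two modality encodings
    in the shared latent space; unpaired items add a reconstruction term
    through the same latent, which is what makes the training
    semi-supervised."""
    total = 0.0
    for instr, traj in batch:
        if instr is not None and traj is not None:
            # Paired: pull the two encodings together in the shared space.
            total += mse(mean_pool(instr), mean_pool(traj))
        else:
            # Unpaired: autoencode whichever modality is present.
            seq = instr if instr is not None else traj
            z = mean_pool(seq)
            recon = decode(z, len(seq))
            total += sum(mse(r, s) for r, s in zip(recon, seq)) / len(seq)
    return total / len(batch)

paired = ([[1.0, 0.0], [0.0, 1.0]], [[0.5, 0.5]])  # lengths 2 vs. 1
unpaired = (None, [[2.0, 2.0], [2.0, 2.0]])        # trajectory only
print(batch_loss([paired, unpaired]))              # → 0.0
```

In the paper both terms would be proper (variational) likelihoods and the pooling would be a learned sequence encoder with attention; the sketch only shows how one fixed-size latent lets paired alignment and unpaired reconstruction share parameters.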
format Article
identifier EISSN: 2331-8422
ispartof arXiv.org, 2022-12
issn 2331-8422
language eng
recordid cdi_proquest_journals_2760379789
source Free E-Journals
subjects Computer architecture
Language instruction
Machine learning
Neural networks
Performance enhancement
Regularization
Representations
Semi-supervised learning
title Multimodal Sequential Generative Models for Semi-Supervised Language Instruction Following