Vision-Dialog Navigation by Exploring Cross-modal Memory


Bibliographic details
Published in: arXiv.org, 2020-03
Main authors: Zhu, Yi, Zhu, Fengda, Zhan, Zhaohuan, Lin, Bingqian, Jiao, Jianbin, Chang, Xiaojun, Liang, Xiaodan
Format: Article
Language: English
Subjects: Decision making; Language; Learning; Modules; Navigation; Vision
Online access: Full text
container_title arXiv.org
creator Zhu, Yi
Zhu, Fengda
Zhan, Zhaohuan
Lin, Bingqian
Jiao, Jianbin
Chang, Xiaojun
Liang, Xiaodan
description Vision-dialog navigation, posed as a new holy-grail task in the vision-language field, targets learning an agent capable of constantly conversing for help in natural language and navigating according to the human responses. Beyond the common challenges of vision-language navigation, vision-dialog navigation also requires handling the language intentions of a series of questions, the temporal context of the dialog history, and co-reasoning over both dialogs and visual scenes. In this paper, we propose the Cross-modal Memory Network (CMN) for remembering and understanding the rich information relevant to historical navigation actions. CMN consists of two memory modules: the language memory module (L-mem) and the visual memory module (V-mem). Specifically, L-mem learns latent relationships between the current language interaction and the dialog history via a multi-head attention mechanism. V-mem learns to associate the current visual views with a cross-modal memory of the previous navigation actions, where the cross-modal memory is generated by a vision-to-language attention and a language-to-vision attention. Through the collaborative learning of L-mem and V-mem, CMN exploits the memory of past navigation decisions to inform the decision at the current step. Experiments on the CVDN dataset show that CMN outperforms the previous state-of-the-art model by a significant margin in both seen and unseen environments. (A hypothetical sketch of the two memory modules appears after the record fields below.)
format Article
fulltext fulltext
identifier EISSN: 2331-8422
ispartof arXiv.org, 2020-03
issn 2331-8422
language eng
recordid cdi_proquest_journals_2377808439
source Free E-Journals
subjects Decision making
Language
Learning
Modules
Navigation
Vision
title Vision-Dialog Navigation by Exploring Cross-modal Memory
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-11T18%3A10%3A15IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest&rft_val_fmt=info:ofi/fmt:kev:mtx:book&rft.genre=document&rft.atitle=Vision-Dialog%20Navigation%20by%20Exploring%20Cross-modal%20Memory&rft.jtitle=arXiv.org&rft.au=Zhu,%20Yi&rft.date=2020-03-15&rft.eissn=2331-8422&rft_id=info:doi/&rft_dat=%3Cproquest%3E2377808439%3C/proquest%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=2377808439&rft_id=info:pmid/&rfr_iscdi=true
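
The description above outlines the CMN architecture in enough detail to sketch its two memory modules. Below is a minimal, hypothetical PyTorch sketch, assuming 512-dimensional features and standard multi-head attention; the class names, tensor shapes, and the final view-scoring step are illustrative assumptions, not the authors' released implementation.

```python
# Hypothetical sketch of the two CMN memory modules described in the
# abstract. All names, dimensions, and the wiring of the attention
# layers are assumptions for illustration, not the authors' code.
import torch
import torch.nn as nn


class LMem(nn.Module):
    """Language memory: relates the current question-answer exchange to
    the dialog history with multi-head attention."""

    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, current_qa: torch.Tensor, history: torch.Tensor) -> torch.Tensor:
        # current_qa: (B, 1, dim) encoding of the latest exchange
        # history:    (B, T, dim) encodings of past dialog turns
        out, _ = self.attn(query=current_qa, key=history, value=history)
        return out  # (B, 1, dim) history-aware language memory


class VMem(nn.Module):
    """Visual memory: builds a cross-modal memory from vision-to-language
    and language-to-vision attention, then scores candidate views."""

    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        self.v2l = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.l2v = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, views: torch.Tensor, lang_mem: torch.Tensor,
                past_views: torch.Tensor) -> torch.Tensor:
        # views:      (B, N, dim) features of the N candidate views
        # lang_mem:   (B, 1, dim) output of LMem
        # past_views: (B, T, dim) features from previous navigation steps
        v2l_mem, _ = self.v2l(query=past_views, key=lang_mem, value=lang_mem)
        xmem, _ = self.l2v(query=lang_mem, key=v2l_mem, value=v2l_mem)
        # dot-product score of each candidate view against the memory
        return torch.bmm(views, xmem.transpose(1, 2)).squeeze(-1)  # (B, N)
```

In a full agent, the (B, N) scores would presumably feed a softmax over navigable directions at each step; this fragment only illustrates how the two attention directions named in the abstract could compose, not how the paper trains or decodes actions.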