Multimodal Aggregation Approach for Memory Vision-Voice Indoor Navigation with Meta-Learning

Vision and voice are two vital keys for agents' interaction and learning. In this paper, we present a novel indoor navigation model called Memory Vision-Voice Indoor Navigation (MVV-IN), which receives voice commands and analyzes multimodal information of visual observation in order to enhance robots' environment understanding. We make use of single RGB images taken by a first-view monocular camera. We also apply a self-attention mechanism to keep the agent focusing on key areas. Memory is important for the agent to avoid repeating certain tasks unnecessarily and in order for it to adapt adequately to new scenes; therefore, we make use of meta-learning. We have experimented with various functional features extracted from visual observation. Comparative experiments show that our methods outperform state-of-the-art baselines.
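
The abstract names three technical components (aggregation of voice and vision features, self-attention over the visual observation, and a memory that supports adaptation) without describing how they connect. The Python sketch below is a minimal, hypothetical illustration of how such components could fit together; the module layout, feature dimensions, and the GRU-based memory are assumptions made for this example, not the authors' MVV-IN implementation.

# Illustrative sketch only (not the authors' released MVV-IN code): it assumes a
# ResNet-style visual feature map, a pre-computed voice-command embedding, and a
# GRU cell standing in for the agent's memory, to show how the pieces could fit.
import torch
import torch.nn as nn

class MultimodalAggregator(nn.Module):
    """Aggregates first-view RGB features and a voice-command embedding,
    using self-attention over spatial regions and a recurrent memory state."""

    def __init__(self, visual_dim=512, command_dim=128, hidden_dim=256):
        super().__init__()
        self.query = nn.Linear(visual_dim, hidden_dim)   # region features -> attention queries
        self.key = nn.Linear(visual_dim, hidden_dim)     # region features -> attention keys
        self.value = nn.Linear(visual_dim, hidden_dim)   # region features -> attention values
        self.memory = nn.GRUCell(hidden_dim + command_dim, hidden_dim)  # stand-in memory

    def forward(self, feature_map, command_emb, prev_state):
        # feature_map: (B, visual_dim, H, W) features of one monocular RGB frame
        # command_emb: (B, command_dim) embedding of the spoken instruction
        # prev_state:  (B, hidden_dim) memory carried over from the previous step
        regions = feature_map.flatten(2).transpose(1, 2)            # (B, H*W, visual_dim)
        q, k, v = self.query(regions), self.key(regions), self.value(regions)
        attn = torch.softmax(q @ k.transpose(1, 2) / (k.shape[-1] ** 0.5), dim=-1)
        attended = (attn @ v).mean(dim=1)                           # pool attended regions
        fused = torch.cat([attended, command_emb], dim=-1)          # vision + voice aggregation
        return self.memory(fused, prev_state)                       # updated memory state

# Hypothetical usage: one navigation step on a single 7x7 feature map.
agg = MultimodalAggregator()
state = agg(torch.randn(1, 512, 7, 7), torch.randn(1, 128), torch.zeros(1, 256))

For the meta-learning component, the abstract only states that it helps the agent adapt to new scenes. The sketch below uses a Reptile-style first-order update, chosen purely for compactness rather than because it matches the paper; the task batches, the surrogate loss, and the learning rates are all hypothetical.

# Illustrative first-order meta-learning sketch (not the authors' procedure):
# adapt a copy of the policy on a few steps from a new scene, then move the
# shared initialization toward the adapted weights.
import copy

def reptile_step(policy, scene_batches, inner_lr=1e-2, outer_lr=1e-1):
    adapted = copy.deepcopy(policy)                        # per-scene fast weights
    inner_opt = torch.optim.SGD(adapted.parameters(), lr=inner_lr)
    for feats, commands, prev, targets in scene_batches:   # a few steps in one new scene
        # targets is a stand-in supervised signal; the real navigation loss is out of scope here
        loss = torch.nn.functional.mse_loss(adapted(feats, commands, prev), targets)
        inner_opt.zero_grad()
        loss.backward()
        inner_opt.step()
    with torch.no_grad():                                  # outer update of the shared init
        for p, q in zip(policy.parameters(), adapted.parameters()):
            p.add_(outer_lr * (q - p))

# Hypothetical usage with the aggregator above and a fabricated one-scene batch.
policy = MultimodalAggregator()
batch = [(torch.randn(4, 512, 7, 7), torch.randn(4, 128), torch.zeros(4, 256), torch.zeros(4, 256))]
reptile_step(policy, batch)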

Bibliographic Details
Main Authors: Yan, Liqi; Liu, Dongfang; Song, Yaoxian; Yu, Changbin
Format: Article
Language: English
Subjects: Computer Science - Computer Vision and Pattern Recognition
DOI: 10.48550/arxiv.2009.00402
Published: 2020-09-01
Source: arXiv.org
Online Access: https://arxiv.org/abs/2009.00402