Multimodal Aggregation Approach for Memory Vision-Voice Indoor Navigation with Meta-Learning

Vision and voice are two vital keys for agents' interaction and learning. In this paper, we present a novel indoor navigation model called Memory Vision-Voice Indoor Navigation (MVV-IN), which receives voice commands and analyzes multimodal information of visual observation in order to enhance robots' environment understanding. We make use of single RGB images taken by a first-view monocular camera. We also apply a self-attention mechanism to keep the agent focusing on key areas. Memory is important for the agent to avoid repeating certain tasks unnecessarily and in order for it to adapt adequately to new scenes; therefore, we make use of meta-learning. We have experimented with various functional features extracted from visual observation. Comparative experiments show that our methods outperform state-of-the-art baselines.
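
The abstract names three technical components (aggregation of voice and vision features, self-attention over the visual observation, and a memory that supports adaptation) without describing how they connect. The Python sketch below is a minimal, hypothetical illustration of how such components could fit together; the module layout, feature dimensions, and the GRU-based memory are assumptions made for this example, not the authors' MVV-IN implementation.

# Illustrative sketch only (not the authors' released MVV-IN code): it assumes a
# ResNet-style visual feature map, a pre-computed voice-command embedding, and a
# GRU cell standing in for the agent's memory, to show how the pieces could fit.
import torch
import torch.nn as nn

class MultimodalAggregator(nn.Module):
    """Aggregates first-view RGB features and a voice-command embedding,
    using self-attention over spatial regions and a recurrent memory state."""

    def __init__(self, visual_dim=512, command_dim=128, hidden_dim=256):
        super().__init__()
        self.query = nn.Linear(visual_dim, hidden_dim)   # region features -> attention queries
        self.key = nn.Linear(visual_dim, hidden_dim)     # region features -> attention keys
        self.value = nn.Linear(visual_dim, hidden_dim)   # region features -> attention values
        self.memory = nn.GRUCell(hidden_dim + command_dim, hidden_dim)  # stand-in memory

    def forward(self, feature_map, command_emb, prev_state):
        # feature_map: (B, visual_dim, H, W) features of one monocular RGB frame
        # command_emb: (B, command_dim) embedding of the spoken instruction
        # prev_state:  (B, hidden_dim) memory carried over from the previous step
        regions = feature_map.flatten(2).transpose(1, 2)            # (B, H*W, visual_dim)
        q, k, v = self.query(regions), self.key(regions), self.value(regions)
        attn = torch.softmax(q @ k.transpose(1, 2) / (k.shape[-1] ** 0.5), dim=-1)
        attended = (attn @ v).mean(dim=1)                           # pool attended regions
        fused = torch.cat([attended, command_emb], dim=-1)          # vision + voice aggregation
        return self.memory(fused, prev_state)                       # updated memory state

# Hypothetical usage: one navigation step on a single 7x7 feature map.
agg = MultimodalAggregator()
state = agg(torch.randn(1, 512, 7, 7), torch.randn(1, 128), torch.zeros(1, 256))

For the meta-learning component, the abstract only states that it helps the agent adapt to new scenes. The sketch below uses a Reptile-style first-order update, chosen purely for compactness rather than because it matches the paper; the task batches, the surrogate loss, and the learning rates are all hypothetical.

# Illustrative first-order meta-learning sketch (not the authors' procedure):
# adapt a copy of the policy on a few steps from a new scene, then move the
# shared initialization toward the adapted weights.
import copy

def reptile_step(policy, scene_batches, inner_lr=1e-2, outer_lr=1e-1):
    adapted = copy.deepcopy(policy)                        # per-scene fast weights
    inner_opt = torch.optim.SGD(adapted.parameters(), lr=inner_lr)
    for feats, commands, prev, targets in scene_batches:   # a few steps in one new scene
        # targets is a stand-in supervised signal; the real navigation loss is out of scope here
        loss = torch.nn.functional.mse_loss(adapted(feats, commands, prev), targets)
        inner_opt.zero_grad()
        loss.backward()
        inner_opt.step()
    with torch.no_grad():                                  # outer update of the shared init
        for p, q in zip(policy.parameters(), adapted.parameters()):
            p.add_(outer_lr * (q - p))

# Hypothetical usage with the aggregator above and a fabricated one-scene batch.
policy = MultimodalAggregator()
batch = [(torch.randn(4, 512, 7, 7), torch.randn(4, 128), torch.zeros(4, 256), torch.zeros(4, 256))]
reptile_step(policy, batch)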

Bibliographic Details
Main Authors: Yan, Liqi; Liu, Dongfang; Song, Yaoxian; Yu, Changbin
Format: Article
Language: English
Subjects: Computer Science - Computer Vision and Pattern Recognition
DOI: 10.48550/arxiv.2009.00402
Published: 2020-09-01
Source: arXiv.org
Online Access: https://arxiv.org/abs/2009.00402