VMLoc: Variational Fusion For Learning-Based Multimodal Camera Localization

Detailed Description

Recent learning-based approaches have achieved impressive results in the field of single-shot camera localization. However, how best to fuse multiple modalities (e.g., image and depth) and to deal with degraded or missing input are less well studied. In particular, we note that previous approaches towards deep fusion do not perform significantly better than models employing a single modality. We conjecture that this is because of the naive approaches to feature space fusion through summation or concatenation which do not take into account the different strengths of each modality. To address this, we propose an end-to-end framework, termed VMLoc, to fuse different sensor inputs into a common latent space through a variational Product-of-Experts (PoE) followed by attention-based fusion. Unlike previous multimodal variational works directly adapting the objective function of vanilla variational auto-encoder, we show how camera localization can be accurately estimated through an unbiased objective function based on importance weighting. Our model is extensively evaluated on RGB-D datasets and the results prove the efficacy of our model. The source code is available at https://github.com/kaichen-z/VMLoc.
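
As a reading aid for the method the abstract sketches, below is a minimal illustration of Gaussian Product-of-Experts (PoE) fusion: each modality encoder emits a Gaussian posterior over a shared latent space, and the experts are combined in closed form by precision-weighted averaging, so a degraded or missing modality can be handled by simply dropping its expert. This is an assumption-laden sketch, not the authors' implementation (see the linked repository for that); the function name, shapes, the standard-normal prior expert, and the omission of the attention-based refinement stage are all ours.

import torch

def product_of_experts(mus, logvars, eps=1e-8):
    """Fuse per-modality Gaussian posteriors N(mu_i, var_i) with a
    standard-normal prior expert, all over a shared (B, D) latent space.
    Returns the fused mean and log-variance."""
    # The prior expert N(0, I) keeps the product well defined even when
    # every modality expert has been dropped.
    mus = [torch.zeros_like(mus[0])] + list(mus)
    logvars = [torch.zeros_like(logvars[0])] + list(logvars)
    # Each expert contributes its precision (inverse variance)...
    precisions = [1.0 / (lv.exp() + eps) for lv in logvars]
    fused_var = 1.0 / sum(precisions)
    # ...and the fused mean is the precision-weighted mean of the experts.
    fused_mu = fused_var * sum(m * p for m, p in zip(mus, precisions))
    return fused_mu, fused_var.log()

# Hypothetical usage: fuse image and depth posteriors; if depth is
# missing, pass only the image expert.
mu_i, lv_i = torch.randn(4, 128), torch.randn(4, 128)
mu_d, lv_d = torch.randn(4, 128), torch.randn(4, 128)
mu, logvar = product_of_experts([mu_i, mu_d], [lv_i, lv_d])
z = mu + (0.5 * logvar).exp() * torch.randn_like(mu)  # reparameterized sample

The "unbiased objective function based on importance weighting" is, on a natural reading, a K-sample importance-weighted bound in the style of IWAE rather than the vanilla VAE evidence lower bound (our gloss of the abstract, not a formula quoted from the paper):

\mathcal{L}_K = \mathbb{E}_{z_1,\dots,z_K \sim q_\phi(z \mid x)} \left[ \log \frac{1}{K} \sum_{k=1}^{K} \frac{p_\theta(y, z_k \mid x)}{q_\phi(z_k \mid x)} \right]

where x is the multimodal input, y the camera pose, q_\phi the fused PoE posterior, and p_\theta the pose likelihood and decoder. The bound tightens as K grows, which is what reduces the bias of the vanilla objective recovered at K = 1.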

Bibliographic Details
Published in: arXiv.org, 2023-06
Main Authors: Zhou, Kaichen; Chen, Changhao; Wang, Bing; Muhamad Risqi U Saputra; Trigoni, Niki; Markham, Andrew
Format: Article
Language: English
EISSN: 2331-8422
Subjects: Cameras; Coders; Importance sampling; Learning; Variational geometry
Online Access: Full text