Vision-and-Language Navigation via Causal Learning
In the pursuit of robust and generalizable environment perception and language understanding, the ubiquitous challenge of dataset bias continues to plague vision-and-language navigation (VLN) agents, hindering their performance in unseen environments. This paper introduces the generalized cross-modal causal transformer (GOAT), a pioneering solution rooted in the paradigm of causal inference. By delving into both observable and unobservable confounders within vision, language, and history, we propose the back-door and front-door adjustment causal learning (BACL and FACL) modules to promote unbiased learning by comprehensively mitigating potential spurious correlations. Additionally, to capture global confounder features, we propose a cross-modal feature pooling (CFP) module supervised by contrastive learning, which is also shown to be effective in improving cross-modal representations during pre-training. Extensive experiments across multiple VLN datasets (R2R, REVERIE, RxR, and SOON) underscore the superiority of our proposed method over previous state-of-the-art approaches. Code is available at https://github.com/CrystalSixone/VLN-GOAT.
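For context (standard causal-inference background, not material taken from this record): the BACL and FACL modules are named after the back-door and front-door adjustments, which in their generic form remove confounding as follows, where X is the treatment (e.g., the multimodal input), Y the outcome (the navigation decision), Z an observable confounder, and M a mediator:

\[ P(Y \mid do(X)) = \sum_{z} P(Y \mid X, Z = z)\, P(Z = z) \quad \text{(back-door adjustment)} \]

\[ P(Y \mid do(X)) = \sum_{m} P(M = m \mid X) \sum_{x'} P(Y \mid M = m, X = x')\, P(X = x') \quad \text{(front-door adjustment)} \]

How GOAT instantiates Z and M for vision, language, and history features is detailed in the paper itself; the equations above are only the generic adjustments the module names refer to.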
Published in: | arXiv.org 2024-04 |
---|---|
Main Authors: | Wang, Liuyi; He, Zongtao; Dang, Ronghao; Shen, Mengjiao; Liu, Chengju; Chen, Qijun |
Format: | Article |
Language: | English |
Subjects: | Datasets; Learning; Modules; Navigation |
Online Access: | Full text |
creator | Wang, Liuyi; He, Zongtao; Dang, Ronghao; Shen, Mengjiao; Liu, Chengju; Chen, Qijun |
---|---|
format | Article |
identifier | EISSN: 2331-8422 |
ispartof | arXiv.org, 2024-04 |
issn | 2331-8422 |
language | eng |
source | Free E-Journals |
subjects | Datasets; Learning; Modules; Navigation |
title | Vision-and-Language Navigation via Causal Learning |