Vision-and-Language Navigation via Causal Learning

In the pursuit of robust and generalizable environment perception and language understanding, the ubiquitous challenge of dataset bias continues to plague vision-and-language navigation (VLN) agents, hindering their performance in unseen environments. This paper introduces the generalized cross-modal causal transformer (GOAT), a pioneering solution rooted in the paradigm of causal inference. By delving into both observable and unobservable confounders within vision, language, and history, we propose the back-door and front-door adjustment causal learning (BACL and FACL) modules to promote unbiased learning by comprehensively mitigating potential spurious correlations. Additionally, to capture global confounder features, we propose a cross-modal feature pooling (CFP) module supervised by contrastive learning, which is also shown to be effective in improving cross-modal representations during pre-training. Extensive experiments across multiple VLN datasets (R2R, REVERIE, RxR, and SOON) underscore the superiority of our proposed method over previous state-of-the-art approaches. Code is available at https://github.com/CrystalSixone/VLN-GOAT.
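
For context, the back-door and front-door adjustments referenced in the abstract are the standard causal-inference interventions; the expressions below are the general textbook forms, not the paper's specific cross-modal instantiation:

Back-door adjustment (Z an observed confounder):
P(Y \mid do(X)) = \sum_{z} P(Y \mid X, Z = z)\, P(Z = z)

Front-door adjustment (M a mediator between X and Y):
P(Y \mid do(X)) = \sum_{m} P(M = m \mid X) \sum_{x'} P(Y \mid X = x', M = m)\, P(X = x')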

Bibliographic Details
Published in: arXiv.org, 2024-04
Main authors: Wang, Liuyi; He, Zongtao; Dang, Ronghao; Shen, Mengjiao; Liu, Chengju; Chen, Qijun
Format: Article
Language: English
EISSN: 2331-8422
Publisher: Cornell University Library, arXiv.org (Ithaca)
Subjects: Datasets; Learning; Modules; Navigation
Online access: Full text