SAME: Learning Generic Language-Guided Visual Navigation with State-Adaptive Mixture of Experts
Saved in:

Main Authors: | Zhou, Gengze; Hong, Yicong; Wang, Zun; Zhao, Chongyang; Bansal, Mohit; Wu, Qi |
---|---|
Format: | Article |
Language: | eng |
Subjects: | Computer Science - Artificial Intelligence; Computer Science - Computation and Language; Computer Science - Computer Vision and Pattern Recognition; Computer Science - Learning; Computer Science - Robotics |
Online Access: | Order full text |
creator | Zhou, Gengze; Hong, Yicong; Wang, Zun; Zhao, Chongyang; Bansal, Mohit; Wu, Qi |
description | The academic field of learning instruction-guided visual navigation can be
generally categorized into high-level category-specific search and low-level
language-guided navigation, depending on the granularity of language
instruction, in which the former emphasizes the exploration process, while the
latter concentrates on following detailed textual commands. Despite the
differing focuses of these tasks, the underlying requirements of interpreting
instructions, comprehending the surroundings, and inferring action decisions
remain consistent. This paper consolidates diverse navigation tasks into a
unified and generic framework -- we investigate the core difficulties of
sharing general knowledge and exploiting task-specific capabilities in learning
navigation and propose a novel State-Adaptive Mixture of Experts (SAME) model
that effectively enables an agent to infer decisions based on
different-granularity language and dynamic observations. Powered by SAME, we
present a versatile agent capable of addressing seven navigation tasks
simultaneously that outperforms or achieves highly comparable performance to
task-specific agents. |
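The State-Adaptive Mixture of Experts named in the abstract can be illustrated with a generic state-conditioned MoE sketch. This is not the authors' implementation; the class and parameter names (`StateAdaptiveMoE`, `ExpertMLP`, `d_state`, `n_experts`) are all illustrative assumptions, showing only the basic idea of a router that weighs expert outputs based on the agent's current state embedding.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - x.max())
    return e / e.sum()

class ExpertMLP:
    """One expert: a single-hidden-layer MLP (illustrative)."""
    def __init__(self, d_in, d_hid, d_out):
        self.w1 = rng.normal(scale=0.1, size=(d_in, d_hid))
        self.w2 = rng.normal(scale=0.1, size=(d_hid, d_out))

    def __call__(self, x):
        return np.tanh(x @ self.w1) @ self.w2

class StateAdaptiveMoE:
    """Generic state-conditioned mixture of experts: a linear router
    computes soft routing weights from the current state embedding,
    and the output is the weighted mix of the expert outputs."""
    def __init__(self, d_state, d_out, n_experts=4):
        self.router = rng.normal(scale=0.1, size=(d_state, n_experts))
        self.experts = [ExpertMLP(d_state, 32, d_out) for _ in range(n_experts)]

    def __call__(self, state):
        gate = softmax(state @ self.router)               # (n_experts,) routing weights
        outs = np.stack([e(state) for e in self.experts])  # (n_experts, d_out)
        return gate @ outs                                 # weighted expert mix, (d_out,)

# Toy usage: a 16-d state embedding routed over 4 experts to 8 action logits.
moe = StateAdaptiveMoE(d_state=16, d_out=8)
action_logits = moe(rng.normal(size=16))
```

In the paper's setting the routing is described as state-adaptive, i.e. conditioned on the agent's dynamic observations rather than fixed per task; the sketch above captures only that conditioning pattern, not the actual architecture.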
doi_str_mv | 10.48550/arxiv.2412.05552 |
format | Article |
fulltext | fulltext_linktorsrc |
identifier | DOI: 10.48550/arxiv.2412.05552 |
language | eng |
recordid | cdi_arxiv_primary_2412_05552 |
source | arXiv.org |
subjects | Computer Science - Artificial Intelligence; Computer Science - Computation and Language; Computer Science - Computer Vision and Pattern Recognition; Computer Science - Learning; Computer Science - Robotics |
title | SAME: Learning Generic Language-Guided Visual Navigation with State-Adaptive Mixture of Experts |