Think Holistically, Act Down-to-Earth: A Semantic Navigation Strategy With Continuous Environmental Representation and Multi-Step Forward Planning


Bibliographic Details
Published in: IEEE Transactions on Circuits and Systems for Video Technology, May 2024, Vol. 34 (5), pp. 3860-3875
Main Authors: Chen, Bolei; Kang, Jiaxu; Zhong, Ping; Cui, Yongzheng; Lu, Siyi; Liang, Yixiong; Wang, Jianxin
Format: Article
Language: English
Subjects:
Online Access: Order full text
Description
Abstract: The Object Goal Navigation (ObjectNav) task requires an agent to navigate through a previously unknown domestic scenario using spatial and semantic contextual information, where the goal is specified by a semantic label (e.g., find a TV). Such a task is especially challenging, as it requires formulating and understanding the complex co-occurrence relations among objects in diverse settings, which is critical for long-sequence navigational decision-making. Existing methods learn either to explicitly represent co-occurrence relationships as discrete semantic priors or to implicitly encode them from raw observations, and thus cannot benefit from the rich environmental semantics. In this work, we propose a novel Deep Reinforcement Learning (DRL) based ObjectNav strategy that actively imagines spatial and semantic clues outside the agent's Field of View (FoV) and further mines Continuous Environmental Representations (CER) using self-supervised learning. Additionally, the imagined spatial and semantic patterns allow the agent to perform Multi-Step Forward-Looking Planning (MSFLP) by considering the temporal evolution of egocentric local observations. Our approach is thoroughly evaluated and ablated in the visually realistic environments of the Matterport3D (MP3D) dataset. The experimental results show that our method, combining CER and imagination-based MSFLP, facilitates learning complicated semantic priors and navigation skills, thus achieving state-of-the-art performance on the ObjectNav task. In addition, extensive quantitative and qualitative analyses validate the strong generalization ability and superiority of our method.
ISSN: 1051-8215, 1558-2205
DOI: 10.1109/TCSVT.2023.3324380