EDFIDepth: enriched multi-path vision transformer feature interaction networks for monocular depth estimation

Bibliographic Details
Published in: The Journal of supercomputing, 2024-09, Vol. 80 (14), p. 21023-21047
Main authors: Xia, Chenxing; Zhang, Mengge; Gao, Xiuju; Ge, Bin; Li, Kuan-Ching; Fang, Xianjin; Zhang, Yan; Liang, Xingzhu
Format: Article
Language: English
Online access: Full text
Abstract: Monocular depth estimation (MDE) aims to predict pixel-level dense depth maps from a single RGB image. Some recent approaches mainly rely on encoder–decoder architectures to capture and process multi-scale features. However, they usually exploit heavier networks, at the expense of increased computational cost, to obtain high-quality depth maps. In this paper, we propose a novel enriched multi-path vision transformer feature interaction network with an encoder–decoder architecture, denoted EDFIDepth, which seeks a balance between computational cost and performance rather than pursuing the highest accuracy or extremely lightweight models. Specifically, an encoder called MPViT-D, incorporating a multi-path vision transformer and a deep convolution module, is introduced to extract diverse features with both fine and coarse details at the same feature level with fewer parameters. Subsequently, we propose a lightweight decoder comprising two effective modules to establish multi-scale feature interaction: an encoder–decoder cross-feature matching (ED-CFM) module and a channel-level feature fusion (CLFF) module. The ED-CFM module establishes connections between encoder and decoder features through a dual-path structure, where a cross-attention mechanism is deployed to enhance the relevance of multi-scale complementary depth information. Meanwhile, the CLFF module utilizes a channel attention mechanism to further fuse crucial depth information within the channels, thereby improving the accuracy of depth estimation. Extensive experiments on the indoor dataset NYUv2 and the outdoor dataset KITTI demonstrate that our method can achieve comparable state-of-the-art (SOTA) results while significantly reducing the number of trainable parameters. Our codes and approach are available at https://github.com/Zhangmg123/EDFIDEpth.
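The core idea of the ED-CFM module, as the abstract describes it, is cross-attention between encoder and decoder features: decoder features act as queries attending over encoder features as keys/values. A minimal plain-Python sketch of that mechanism follows; function and variable names are hypothetical, and the authors' actual implementation (with multi-head attention, projections, and the dual-path structure) lives in their repository.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def cross_attention(queries, keys, values):
    """Scaled dot-product cross-attention.

    queries: decoder-side feature vectors (list of lists)
    keys/values: encoder-side feature vectors
    Each query attends over all encoder keys; the output is the
    attention-weighted sum of the encoder values.
    """
    d = len(keys[0])  # key dimension, used for scaling
    out = []
    for q in queries:
        # Similarity of this decoder query to every encoder key
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys]
        weights = softmax(scores)
        # Weighted combination of encoder values
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out

# One decoder query that is more similar to the first encoder key,
# so the output leans toward the first encoder value.
fused = cross_attention(queries=[[1.0, 0.0]],
                        keys=[[1.0, 0.0], [0.0, 1.0]],
                        values=[[2.0], [4.0]])
```

In the full model, such fused features would then pass through the CLFF module, which reweights channels (channel attention) before depth prediction.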
ISSN: 0920-8542, 1573-0484
DOI: 10.1007/s11227-024-06205-7