Depth-Relative Self Attention for Monocular Depth Estimation
Format: Article
Language: English
Online Access: Order full text
Abstract: Monocular depth estimation is very challenging because clues to the exact depth are incomplete in a single RGB image. To overcome this limitation, deep neural networks rely on various visual hints such as size, shade, and texture extracted from RGB information. However, we observe that if such hints are exploited too heavily, the network becomes biased toward RGB information without considering the comprehensive view. We propose a novel depth estimation model named RElative Depth Transformer (RED-T) that uses relative depth as guidance in self-attention. Specifically, the model assigns high attention weights to pixels of close depth and low attention weights to pixels of distant depth. As a result, the features of pixels at similar depths become more alike and are thus less prone to misleading visual hints. We show that the proposed model achieves competitive results on monocular depth estimation benchmarks and is less biased toward RGB information. In addition, we propose a novel monocular depth estimation benchmark that limits the observable depth range during training in order to evaluate the robustness of the model to unseen depths.
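
The guidance mechanism described in the abstract can be illustrated with a short sketch. The PyTorch snippet below is not taken from the paper; the additive-bias form, the coarse `rel_depth` map, and the `alpha` scale are assumptions introduced purely to show how attention logits can be steered toward pixels at similar depths.

```python
import torch
import torch.nn.functional as F

def depth_guided_attention(q, k, v, rel_depth, alpha=1.0):
    """Self-attention whose logits are biased by relative depth.

    q, k, v:    (batch, n_pixels, dim) query, key, and value features
    rel_depth:  (batch, n_pixels) coarse relative depth used as guidance
    alpha:      strength of the depth bias (hypothetical hyperparameter)
    """
    dim = q.size(-1)
    # Standard scaled dot-product attention logits, shape (batch, n_pixels, n_pixels).
    logits = q @ k.transpose(-2, -1) / dim ** 0.5

    # Pairwise absolute depth difference between every pair of pixels.
    depth_diff = (rel_depth.unsqueeze(-1) - rel_depth.unsqueeze(-2)).abs()

    # Subtract the scaled difference: pixels at close depths keep high logits,
    # while pixels at distant depths are suppressed before the softmax.
    weights = F.softmax(logits - alpha * depth_diff, dim=-1)
    return weights @ v

# Toy usage on random features and a random relative-depth map.
q = k = v = torch.randn(2, 16, 64)
rel_depth = torch.rand(2, 16)
out = depth_guided_attention(q, k, v, rel_depth)  # (2, 16, 64)
```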
DOI: 10.48550/arxiv.2304.12849