FS-Depth: Focal-and-Scale Depth Estimation From a Single Image in Unseen Indoor Scene

Bibliographic Details
Published in: IEEE Transactions on Circuits and Systems for Video Technology, 2024-11, Vol. 34 (11), pp. 10604-10617
Main Authors: Wei, Chengrui; Yang, Meng; He, Lei; Zheng, Nanning
Format: Article
Language: English
Description
Abstract: Predicting absolute depth maps from single images in unseen scenes has long been an ill-posed problem. We observe that this is due not only to the scale-ambiguity problem but, more importantly, to the focal-ambiguity problem, which degrades the generalization ability of monocular depth estimation: images may be captured by cameras of different focal lengths in scenes of different scales. In this paper, we develop a focal-and-scale depth estimation model that learns absolute depth maps from single images in unseen indoor scenes. First, a relative depth estimation network learns relative depths from single images of diverse scales. Second, multi-scale features are generated by mapping a single focal-length value to focal-length features and concatenating them with the intermediate features of different scales from relative depth estimation. Finally, the relative depths and multi-scale features are jointly fed into an absolute depth estimation network. A dual-directional alignment strategy enables the model to be trained well on either a single dataset or a mixed dataset with diverse focal lengths and scene scales. In addition, a new pipeline augments the diversity of focal lengths in public datasets, which are often captured with cameras of the same or similar focal lengths. Experiments verify that our model trained on NYUDv2 significantly improves the generalization ability of monocular depth estimation, by 32%/14% (RMSE) on three unseen datasets with/without data augmentation compared with state-of-the-art (SOTA) baselines, and substantially alleviates the deformation of depth maps in 3D views. Generalization improves by a further 16% when the model is trained on a mixture of NYUDv2 and SUNRGBD. Our model also maintains SOTA accuracy when trained and tested on NYUDv2, like existing models. The code is released at https://github.com/wcrwcrwcr/FS-Depth-v1.
ISSN: 1051-8215; 1558-2205
DOI: 10.1109/TCSVT.2024.3411688
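
The abstract above describes mapping a single focal-length value to focal-length features and concatenating them with intermediate features at multiple scales. Below is a minimal PyTorch sketch of that conditioning idea; the module name `FocalConditioning`, the MLP design, the channel sizes, and the focal-length normalization are illustrative assumptions, not the authors' released implementation (see the linked repository for that).

```python
# Minimal sketch (not the authors' code): condition a feature map on a
# scalar focal length by embedding it with a small MLP, broadcasting the
# embedding spatially, and concatenating it onto the feature channels.
import torch
import torch.nn as nn

class FocalConditioning(nn.Module):
    """Hypothetical per-scale block: fuse a focal-length embedding into an
    intermediate feature map of the relative depth network."""
    def __init__(self, feat_channels: int, focal_dim: int = 32):
        super().__init__()
        # Small MLP embedding the (normalized) scalar focal length.
        self.focal_mlp = nn.Sequential(
            nn.Linear(1, focal_dim), nn.ReLU(),
            nn.Linear(focal_dim, focal_dim),
        )
        # 1x1 conv to fuse the concatenated channels back to feat_channels.
        self.fuse = nn.Conv2d(feat_channels + focal_dim, feat_channels, kernel_size=1)

    def forward(self, feat: torch.Tensor, focal: torch.Tensor) -> torch.Tensor:
        # feat: (B, C, H, W); focal: (B, 1), focal length after normalization.
        b, _, h, w = feat.shape
        f = self.focal_mlp(focal)                      # (B, focal_dim)
        f = f[:, :, None, None].expand(b, -1, h, w)    # broadcast over H x W
        return self.fuse(torch.cat([feat, f], dim=1))  # (B, C, H, W)

# Usage: one conditioning block per decoder scale, each fed the same focal value.
if __name__ == "__main__":
    block = FocalConditioning(feat_channels=64)
    feat = torch.randn(2, 64, 30, 40)                  # intermediate features
    focal = torch.tensor([[518.8], [600.0]]) / 1000.0  # crude normalization (assumed)
    out = block(feat, focal)
    print(out.shape)  # torch.Size([2, 64, 30, 40])
```

Applying such a block at each decoder scale would yield the focal-aware multi-scale features the abstract describes, which are then passed, together with the relative depths, to the absolute depth estimation network.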