Uni-DPM: Unifying Self-Supervised Monocular Depth, Pose, and Object Motion Estimation with a Shared Representation

Bibliographic Details
Published in:	IEEE transactions on multimedia 2025-01, p.1-15
Main authors:	Wu, Guanghui; Chen, Lili; Chen, Zengping
Format:	Article
Language:	English
Subjects:	Cameras; Computer vision; Depth measurement; Estimation; Geometry; Image motion analysis; monocular depth estimation; Motion segmentation; odometry; Optical flow; scene flow estimation; Self-supervised learning; Three-dimensional displays; Videos

Description
Abstract:	Self-supervised monocular depth estimation has been widely studied for 3D perception, as it can infer depth, pose, and object motion from monocular videos. However, existing single-view and multi-view methods employ separate networks to learn specific representations for these different tasks. This not only results in a cumbersome model architecture but also limits the representation capacity. In this paper, we revisit previous methods and draw the following insights: (1) these three tasks are reciprocal and all depend on matching information; and (2) different representations carry complementary information. Based on these insights, we propose Uni-DPM, a compact self-supervised framework that completes these three tasks with a shared representation. Specifically, we introduce a U-Net-like model to synchronously complete multiple tasks by leveraging their common dependence on matching information, and iteratively refine the predictions by utilizing the reciprocity among tasks. Furthermore, we design a shared Appearance-Matching-Temporal (AMT) representation for these three tasks by exploiting the complementarity among different types of information. In addition, our Uni-DPM is scalable to downstream tasks, including scene flow, optical flow, and motion segmentation. Comparative experiments demonstrate the competitiveness of our Uni-DPM on these tasks, while ablation experiments also verify our insights.
ISSN:	1520-9210 1941-0077
DOI:	10.1109/TMM.2024.3521846