XVO: Generalized Visual Odometry via Cross-Modal Self-Training
Saved in:

Main authors: | , , , |
---|---|
Format: | Article |
Language: | eng |
Subjects: | |
Online access: | Order full text |
Abstract: | We propose XVO, a semi-supervised learning method for training generalized monocular Visual Odometry (VO) models with robust off-the-shelf operation across diverse datasets and settings. In contrast to standard monocular VO approaches, which often study a known calibration within a single dataset, XVO efficiently learns to recover relative pose with real-world scale from visual scene semantics, i.e., without relying on any known camera parameters. We optimize the motion estimation model via self-training from large amounts of unconstrained and heterogeneous dash camera videos available on YouTube. Our key contribution is twofold. First, we empirically demonstrate the benefits of semi-supervised training for learning a general-purpose direct VO regression network. Second, we demonstrate multi-modal supervision, including segmentation, flow, depth, and audio auxiliary prediction tasks, to facilitate generalized representations for the VO task. Specifically, we find the audio prediction task significantly enhances the semi-supervised learning process while alleviating noisy pseudo-labels, particularly in highly dynamic and out-of-domain video data. Our proposed teacher network achieves state-of-the-art performance on the commonly used KITTI benchmark despite no multi-frame optimization or knowledge of camera parameters. Combined with the proposed semi-supervised step, XVO demonstrates off-the-shelf knowledge transfer across diverse conditions on KITTI, nuScenes, and Argoverse without fine-tuning. |
DOI: | 10.48550/arxiv.2309.16772 |
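
The abstract describes a direct VO regression network trained alongside auxiliary multi-modal prediction heads (segmentation, flow, depth, audio). The sketch below is only a minimal illustration of that idea, assuming a PyTorch setting; the class name, layer sizes, head output dimensions, and loss weighting are illustrative assumptions, not the authors' released implementation.

```python
# Illustrative sketch (not the XVO codebase): a shared encoder over a frame
# pair feeding a direct pose-regression head plus auxiliary multi-modal heads,
# as outlined in the abstract. All dimensions and names are assumptions.
import torch
import torch.nn as nn

class MultiTaskVO(nn.Module):
    def __init__(self, feat_dim=256):
        super().__init__()
        # Shared encoder over two stacked RGB frames (6 input channels).
        self.encoder = nn.Sequential(
            nn.Conv2d(6, 64, 7, stride=2, padding=3), nn.ReLU(),
            nn.Conv2d(64, 128, 5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(128, feat_dim, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        # Direct VO regression head: 3-DoF translation + 3-DoF rotation.
        self.pose_head = nn.Linear(feat_dim, 6)
        # Auxiliary heads, kept low-dimensional for brevity.
        self.seg_head = nn.Linear(feat_dim, 19)    # semantic class logits
        self.flow_head = nn.Linear(feat_dim, 2)    # mean optical flow
        self.depth_head = nn.Linear(feat_dim, 1)   # mean scene depth
        self.audio_head = nn.Linear(feat_dim, 64)  # audio feature prediction

    def forward(self, frame_pair):
        z = self.encoder(frame_pair)
        return {
            "pose": self.pose_head(z),
            "seg": self.seg_head(z),
            "flow": self.flow_head(z),
            "depth": self.depth_head(z),
            "audio": self.audio_head(z),
        }

# Example forward pass on a batch of two stacked 128x128 frames.
model = MultiTaskVO()
outputs = model(torch.randn(4, 6, 128, 128))
print({k: v.shape for k, v in outputs.items()})
```

In a pseudo-labeled self-training step, the auxiliary losses would typically be weighted against the pose loss (e.g. total = pose_loss + sum of weighted auxiliary losses); the specific weights used by the paper are not given in this record.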