Robust LiDAR-Camera Alignment With Modality Adapted Local-to-Global Representation

Bibliographic Details
Published in: IEEE Transactions on Circuits and Systems for Video Technology, 2023-01, Vol. 33 (1), pp. 59-73
Authors: Zhu, Angfan; Xiao, Yang; Liu, Chengxin; Cao, Zhiguo
Format: Article
Language: English

Abstract: LiDAR-Camera alignment (LCA) is an important preprocessing procedure for fusing LiDAR and camera data. A key issue is to extract a unified cross-modality representation that characterizes the heterogeneous LiDAR and camera data effectively and robustly. The main challenge is to resist the modality gap and visual data degradation during feature learning while maintaining strong representative power. To address this, a novel modality-adapted local-to-global representation learning method is proposed. The research effort is devoted to two main aspects: modality adaptation and capturing global spatial context. First, to resist the modality gap, LiDAR and camera data are projected into the same depth-map domain for unified representation learning. In particular, LiDAR data is converted to a depth map according to pre-acquired extrinsic parameters. Thanks to recent advances in deep-learning-based monocular depth estimation, camera data is transformed into a depth map in a data-driven manner, jointly optimized with LCA. Second, to capture global spatial context, a vision transformer (ViT) is introduced to LCA. The concept of an LCA token is proposed for aggregating local spatial patterns into a global spatial representation via transformer encoding. The token is shared by all samples, so it incorporates global sample-level information to strengthen generalization ability. Experiments on the KITTI dataset verify the superiority of our proposition. Furthermore, the proposed approach is more robust to the camera data degradation (e.g., image blurring and noise) often faced in practical applications. Under some challenging test cases, the performance advantage of our method exceeds 1.9 cm / 4.1° in translation / rotation error, while our model size (8.77M) is much smaller than that of existing methods (e.g., 66.75M for LCCNet). The source code will be released at https://github.com/Zaf233/RLCA upon acceptance.
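
The abstract describes converting LiDAR points into a depth map using pre-acquired extrinsic (and camera intrinsic) parameters as the basis of the unified representation. The sketch below illustrates only that projection step; it is a minimal NumPy illustration under assumed array shapes and function names, not the authors' released code.

import numpy as np

def lidar_to_depth_map(points, T_cam_lidar, K, height, width):
    # points: (N, 3) LiDAR points; T_cam_lidar: (4, 4) extrinsic transform
    # from LiDAR to camera coordinates; K: (3, 3) camera intrinsic matrix.
    pts_h = np.hstack([points, np.ones((points.shape[0], 1))])
    pts_cam = (T_cam_lidar @ pts_h.T).T[:, :3]

    # Keep only points in front of the camera.
    pts_cam = pts_cam[pts_cam[:, 2] > 0]

    # Perspective projection onto the image plane.
    uv = (K @ pts_cam.T).T
    uv = uv[:, :2] / uv[:, 2:3]
    u = np.round(uv[:, 0]).astype(int)
    v = np.round(uv[:, 1]).astype(int)

    # Rasterize a sparse depth map, keeping the nearest depth per pixel.
    depth_map = np.zeros((height, width), dtype=np.float32)
    valid = (u >= 0) & (u < width) & (v >= 0) & (v < height)
    for x, y, z in zip(u[valid], v[valid], pts_cam[valid, 2]):
        if depth_map[y, x] == 0 or z < depth_map[y, x]:
            depth_map[y, x] = z
    return depth_map

The LCA token described in the abstract is a learned vector, shared across all samples, that a transformer encoder uses to aggregate local patch features into a global representation. The following PyTorch fragment is a hypothetical sketch of that idea; the module name, dimensions, and layer counts are assumptions, not taken from the paper.

import torch
import torch.nn as nn

class LCATokenEncoder(nn.Module):
    def __init__(self, dim=256, depth=4, heads=8):
        super().__init__()
        # One learnable LCA token, shared by all samples.
        self.lca_token = nn.Parameter(torch.zeros(1, 1, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, patch_tokens):
        # patch_tokens: (B, N, dim) local features from the depth maps.
        b = patch_tokens.shape[0]
        token = self.lca_token.expand(b, -1, -1)
        x = torch.cat([token, patch_tokens], dim=1)
        x = self.encoder(x)
        # The encoded LCA token serves as the global spatial representation.
        return x[:, 0]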
ISSN: 1051-8215, 1558-2205
DOI: 10.1109/TCSVT.2022.3197212