Toward Robust LiDAR-Camera Fusion in BEV Space via Mutual Deformable Attention and Temporal Aggregation
Published in: IEEE Transactions on Circuits and Systems for Video Technology, 2024-07, Vol. 34 (7), pp. 5753-5764
Main authors: , , , ,
Format: Article
Language: English
Abstract: LiDAR and camera are two critical sensors that provide complementary information for accurate 3D object detection. Most works are devoted to improving the detection performance of fusion models on clean, well-curated datasets. However, the point clouds and images collected in real scenarios may be corrupted to varying degrees by sensor malfunctions, which greatly degrades the robustness of the fusion model and poses a threat to safe deployment. In this paper, we first analyze the main shortcoming of most fusion detectors, namely their heavy reliance on the LiDAR branch, and the potential of the bird's-eye-view (BEV) paradigm for handling partial sensor failures. Building on this analysis, we present a robust LiDAR-camera fusion pipeline in a unified BEV space with two novel designs that target four typical LiDAR-camera malfunction cases. Specifically, a mutual deformable attention module is proposed to dynamically model spatial feature relationships and reduce the interference caused by a corrupted modality, and a temporal aggregation module is devised to fully exploit the rich information in the temporal domain. Together with decoupled feature extraction for each modality and holistic fusion in BEV space, the proposed detector, termed RobBEV, works stably regardless of single-modality data corruption. Extensive experiments on the large-scale nuScenes dataset under robust settings demonstrate the effectiveness of our approach.
ISSN: 1051-8215, 1558-2205
DOI: 10.1109/TCSVT.2024.3366664
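The abstract describes two components at a high level: a mutual deformable attention that lets LiDAR and camera BEV features attend to each other, and a temporal aggregation module over past BEV frames. The record contains no implementation details, so the following is only a minimal PyTorch-style sketch of what such a fusion could look like. All class names, shapes, offset ranges, and hyperparameters are assumptions for illustration; this is not the RobBEV implementation.

```python
# Illustrative sketch only: simplified "mutual" deformable attention between
# LiDAR and camera BEV maps, plus a naive temporal aggregation helper.
import torch
import torch.nn as nn
import torch.nn.functional as F


class DeformableCrossBEVAttention(nn.Module):
    """Queries from one modality sample a few offset locations in the other
    modality's BEV map (single head, single level, for brevity)."""

    def __init__(self, dim: int, num_points: int = 4):
        super().__init__()
        self.num_points = num_points
        self.offset_head = nn.Linear(dim, 2 * num_points)  # (dx, dy) per sample point
        self.weight_head = nn.Linear(dim, num_points)       # per-point attention weights
        self.value_proj = nn.Conv2d(dim, dim, 1)
        self.out_proj = nn.Linear(dim, dim)

    def forward(self, query_bev: torch.Tensor, value_bev: torch.Tensor) -> torch.Tensor:
        # query_bev, value_bev: (B, C, H, W) BEV feature maps of the two modalities
        B, C, H, W = query_bev.shape
        q = query_bev.flatten(2).transpose(1, 2)            # (B, H*W, C)
        value = self.value_proj(value_bev)                  # (B, C, H, W)

        # Reference grid in normalized [-1, 1] coordinates, one point per BEV cell.
        ys, xs = torch.meshgrid(
            torch.linspace(-1, 1, H, device=q.device),
            torch.linspace(-1, 1, W, device=q.device),
            indexing="ij",
        )
        ref = torch.stack((xs, ys), dim=-1).view(1, H * W, 1, 2)

        offsets = self.offset_head(q).view(B, H * W, self.num_points, 2)
        weights = self.weight_head(q).softmax(dim=-1)        # (B, H*W, K)
        locs = (ref + 0.1 * torch.tanh(offsets)).clamp(-1, 1)  # bounded offsets

        # Bilinearly sample the other modality's features at the predicted locations.
        sampled = F.grid_sample(value, locs, align_corners=False)  # (B, C, H*W, K)
        sampled = sampled.permute(0, 2, 3, 1)                      # (B, H*W, K, C)
        fused = (weights.unsqueeze(-1) * sampled).sum(dim=2)       # (B, H*W, C)
        out = self.out_proj(fused) + q                             # residual keeps own features
        return out.transpose(1, 2).reshape(B, C, H, W)


class MutualBEVFusion(nn.Module):
    """Both directions: LiDAR queries camera BEV and vice versa, then fuse."""

    def __init__(self, dim: int):
        super().__init__()
        self.lidar_from_cam = DeformableCrossBEVAttention(dim)
        self.cam_from_lidar = DeformableCrossBEVAttention(dim)
        self.fuse = nn.Conv2d(2 * dim, dim, 3, padding=1)

    def forward(self, lidar_bev: torch.Tensor, cam_bev: torch.Tensor) -> torch.Tensor:
        lidar_ref = self.lidar_from_cam(lidar_bev, cam_bev)
        cam_ref = self.cam_from_lidar(cam_bev, lidar_bev)
        return self.fuse(torch.cat((lidar_ref, cam_ref), dim=1))


def aggregate_temporal(bev_frames):
    """Naive temporal aggregation: average the current and (already ego-motion
    aligned) past BEV frames; warping/alignment is omitted for brevity."""
    return torch.stack(bev_frames, dim=0).mean(dim=0)
```

The tanh-bounded offsets and the residual connection are one plausible way to let a clean modality retain its own features when the other modality's BEV map is corrupted; the actual RobBEV design may differ in the number of heads and levels, the offset parameterization, and how past frames are ego-motion aligned before aggregation.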