VoxelNextFusion: A Simple, Unified, and Effective Voxel Fusion Framework for Multimodal 3-D Object Detection

Detailed description

Bibliographic details
Published in:	IEEE Transactions on Geoscience and Remote Sensing, 2023, Vol. 61, pp. 1-12
Main authors:	Song, Ziying; Zhang, Guoxin; Xie, Jun; Liu, Lin; Jia, Caiyan; Xu, Shaoqing; Wang, Zhepeng
Format:	Article
Language:	English
Keywords:	3-D object detection; Benchmarks; Cameras; Detection; Feature extraction; Information processing; Laser radar; Lidar; multimodal fusion; Object detection; Object recognition; patch fusion; Point cloud compression; Semantics; Three-dimensional displays

Description
Abstract:	Light detection and ranging (LiDAR)-camera fusion can enhance the performance of 3-D object detection by exploiting the complementary information in depth-aware LiDAR points and semantically rich images. Existing voxel-based methods face significant challenges when fusing sparse voxel features with dense image features in a one-to-one manner. Such fusion discards the advantages of images, including their semantic and continuity information, and leads to suboptimal detection performance, especially at long distances. In this article, we present VoxelNextFusion, a multimodal 3-D object detection framework designed specifically for voxel-based methods, which bridges the gap between sparse point clouds and dense images. In particular, we propose a voxel-based image pipeline that projects point clouds onto images to obtain both pixel- and patch-level features. These features are then fused using self-attention to obtain a combined representation. Moreover, because patches also contain background features, we propose a feature importance module that distinguishes between foreground and background features and thereby reduces the impact of the background. Extensive experiments were conducted on the widely used KITTI and nuScenes 3-D object detection benchmarks. Notably, VoxelNextFusion improved AP@0.7 for car detection at the hard level by around 3.20% over the Voxel R-CNN baseline on the KITTI test set.
ISSN:	0196-2892 1558-0644
DOI:	10.1109/TGRS.2023.3331893
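
Illustrative sketch (not taken from the article): the abstract describes projecting LiDAR points onto the image, collecting a pixel-level feature and a surrounding patch of features for each point, and fusing them with self-attention. The plain-Python example below shows one possible reading of those three steps. All function names, the 3 x 4 projection matrix, the 3 x 3 patch size, and the use of a single unparameterized attention step with the pixel feature as the query are assumptions made for illustration; the authors' network uses learned layers, and the feature importance module is not covered here.

```python
import math


def project_point(point, proj):
    """Project one (x, y, z) point with a 3 x 4 matrix (assumed camera model).

    Returns (u, v, depth), or None if the point lies behind the camera.
    """
    x, y, z = point
    homo = (x, y, z, 1.0)
    cu, cv, depth = (sum(p * h for p, h in zip(row, homo)) for row in proj)
    if depth <= 0:
        return None
    return cu / depth, cv / depth, depth


def gather_patch(feat, u, v, k=3):
    """Collect the k*k feature vectors around pixel (u, v), clamped at borders.

    feat is an H x W grid of C-dimensional lists. The pixel-level feature is
    the centre entry of the returned list, at index (k * k) // 2.
    """
    h, w = len(feat), len(feat[0])
    r = k // 2
    cu = min(max(int(round(u)), 0), w - 1)
    cv = min(max(int(round(v)), 0), h - 1)
    patch = []
    for dv in range(-r, r + 1):
        for du in range(-r, r + 1):
            row = min(max(cv + dv, 0), h - 1)
            col = min(max(cu + du, 0), w - 1)
            patch.append(feat[row][col])
    return patch


def attention_fuse(query, tokens):
    """Scaled dot-product attention of one query over a list of tokens.

    Simplification: no learned query/key/value projections are applied.
    Returns (fused vector, attention weights).
    """
    scale = math.sqrt(len(query))
    scores = [sum(q * t for q, t in zip(query, tok)) / scale for tok in tokens]
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    fused = [
        sum(w * tok[c] for w, tok in zip(weights, tokens))
        for c in range(len(query))
    ]
    return fused, weights
```

In this sketch, a point is first passed to project_point, the resulting (u, v) is passed to gather_patch, and the centre entry of the patch is used as the query in attention_fuse, so the fused vector is a weighted mix of the pixel feature and its neighbours.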