CrossFusion net: Deep 3D object detection based on RGB images and point clouds in autonomous driving

Bibliographic Details
Published in: Image and Vision Computing, 2020-08, Vol. 100, Article 103955
Main Authors: Hong, Dza-Shiang; Chen, Hung-Hao; Hsiao, Pei-Yung; Fu, Li-Chen; Siao, Siang-Min
Format: Article
Language: English
Online Access: Full text
Description
Abstract: In recent years, accurate 3D detection has played an important role in many applications, autonomous driving being a typical example. This paper aims to design an accurate 3D detector that takes both LiDAR point clouds and RGB images as inputs, exploiting the fact that LiDAR and camera each have their own merits. A novel end-to-end two-stream learnable deep architecture, CrossFusion Net, is designed to extract features from both LiDAR point clouds and RGB images through a hierarchical fusion structure. Specifically, CrossFusion Net uses a bird's eye view (BEV) of the point clouds obtained through projection. The feature maps of the two streams are fused through the newly introduced CrossFusion (CF) layer, which transforms the feature maps of one stream into the other based on the spatial relationship between the BEV and the RGB images. Additionally, an attention mechanism is applied to the transformed feature map and the original one to automatically decide the relative importance of the two feature maps from the two sensors. Experiments on the challenging KITTI car 3D detection and BEV detection benchmarks show that the presented approach outperforms other state-of-the-art methods in average precision (AP); in particular, it outperforms UberATG-ContFuse [3] by 8% AP in moderate 3D car detection. Furthermore, the proposed network learns an effective representation of the surroundings via the RGB and BEV feature maps.
•The proposed CrossFusion Net performs 3D object detection from two sensors.
•The presented attention mechanism generates adaptive weights for the two streams of feature maps.
•CrossFusion Net improves AP by 1%, 8%, and 3% in the easy, moderate, and hard cases, respectively.
•The network's 100 ms inference time compares favorably with the 170-360 ms of other methods.
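The adaptive weighting of the two streams described in the abstract can be sketched as follows. This is a minimal illustration only, assuming a simple per-channel softmax attention over globally pooled descriptors; the paper's actual CF-layer design may differ, and `crossfusion_attention` is a hypothetical helper name, not from the paper.

```python
import numpy as np

def crossfusion_attention(own_feat, transformed_feat):
    """Blend a stream's own feature map with the map transformed from the
    other stream, using adaptive per-channel weights (illustrative sketch).

    Both inputs have shape (C, H, W); the output has the same shape.
    """
    # Global average pooling gives one descriptor value per channel.
    own_desc = own_feat.mean(axis=(1, 2))            # shape (C,)
    trans_desc = transformed_feat.mean(axis=(1, 2))  # shape (C,)

    # Softmax over the two sources yields weights summing to 1 per channel,
    # so the network can emphasize whichever sensor is more informative.
    logits = np.stack([own_desc, trans_desc])        # shape (2, C)
    exp = np.exp(logits - logits.max(axis=0, keepdims=True))
    w = exp / exp.sum(axis=0, keepdims=True)         # shape (2, C)

    # Broadcast the per-channel weights over the spatial dimensions.
    return (w[0, :, None, None] * own_feat
            + w[1, :, None, None] * transformed_feat)
```

In the paper these weights are learned end-to-end inside each CF layer; the fixed pooling-plus-softmax above only illustrates the idea of deciding, per channel, how much each sensor's features contribute.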
ISSN: 0262-8856
1872-8138
DOI: 10.1016/j.imavis.2020.103955