CSA-RCNN: Cascaded Self-Attention Networks for High-Quality 3-D Object Detection From LiDAR Point Clouds
Published in: IEEE Transactions on Instrumentation and Measurement, 2024, Vol. 73, pp. 1-13
Main authors: , ,
Format: Article
Language: English
Online access: Order full text
Abstract: LiDAR-based 3-D object detection leverages the precise spatial information provided by point clouds to enhance the understanding of 3-D environments. This approach has garnered significant attention from both industry and academia, particularly in fields such as autonomous driving and robotics. However, improving the detection accuracy of long-distance objects remains a critical challenge for existing two-stage methods. This difficulty is primarily attributed to the sparsity and uneven distribution of point clouds, which lead to inconsistent proposal quality for distant targets. To tackle these challenges, this article proposes a novel and effective 3-D point cloud detection network based on a cascaded self-attention (CSA) region-based convolutional neural network (RCNN) to achieve higher-quality 3-D object detection in traffic scenes. First, to enhance the quality of proposals for long-range objects, we design a cascade self-attention module (CSM) that applies a multihead self-attention (MHSA) mechanism across multiple independent cascaded subnetworks to aggregate proposal features at different stages. This improves the accuracy of iterative proposal refinement by strengthening feature modeling across stages. Second, to enhance the correlation between different representations of the point cloud, we design a transformer-based feature fusion module that fully integrates these multisource features into richer point-wise features. Finally, to remove unnecessary background information from the 3-D scene, we introduce a semantic-guided farthest point sampling (S-FPS) strategy that preserves essential foreground points during downsampling. Extensive experiments on the highly competitive KITTI and Waymo datasets validate the effectiveness of the proposed method. Notably, CSA-RCNN achieves a +1.01% improvement in average precision (AP) for the car class at the hard difficulty level compared to point-voxel (PV)-RCNN on the KITTI validation set.
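The cascade self-attention module described in the abstract can be pictured as stacked stages of multihead self-attention over proposal features, with each stage's output feeding the next. The PyTorch sketch below is a hypothetical illustration of that idea only; the class name, dimensions, and residual/normalization choices are assumptions, not the paper's released implementation.

```python
import torch
import torch.nn as nn

class CascadeSelfAttention(nn.Module):
    """Illustrative sketch of cascaded self-attention over proposals.

    Each stage lets the proposal features attend to one another via
    multihead self-attention (MHSA), and the refined features are
    handed to the next stage, mimicking iterative proposal refinement.
    """

    def __init__(self, dim=256, heads=4, num_stages=3):
        super().__init__()
        self.stages = nn.ModuleList(
            nn.MultiheadAttention(dim, heads, batch_first=True)
            for _ in range(num_stages)
        )
        self.norms = nn.ModuleList(
            nn.LayerNorm(dim) for _ in range(num_stages)
        )

    def forward(self, x):
        # x: (batch, num_proposals, dim) proposal features from the RPN.
        for attn, norm in zip(self.stages, self.norms):
            refined, _ = attn(x, x, x)  # proposals attend to each other
            x = norm(x + refined)       # residual update per cascade stage
        return x

feats = torch.randn(2, 128, 256)     # 2 scenes, 128 proposals each
out = CascadeSelfAttention()(feats)  # same shape, stage-refined
```

Similarly, semantic-guided farthest point sampling (S-FPS) can be read as ordinary FPS whose farthest-point criterion is weighted by a per-point foreground score, so that likely-foreground points survive downsampling. The NumPy sketch below assumes such scores come from a separate segmentation head; it is a minimal reconstruction of the idea, not the authors' code.

```python
import numpy as np

def semantic_fps(points, fg_scores, num_samples):
    """Minimal S-FPS sketch: FPS distances scaled by foreground scores.

    points:    (N, 3) LiDAR point coordinates.
    fg_scores: (N,) foreground probabilities in [0, 1] (assumed given).
    Returns the indices of the num_samples selected points.
    """
    n = points.shape[0]
    selected = np.zeros(num_samples, dtype=np.int64)
    min_dist = np.full(n, np.inf)            # distance to the selected set
    selected[0] = int(np.argmax(fg_scores))  # seed with strongest foreground point
    for i in range(1, num_samples):
        last = points[selected[i - 1]]
        dist = np.sum((points - last) ** 2, axis=1)
        min_dist = np.minimum(min_dist, dist)
        # Semantic weighting: favor far-away points that are also likely
        # foreground, instead of the purely geometric farthest point.
        selected[i] = int(np.argmax(min_dist * fg_scores))
    return selected

pts = np.random.rand(16384, 3).astype(np.float32)
scores = np.random.rand(16384).astype(np.float32)
kept = pts[semantic_fps(pts, scores, 1024)]  # downsample to 1024 points
```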
ISSN: 0018-9456; eISSN: 1557-9662
DOI: 10.1109/TIM.2024.3476690