Long-short Range Adaptive Transformer with Dynamic Sampling for 3D Object Detection

Bibliographic Details
Published in: IEEE Transactions on Circuits and Systems for Video Technology, 2023-12, Vol. 33 (12), p. 1-1
Main Authors: Wang, Chuxin; Deng, Jiacheng; He, Jianfeng; Zhang, Tianzhu; Zhang, Zhe; Zhang, Yongdong
Format: Article
Language: English
Abstract: 3D object detection in point clouds aims to simultaneously localize and recognize 3D objects from a 3D point set. However, since point clouds are usually sparse, unordered, and irregular, it is challenging to learn robust point representations and to sample high-quality object queries. To deal with these issues, we propose a Long-short rangE Adaptive transformer with Dynamic sampling (LeadNet), which integrates a point representation encoder, a dynamic object query sampling decoder, and an object detection decoder in a unified architecture for 3D object detection. Specifically, in the point representation encoder, we combine an attention layer and a channel attentive kernel convolution layer to consider the local structure and the long-range context simultaneously. In the dynamic object query sampling decoder, we utilize multiple dynamic prototypes to adapt to various point clouds. In the object detection decoder, we incorporate a dynamic Gaussian weight map into the cross-attention mechanism to help the detection decoder focus on the proper visual regions near an object, further accelerating the training process. Extensive experimental results on two standard benchmarks show that our LeadNet outperforms the 3DETR baseline by 11.6% mAP 50 on the ScanNet v2 dataset and achieves new state-of-the-art results on the ScanNet v2 and SUN RGB-D benchmarks among geometric-only approaches.
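
The Gaussian-weighted cross-attention described in the abstract can be pictured with a minimal sketch. The snippet below is an illustrative PyTorch-style implementation, not the authors' code: it assumes a single attention head, an isotropic Gaussian prior with a hand-picked sigma, and hypothetical names (gaussian_weighted_cross_attention, query_centers, point_xyz), whereas the actual LeadNet decoder predicts its weight map dynamically.

    import torch
    import torch.nn.functional as F

    def gaussian_weighted_cross_attention(q, k, v, query_centers, point_xyz, sigma=0.3):
        # q:             (B, Nq, C) object query features
        # k, v:          (B, Np, C) encoded point features
        # query_centers: (B, Nq, 3) current 3D centre estimate per query (assumed input)
        # point_xyz:     (B, Np, 3) coordinates of the encoded points
        # sigma:         spread of the Gaussian prior in scene units (hypothetical hyper-parameter)
        d = q.size(-1)
        logits = torch.matmul(q, k.transpose(-2, -1)) / d ** 0.5   # (B, Nq, Np) attention logits

        # Squared distance from every query centre to every encoded point.
        dist2 = torch.cdist(query_centers, point_xyz).pow(2)       # (B, Nq, Np)

        # Gaussian weight map: points close to the estimated centre keep their
        # logits, distant points are suppressed (the prior is added in log-space).
        log_prior = -dist2 / (2.0 * sigma ** 2)

        attn = F.softmax(logits + log_prior, dim=-1)
        return torch.matmul(attn, v)                                # (B, Nq, C) updated queries

Adding the prior in log-space keeps the operation a valid softmax attention while biasing it toward points near each query's current box estimate, which matches the intuition in the abstract that the detection decoder should focus on visual regions near the object.
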
ISSN: 1051-8215, 1558-2205
DOI: 10.1109/TCSVT.2023.3272734