Multi-scale feature fusion with attention mechanism for crowded road object detection

Bibliographic details
Published in:	Journal of real-time image processing 2024-04, Vol.21 (2), p.29, Article 29
Main authors:	Wu, Jingtao; Dai, Guojun; Zhou, Wenhui; Zhu, Xudong; Wang, Zengguan
Format:	Article
Language:	English
Subjects:	Accuracy; Algorithms; Boxes; Computer Graphics; Computer networks; Computer Science; Datasets; Deep learning; Dynamic focusing; Image Processing and Computer Vision; Modules; Multimedia Information Systems; Neural networks; Object recognition; Occlusion; Pattern Recognition; Pedestrians; Roads; Robotics; Semantics; Sensors; Signal, Image and Speech Processing; Vehicles
Online access:	Full text

Description
Summary:	Crowded object detection in heavy traffic is a persistent challenge in autonomous driving and robotics, because dense gatherings of vehicles or pedestrians inevitably bring heavy occlusion. It is difficult to distinguish highly overlapped objects and predict their bounding boxes accurately, especially for small objects far down the road. To address this challenge, this paper proposes an improved YOLOv5s network that integrates a multi-scale feature fusion module with an attention mechanism for crowded road object detection. Specifically, to enhance the multi-scale representation of semantic features and to model object scale variation flexibly, we introduce an attention-guided pyramid feature fusion strategy into the YOLOv5s backbone network. A C3CA module is then designed by embedding coordinate attention (CA) into the concentrated-comprehensive convolution (C3) module of the original YOLOv5s, which improves the extraction of distinguishing features from overlapped objects. In addition, we add implicit detection heads (IDHs) to the detection head of the original YOLOv5s, which helps the network learn implicit knowledge and improves detection accuracy. Finally, simplified optimal transport assignment (SimOTA) and a bounding box regression loss with a dynamic focusing mechanism are used to improve the detector's overall performance. Extensive experiments on the public dataset BDD100K and our self-built crowded road object dataset (XMRD) demonstrate the superiority of our model in crowded road scenarios. The mean average precision (mAP) of our model reaches 71.2% and 88.2% on the BDD100K and XMRD datasets, respectively, an improvement of 3% over existing state-of-the-art models.
ISSN:	1861-8200 1861-8219
DOI:	10.1007/s11554-023-01409-1
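
Note on "highly overlapped objects": the summary above does not say how overlap is measured. A common measure in object detection is intersection over union (IoU), which also underlies bounding box regression losses and mAP evaluation in general. The following is a minimal sketch under stated assumptions, not the paper's loss function or its dynamic focusing mechanism. The `iou` helper, the corner-format box layout and the two example boxes are hypothetical and chosen only for illustration.

```python
from typing import Tuple

# Assumed box layout: (x1, y1, x2, y2) with x1 < x2 and y1 < y2, in pixels.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    # Corners of the intersection rectangle (may be empty).
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    # Clamp to zero so that disjoint boxes yield no intersection area.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# Two hypothetical pedestrian boxes standing close together (illustrative values).
ped_a: Box = (100.0, 50.0, 140.0, 150.0)
ped_b: Box = (120.0, 55.0, 160.0, 155.0)
print(f"IoU = {iou(ped_a, ped_b):.3f}")
```

In crowded scenes, two distinct objects can have a substantial IoU with each other, which is one reason overlap-based steps such as label assignment and duplicate suppression become harder there.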