Vehicle Detection Based on Adaptive Multi-modal Feature Fusion and Cross-modal Vehicle Index using RGB-T Images

Bibliographic Details
Published in: IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 2023-07, p. 1-12
Authors: Wu, Yuanfeng, Guan, Xinran, Zhao, Boya, Ni, Li, Huang, Min
Format: Article
Language: English
Subjects:
Online access: Full text
Description
Abstract: Target detection is a critical task in interpreting aerial images, and detecting small targets such as vehicles is particularly challenging. Different lighting conditions affect the accuracy of vehicle detection. For example, vehicles are difficult to distinguish from the background in RGB images under low illumination conditions, whereas under high illumination conditions the color and texture of vehicles are not significantly different in thermal infrared (TIR) images. To improve the accuracy of vehicle detection under various illumination conditions, we propose an adaptive multi-modal feature fusion and cross-modal vehicle index (AFFCM) model for vehicle detection. Built on a single-stage object detection model, AFFCM uses red, green, blue, and thermal infrared (RGB-T) images. It comprises three parts: 1) a softpooling channel attention (SCA) mechanism that calculates the cross-modal feature weights of the RGB and TIR features using a fully connected layer during global weighted pooling; 2) a multi-modal adaptive feature fusion (MAFF) module, designed on the basis of the cross-modal feature weights derived from the SCA mechanism, which selects features with high weights, compresses redundant features with low weights, and performs adaptive fusion using a multi-scale feature pyramid; and 3) a cross-modal vehicle index, established to extract the target area, suppress complex background information, and minimize false alarms in vehicle detection. The mean average precision (mAP) on the Drone Vehicle dataset is 14.44% and 5.02% higher than that obtained using only RGB or TIR images, respectively, and 2.63% higher than that of state-of-the-art (SOTA) methods that utilize RGB and TIR images.
ISSN: 1939-1404
DOI: 10.1109/JSTARS.2023.3294624
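
The SCA and MAFF components described in the abstract, i.e., cross-modal channel weights computed by SoftPool-style global weighted pooling plus a fully connected layer, and then used for adaptive fusion of RGB and TIR features, can be pictured with the minimal PyTorch sketch below. This is an illustration under assumptions, not the authors' implementation: SoftPool is taken as softmax-weighted spatial pooling, the fusion weights are sigmoid-normalized, and names such as softpool_2d and CrossModalChannelAttention are hypothetical.

```python
# Minimal sketch of SoftPool-based cross-modal channel attention and adaptive
# RGB/TIR feature fusion, loosely following the abstract's description of the
# SCA and MAFF ideas. Hypothetical names and design choices; not the paper's code.
import torch
import torch.nn as nn
import torch.nn.functional as F


def softpool_2d(x: torch.Tensor) -> torch.Tensor:
    """Global SoftPool: softmax-weighted spatial average, (B, C, H, W) -> (B, C)."""
    b, c, h, w = x.shape
    flat = x.reshape(b, c, h * w)
    weights = F.softmax(flat, dim=-1)      # exp(a_i) / sum_j exp(a_j) over spatial positions
    return (weights * flat).sum(dim=-1)    # weighted sum per channel


class CrossModalChannelAttention(nn.Module):
    """Fully connected layer on pooled RGB+TIR descriptors -> per-channel fusion weights."""

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(2 * channels, (2 * channels) // reduction),
            nn.ReLU(inplace=True),
            nn.Linear((2 * channels) // reduction, 2 * channels),
        )

    def forward(self, rgb_feat: torch.Tensor, tir_feat: torch.Tensor) -> torch.Tensor:
        pooled = torch.cat([softpool_2d(rgb_feat), softpool_2d(tir_feat)], dim=1)  # (B, 2C)
        w = torch.sigmoid(self.fc(pooled))                                         # cross-modal weights
        w_rgb, w_tir = w.chunk(2, dim=1)                                           # (B, C) each
        # Adaptive fusion: emphasize whichever modality has the higher channel weight.
        return w_rgb[..., None, None] * rgb_feat + w_tir[..., None, None] * tir_feat


if __name__ == "__main__":
    rgb = torch.randn(2, 64, 32, 32)   # dummy RGB backbone features
    tir = torch.randn(2, 64, 32, 32)   # dummy TIR backbone features
    fused = CrossModalChannelAttention(channels=64)(rgb, tir)
    print(fused.shape)                 # torch.Size([2, 64, 32, 32])
```

According to the abstract, the full model additionally passes the fused features through a multi-scale feature pyramid and applies a cross-modal vehicle index to suppress background and reduce false alarms; both stages are omitted from this sketch.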