MMI-Det: Exploring Multi-Modal Integration for Visible and Infrared Object Detection

Bibliographic Details
Published in: IEEE Transactions on Circuits and Systems for Video Technology, 2024-11, Vol. 34 (11), p. 11198-11213
Authors: Zeng, Yuqiao; Liang, Tengfei; Jin, Yi; Li, Yidong
Format: Article
Language: English
Abstract: Visible-Infrared (VIS-IR) object detection is a challenging task that combines visible and infrared data to identify the category and location of objects in a scene. The core of the task is therefore to combine the complementary information in the visible and infrared modalities to produce more reliable detection results. Existing methods mainly suffer from an insufficient ability to perceive and combine visible-infrared modal information, and they have difficulty balancing the optimization directions of the fusion and detection tasks. To solve these problems, we propose MMI-Det, a multi-modal fusion method for visible and infrared object detection. The method effectively combines complementary information from the visible and infrared modalities and outputs accurate, robust object information. Specifically, to improve the model's ability to perceive the environment at the visible-infrared image level, we design the Contour Enhancement Module. Furthermore, to extract complementary information from the VIS and IR modalities, we design the Fusion Focus Module, which extracts spectral features at different frequencies from the two modalities and focuses on key object information at different spatial locations. Moreover, we design the Contrast Bridge Module to improve the extraction of modality-invariant features in visible-infrared scenes. Finally, to ensure that the model balances the optimization directions of image fusion and object detection, we design the Info Guided Module, which improves the effectiveness of the model's training optimization. We conduct extensive experiments on the public FLIR, M3FD, LLVIP, TNO, and MSRS datasets; compared with previous methods, our approach achieves better performance with powerful multi-modal information perception capabilities.
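
The abstract describes the architecture only at a high level, so the sketch below shows one plausible reading of two of the described ideas in PyTorch: a fusion block that splits each modality's features into low- and high-frequency parts before reweighting the fused map spatially (in the spirit of the Fusion Focus Module), and an InfoNCE-style contrastive loss that pulls paired VIS/IR features together (in the spirit of the Contrast Bridge Module). All names (FrequencyAwareFusion, modality_invariant_loss) and design choices here are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class FrequencyAwareFusion(nn.Module):
    """Illustrative VIS-IR fusion block (assumed design, not the paper's).

    Splits each modality's feature map into low- and high-frequency parts
    (average pooling as a crude low-pass filter), mixes the two modalities
    per frequency band, then reweights the fused map with a spatial
    attention mask so salient object regions dominate.
    """

    def __init__(self, channels: int):
        super().__init__()
        # 1x1 convs mix the concatenated VIS/IR streams back to `channels`.
        self.low_mix = nn.Conv2d(2 * channels, channels, kernel_size=1)
        self.high_mix = nn.Conv2d(2 * channels, channels, kernel_size=1)
        # Spatial attention computed from channel-pooled statistics.
        self.spatial_attn = nn.Conv2d(2, 1, kernel_size=7, padding=3)

    @staticmethod
    def _split_freq(x: torch.Tensor):
        # Low-pass: blur via average pooling; high-pass: the residual detail.
        low = F.avg_pool2d(x, kernel_size=3, stride=1, padding=1)
        return low, x - low

    def forward(self, vis: torch.Tensor, ir: torch.Tensor) -> torch.Tensor:
        vis_low, vis_high = self._split_freq(vis)
        ir_low, ir_high = self._split_freq(ir)
        low = self.low_mix(torch.cat([vis_low, ir_low], dim=1))
        high = self.high_mix(torch.cat([vis_high, ir_high], dim=1))
        fused = low + high
        # Spatial mask from per-location mean/max over channels.
        stats = torch.cat([fused.mean(dim=1, keepdim=True),
                           fused.amax(dim=1, keepdim=True)], dim=1)
        mask = torch.sigmoid(self.spatial_attn(stats))
        return fused * mask


def modality_invariant_loss(vis_feat: torch.Tensor,
                            ir_feat: torch.Tensor,
                            temperature: float = 0.07) -> torch.Tensor:
    """InfoNCE-style contrastive loss: pull VIS/IR features of the same
    scene together, push different scenes apart. One plausible reading of
    a modality-invariance objective, not the paper's exact formulation."""
    v = F.normalize(vis_feat.flatten(1), dim=1)   # (B, D)
    r = F.normalize(ir_feat.flatten(1), dim=1)    # (B, D)
    logits = v @ r.t() / temperature              # (B, B) similarity matrix
    targets = torch.arange(v.size(0), device=v.device)
    return F.cross_entropy(logits, targets)


if __name__ == "__main__":
    vis = torch.randn(2, 64, 32, 32)   # visible-stream features
    ir = torch.randn(2, 64, 32, 32)    # infrared-stream features
    fuse = FrequencyAwareFusion(64)
    print(fuse(vis, ir).shape)                      # torch.Size([2, 64, 32, 32])
    print(modality_invariant_loss(vis, ir).item())  # scalar contrastive loss
```
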
ISSN: 1051-8215, 1558-2205
DOI: 10.1109/TCSVT.2024.3418965