Hybrid multi-attention transformer for robust video object detection

Bibliographic details
Published in:	Engineering applications of artificial intelligence 2025-01, Vol.139, p.109606, Article 109606
Main authors:	Moorthy, Sathishkumar; K.S., Sachin Sakthi; Arthanari, Sathiyamoorthi; Jeong, Jae Hoon; Joo, Young Hoon
Format:	Article
Language:	eng
Subjects:	Attention mechanism; Target-background embeddings; Video object detection; Vision transformers
Online access:	Full text

Description
Summary:	Video object detection (VOD) is the task of detecting objects in videos, a challenge due to the changing appearance of objects over time, leading to potential detection errors. Recent research has addressed this by aggregating features from neighboring frames and incorporating information from distant frames to mitigate appearance deterioration. However, relying solely on object candidate regions in distant frames, independent of object position, has limitations, as it depends heavily on the performance of these regions and struggles with deteriorated appearances. To overcome these challenges, we propose a novel Hybrid Multi-Attention Transformer (HyMAT) module as our main contribution. HyMAT enhances relevant correlations while suppressing flawed information by searching for an agreement between whole correlation vectors. This module is designed for flexibility and can be integrated into both self- and cross-attention blocks to significantly improve detection accuracy. Additionally, we introduce a simplified Transformer-based object detection framework, named Hybrid Multi-Attention Object Detection (HyMATOD), which leverages competent feature reprocessing and target-background embeddings to more effectively utilize temporal references. Our approach demonstrates state-of-the-art performance, as evaluated on the ImageNet video object detection benchmark (ImageNet VID) and the University at Albany DEtection and TRACking (UA-DETRAC) benchmarks. Specifically, our HyMATOD model achieves an impressive 86.7% mean Average Precision (mAP) on the ImageNet VID dataset, establishing its superiority and practicality for video object detection tasks. These results underscore the significance of our contributions to advancing the field of VOD.
ISSN:	0952-1976
DOI:	10.1016/j.engappai.2024.109606
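
Illustrative note: the summary says HyMAT strengthens relevant correlations and suppresses flawed ones by searching for an agreement between whole correlation vectors, but this record gives no formulas. The NumPy sketch below is only one possible reading of that sentence, not the authors' implementation. The function name `consensus_refined_attention`, the residual refinement step, and the absence of learned projections are all assumptions made for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def consensus_refined_attention(q, k, v):
    """Hypothetical sketch: attention whose correlation map is refined
    by agreement between whole correlation vectors.

    q: (n_q, d) queries, k: (n_k, d) keys, v: (n_k, d_v) values.
    Assumption: no learned weights; the real module is likely richer.
    """
    d = q.shape[-1]
    # Standard scaled dot-product correlation map; row i is the
    # correlation vector of query i against every key.
    corr = q @ k.T / np.sqrt(d)                      # (n_q, n_k)

    # Agreement between whole correlation vectors: queries whose
    # correlation vectors point the same way get high mutual weight.
    n_k = corr.shape[-1]
    agreement = softmax(corr @ corr.T / np.sqrt(n_k), axis=-1)  # (n_q, n_q)

    # Residual refinement (assumed): each correlation vector is pulled
    # toward the consensus of the vectors that agree with it.
    refined = corr + agreement @ corr                # (n_q, n_k)

    weights = softmax(refined, axis=-1)              # rows sum to 1
    return weights @ v, weights


# Toy usage with random features standing in for frame embeddings.
rng = np.random.default_rng(0)
q = rng.normal(size=(4, 8))
k = rng.normal(size=(6, 8))
v = rng.normal(size=(6, 8))
out, weights = consensus_refined_attention(q, k, v)
```

Passing the same features as `q`, `k`, and `v` gives a self-attention variant, while taking `k` and `v` from a reference frame gives a cross-attention variant, which is in the spirit of the summary's claim that the module fits both block types.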