Hybrid multi-attention transformer for robust video object detection
Saved in:
Published in: | Engineering Applications of Artificial Intelligence, 2025-01, Vol. 139, Article 109606 |
Main authors: | , , , , |
Format: | Article |
Language: | eng |
Subjects: | |
Online access: | Full text |
Abstract: | Video object detection (VOD) is the task of detecting objects in videos, a challenge due to the changing appearance of objects over time, which leads to detection errors. Recent research has addressed this by aggregating features from neighboring frames and incorporating information from distant frames to mitigate appearance deterioration. However, relying solely on object candidate regions in distant frames, independent of object position, has limitations: it depends heavily on the quality of these regions and struggles with deteriorated appearances. To overcome these challenges, we propose a novel Hybrid Multi-Attention Transformer (HyMAT) module as our main contribution. HyMAT enhances relevant correlations while suppressing flawed information by searching for agreement among whole correlation vectors. The module is designed for flexibility and can be integrated into both self- and cross-attention blocks to significantly improve detection accuracy. Additionally, we introduce a simplified Transformer-based object detection framework, named Hybrid Multi-Attention Object Detection (HyMATOD), which leverages competent feature reprocessing and target-background embeddings to use temporal references more effectively. Our approach achieves state-of-the-art performance on the ImageNet video object detection (ImageNet VID) and University at Albany DEtection and TRACking (UA-DETRAC) benchmarks. Specifically, HyMATOD achieves 86.7% mean Average Precision (mAP) on ImageNet VID, establishing its superiority and practicality for video object detection tasks. These results underscore the significance of our contributions to advancing the field of VOD. |
ISSN: | 0952-1976 |
DOI: | 10.1016/j.engappai.2024.109606 |
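
The abstract states that HyMAT "enhances relevant correlations while suppressing flawed information" by searching for agreement among whole correlation vectors, and that the module can be dropped into both self- and cross-attention blocks. The paper's implementation is not reproduced in this record, so the following PyTorch sketch illustrates only one plausible reading of that idea: the class name HyMATAttention, the mean-correlation consensus, and the sigmoid agreement gate are all assumptions made for illustration, not the authors' published method.

```python
# Hypothetical sketch of correlation-agreement gated attention, loosely
# following the HyMAT description in the abstract. All design choices below
# (consensus via mean correlation vector, cosine-similarity agreement,
# sigmoid gating) are illustrative assumptions, not the published method.
import torch
import torch.nn as nn
import torch.nn.functional as F


class HyMATAttention(nn.Module):
    """Single-head attention whose weights are damped for queries whose
    correlation vector disagrees with the consensus over all queries."""

    def __init__(self, dim: int):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim)
        self.k_proj = nn.Linear(dim, dim)
        self.v_proj = nn.Linear(dim, dim)
        self.scale = dim ** -0.5

    def forward(self, query: torch.Tensor, context: torch.Tensor) -> torch.Tensor:
        # query:   (B, Nq, dim), e.g. current-frame features
        # context: (B, Nk, dim), e.g. features from temporal reference frames
        q = self.q_proj(query)
        k = self.k_proj(context)
        v = self.v_proj(context)

        # Correlation matrix between queries and context tokens: (B, Nq, Nk)
        corr = torch.matmul(q, k.transpose(-2, -1)) * self.scale

        # "Agreement": cosine similarity of each query's correlation vector
        # with the mean correlation vector; outlier rows score low.
        consensus = corr.mean(dim=1, keepdim=True)            # (B, 1, Nk)
        agree = F.cosine_similarity(corr, consensus, dim=-1)  # (B, Nq)
        gate = torch.sigmoid(agree).unsqueeze(-1)             # (B, Nq, 1)

        # Damp the attention weights of low-agreement (likely flawed) rows.
        attn = torch.softmax(corr, dim=-1) * gate
        return torch.matmul(attn, v)
```

Calling the module with the same tensor as query and context gives a self-attention variant, while passing features gathered from reference frames as context gives cross-attention, mirroring the abstract's claim that the module integrates into both block types.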