Enhancing visual tracking with a unified temporal Transformer framework

Bibliographic Details
Published in: IEEE Transactions on Intelligent Vehicles, 2024, pp. 1-15
Authors: Zhang, Tianlu; Jin, Ziniu; Debattista, Kurt; Zhang, Qiang; Han, Jungong
Format: Article
Language: English
Abstract: Visual object tracking is an essential research topic in computer vision with numerous practical applications, including visual surveillance systems, autonomous vehicles and intelligent transportation systems. It involves tackling various challenges such as motion blur, occlusion and distractors, which require trackers to leverage temporal information, including temporal appearance information, temporal trajectory information and temporal context information. However, existing trackers usually focus on exploiting one specific type of temporal information while neglecting the complementarity of the different types. Additionally, cross-frame correlations that enable the transfer of diverse temporal information during tracking are under-explored. In this work, we propose a Unified Temporal Transformer Framework (UTTF) for robust visual tracking. Our framework effectively establishes multi-scale cross-frame relationships within historical frames and exploits the complementary information among three typical temporal information sources. Specifically, a Pyramid Spatial-Temporal Transformer Encoder (PSTTE) is designed to mutually reinforce historical features by establishing multi-scale associations (i.e., token-level, semantic-level and frame-level). Furthermore, an Adaptive Fusion Transformer Decoder (AFTD) is proposed to adaptively aggregate informative temporal cues from historical frames to enhance features of the current frame. Moreover, the proposed UTTF network can be easily extended to various tracking frameworks. Our experiments on seven prevalent visual object tracking benchmarks demonstrate that our proposed trackers outperform existing ones, establishing new state-of-the-art results.
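To make the decoder idea in the abstract concrete, below is a minimal PyTorch sketch of temporal cross-attention fusion in the spirit of the AFTD: tokens of the current frame attend to tokens pooled from historical frames, so informative temporal cues flow into the current-frame features. All names and shapes here (TemporalFusionDecoder, d_model, the number of history frames) are illustrative assumptions, not the authors' released implementation.

```python
# Illustrative sketch only; module and parameter names are assumptions,
# not the paper's actual AFTD code.
import torch
import torch.nn as nn

class TemporalFusionDecoder(nn.Module):
    """Cross-attention block: current-frame tokens query historical tokens."""

    def __init__(self, d_model: int = 256, n_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.ReLU(inplace=True),
            nn.Linear(4 * d_model, d_model),
        )
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, current: torch.Tensor, history: torch.Tensor) -> torch.Tensor:
        # current: (B, N, C) tokens of the current frame
        # history: (B, T*N, C) tokens concatenated over T historical frames
        attended, _ = self.cross_attn(query=current, key=history, value=history)
        x = self.norm1(current + attended)   # residual keeps current-frame features
        return self.norm2(x + self.ffn(x))   # position-wise feed-forward refinement

# Usage: fuse cues from 4 historical frames into the current frame's tokens.
decoder = TemporalFusionDecoder()
current = torch.randn(2, 64, 256)       # B=2, N=64 tokens, C=256 channels
history = torch.randn(2, 4 * 64, 256)   # T=4 history frames of 64 tokens each
fused = decoder(current, history)       # (2, 64, 256)
```

The attention weights here play the role of the adaptive aggregation the abstract describes: history tokens that correlate strongly with the current frame contribute more to the fused features. The paper's encoder (PSTTE) would additionally refine `history` at multiple scales before this step.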
ISSN: 2379-8858, 2379-8904
DOI: 10.1109/TIV.2024.3398405