One-Stream Vision-Language Memory Network for Object Tracking

Bibliographic details
Published in: IEEE Transactions on Multimedia, 2024, Vol. 26, pp. 1720-1730
Main authors: Zhang, Huanlong; Wang, Jingchao; Zhang, Jianwei; Zhang, Tianzhu; Zhong, Bineng
Format: Article
Language: English
Description
Summary: Most existing tracking methods try to represent the target by exploiting as much visual information as possible with various deep networks. However, an appearance model alone can hardly describe the attribute features of the target, so trackers fail to adapt to complex visual surroundings. In this article, inspired by brain-like intelligence, we propose a One-stream Vision-Language Memory network (OVLM) for object tracking. First, we combine vision and language to build the target model, using the semantic information in the language to compensate for the instability of visual information and making the target model more stable under complex appearance changes. Second, to build a more compact target model, we propose a memory token selection mechanism that uses linguistic information to eliminate tokens that do not contain target information. Furthermore, to provide better visual information for target modeling, we propose a language-based evaluation method that selects high-quality target samples to be stored in the memory. Finally, OVLM achieves a 64.7% success rate on the large-scale tracking benchmark dataset TNL2K, outperforming the previous best result (VLT) by 11.6%. By exposing the possibility of the vision-language memory network, we aim to draw greater attention to it and open up new avenues for vision-language tracking.
ISSN: 1520-9210, 1941-0077
DOI: 10.1109/TMM.2023.3285441