A Spatio-Temporal Robust Tracker with Spatial-Channel Transformer and Jitter Suppression

Bibliographic details
Published in:	International journal of computer vision 2024-05, Vol.132 (5), p.1645-1658
Main authors:	Zhao, Shaochuan, Xu, Tianyang, Wu, Xiao-Jun, Kittler, Josef
Format:	Article
Language:	English
Subjects:	Artificial Intelligence; Aspect ratio; Computer Imaging; Computer Science; Correlation analysis; Image Processing and Computer Vision; Localization; Misalignment; Modelling; Optical tracking; Pattern Recognition; Pattern Recognition and Graphics; Robustness; Smoothness; Spatial analysis; Special Issue on Robust Vision; Transformers; Vibration; Vision; Visualization
Online access:	Full text

Description
Summary:	The robustness of visual object tracking is reflected not only in the accuracy of the target localisation in every single frame, but also in the smoothness of the predicted motion of the tracked object across consecutive frames. From the perspective of appearance modelling, the success of state-of-the-art Transformer-based trackers derives from their ability to adaptively associate the representations of related spatial regions. However, the absence of attention in the channel dimension hinders the realisation of their potential tracking capacity. To cope with the commonly occurring misalignment of the spatial scale between the template and a search patch, we propose a novel cross-channel correlation mechanism. Accordingly, the relevance of multi-channel features in the channel Transformer is modelled using two different sources of information. The result is a novel spatial-channel Transformer, which integrates information conveyed by features along both the spatial and channel directions. For temporal modelling, to quantify the temporal smoothness, we propose a jitter metric that measures the cross-frame variation of the predicted bounding boxes as a function of parameters such as centre displacement, area, and aspect ratio. As the changes of an object between consecutive frames are limited, the proposed jitter loss can be used to monitor the temporal behaviour of the tracking results and penalise erroneous predictions during the training stage, thus enhancing the temporal stability of an appearance-based tracker. Extensive experiments on several well-known benchmarking datasets demonstrate the robustness of the proposed tracker.
ISSN:	0920-5691 1573-1405
DOI:	10.1007/s11263-023-01902-x
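
Illustrative sketch:	The summary describes a jitter metric built from the cross-frame change in centre displacement, area, and aspect ratio of predicted bounding boxes, but it does not give the formula. The Python sketch below is one plausible reading and not the authors' definition. The function names `box_jitter` and `sequence_jitter`, the `(cx, cy, w, h)` box layout, the normalisation by the previous box diagonal, the log-ratio terms, and the equal default weights are all assumptions made for illustration.

```python
import math


def box_jitter(prev, curr, w_centre=1.0, w_area=1.0, w_ratio=1.0):
    """Hypothetical jitter between two consecutive boxes.

    Boxes are (cx, cy, w, h) with positive w and h. The three terms and
    their weights are illustrative assumptions, not the paper's formula.
    """
    pcx, pcy, pw, ph = prev
    ccx, ccy, cw, ch = curr
    # Centre displacement, normalised by the previous box diagonal so the
    # term does not depend on the absolute size of the target.
    centre = math.hypot(ccx - pcx, ccy - pcy) / math.hypot(pw, ph)
    # Absolute log-ratio of areas: 0 when the area is unchanged,
    # symmetric for growth and shrinkage.
    area = abs(math.log((cw * ch) / (pw * ph)))
    # Absolute log-ratio of aspect ratios (width / height).
    ratio = abs(math.log((cw / ch) / (pw / ph)))
    return w_centre * centre + w_area * area + w_ratio * ratio


def sequence_jitter(boxes):
    """Mean pairwise jitter over a sequence of predicted boxes."""
    if len(boxes) < 2:
        return 0.0
    terms = [box_jitter(a, b) for a, b in zip(boxes, boxes[1:])]
    return sum(terms) / len(terms)


# A smoothly moving box versus one whose position and size oscillate.
smooth = [(10.0 + t, 20.0, 30.0, 40.0) for t in range(5)]
jittery = [(10.0 + 8 * (t % 2), 20.0, 30.0 + 10 * (t % 2), 40.0) for t in range(5)]
print(sequence_jitter(smooth), sequence_jitter(jittery))
```

Under these assumptions, a sequence whose boxes change little between frames scores lower than one that oscillates. For use as a training penalty, as the summary suggests, the same terms would need to be expressed with differentiable tensor operations.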