Learning reliable modal weight with transformer for robust RGBT tracking
Published in: Knowledge-Based Systems, 2022-08, Vol. 249, p. 108945, Article 108945
Main authors:
Format: Article
Language: English
Subjects:
Online access: Full text
Abstract: Many Siamese-based RGBT trackers have been designed in recent years for fast tracking. However, their correlation operation is a local linear matching process that easily loses the semantic information high-precision trackers inevitably require. In this paper, we propose a strong cross-modal model based on the transformer for robust RGBT tracking. Specifically, a simple dual-flow convolutional network is designed to extract and fuse dual-modal features with comparatively low complexity. In addition, to enhance the feature representation and deepen semantic features, a modal weight allocation strategy and a backbone feature extraction network based on a modified ResNet-50 are designed. An attention-based transformer feature fusion network is then adopted to improve long-distance feature association and reduce the loss of semantic information. Finally, a classification and regression subnetwork is investigated to accurately predict the state of the target. Extensive experiments on the RGBT234, RGBT210, GTOT and LasHeR datasets demonstrate superior tracking performance compared with state-of-the-art RGBT trackers.
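The abstract describes a dual-flow convolutional network whose RGB and thermal features are fused under a learned modal weight. Below is a minimal, hypothetical PyTorch sketch of that fusion idea; the module name DualFlowFusion, the layer sizes, and the 3-channel thermal input are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class DualFlowFusion(nn.Module):
    """Per-modality shallow convolutional flows plus learned modal weights (assumed design)."""
    def __init__(self, channels=256):
        super().__init__()
        # One shallow convolutional flow per modality (assumed structure).
        self.rgb_flow = nn.Sequential(nn.Conv2d(3, channels, 3, padding=1), nn.ReLU())
        self.tir_flow = nn.Sequential(nn.Conv2d(3, channels, 3, padding=1), nn.ReLU())
        # Modal weight allocation: predict one reliability weight per modality
        # from the globally pooled, concatenated features.
        self.weight_head = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(2 * channels, 2))

    def forward(self, rgb, tir):
        f_rgb, f_tir = self.rgb_flow(rgb), self.tir_flow(tir)
        w = torch.softmax(self.weight_head(torch.cat([f_rgb, f_tir], dim=1)), dim=1)
        # Broadcast the two scalar weights over (C, H, W) and fuse.
        return w[:, 0, None, None, None] * f_rgb + w[:, 1, None, None, None] * f_tir

# Dummy RGB and thermal crops of the same spatial size.
fused = DualFlowFusion()(torch.randn(1, 3, 128, 128), torch.randn(1, 3, 128, 128))
print(fused.shape)  # torch.Size([1, 256, 128, 128])
```

The softmax over the two predicted weights is one plausible way to realize a "reliable modal weight"; the paper may use a different allocation mechanism.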
• An RGBT tracking framework based on the transformer is designed, which enhances long-distance feature association and reduces the loss of semantic information. To our knowledge, this is the first work to incorporate a transformer into RGBT tracking.
• A shallow convolutional network is designed to extract and fuse multi-modal information, which significantly simplifies the computation. Moreover, an optimal modal weight allocation strategy is proposed to obtain reliable weights for effectively refining the fused features.
• A classification and regression subnetwork with an added center branch is adopted to reduce background interference and further improve the accuracy of target prediction (see the sketch after this list).
• Extensive experimental results on four large benchmark datasets, RGBT234 (Li et al., 2019), RGBT210 (Li et al., 2017), GTOT (Li et al., 2016) and LasHeR (Li et al., 2022), indicate that the proposed tracker outperforms state-of-the-art RGBT trackers.
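The highlights mention an attention-based transformer fusion stage and a classification-regression subnetwork with a center branch. The following is a speculative PyTorch sketch of how such a stage could be wired together; TransformerFusionHead, its dimensions, and the (l, t, r, b) box parameterization are assumptions for illustration rather than the paper's actual design.

```python
import torch
import torch.nn as nn

class TransformerFusionHead(nn.Module):
    """Joint self-attention over template and search tokens, then three prediction branches."""
    def __init__(self, dim=256, heads=8, layers=2):
        super().__init__()
        enc_layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=layers)
        self.cls = nn.Linear(dim, 2)   # foreground / background score per location
        self.reg = nn.Linear(dim, 4)   # box offsets, e.g. (l, t, r, b)
        self.ctr = nn.Linear(dim, 1)   # center branch to down-weight background locations

    def forward(self, template_tokens, search_tokens):
        # Concatenating both token sets lets self-attention model long-distance
        # associations between template and search-region features.
        tokens = torch.cat([template_tokens, search_tokens], dim=1)
        fused = self.encoder(tokens)[:, template_tokens.shape[1]:]  # keep the search part
        return self.cls(fused), self.reg(fused), self.ctr(fused).sigmoid()

# Token sequences of shape (batch, num_tokens, dim) from flattened feature maps.
head = TransformerFusionHead()
cls, reg, ctr = head(torch.randn(1, 64, 256), torch.randn(1, 256, 256))
print(cls.shape, reg.shape, ctr.shape)  # (1, 256, 2) (1, 256, 4) (1, 256, 1)
```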
ISSN: 0950-7051, 1872-7409
DOI: 10.1016/j.knosys.2022.108945