Adaptive feature fusion for visual object tracking
Published in: Pattern Recognition, 2021-03, Vol. 111, p. 107679, Article 107679
Main authors: , , ,
Format: Article
Language: English
Subjects:
Online access: Full text
Highlights:
- We propose an adaptive feature fusion mechanism that provides both semantic and discriminative feature representations by automatically fusing multi-level convolutional layers (see the sketch after this list).
- We reformulate the update strategy. By jointly training the projection matrix layer and the correlation layer, a more convincing target localization formulation can be achieved.
- We validate our method on several benchmark datasets against state-of-the-art methods. The experimental results and corresponding analysis demonstrate the merit of the proposed tracker.
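As a rough illustration of the fusion idea in the first highlight, the following PyTorch sketch derives per-level weights from the template-instance correlation and fuses projected multi-level features. The module name, channel sizes, pooling size, and the cosine-based weighting are illustrative assumptions, not the paper's released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class AdaptiveFeatureFusion(nn.Module):
    """Fuse shallow-to-deep backbone features with weights derived from the
    correlation between template and instance (search-region) features."""

    def __init__(self, in_channels, fused_channels=256, out_size=(22, 22)):
        super().__init__()
        self.out_size = out_size
        # 1x1 projections bring every backbone level to a common channel width.
        self.projections = nn.ModuleList(
            [nn.Conv2d(c, fused_channels, kernel_size=1) for c in in_channels]
        )

    def forward(self, template_feats, instance_feats):
        # template_feats / instance_feats: lists of per-level feature maps.
        scores, resized = [], []
        for proj, z, x in zip(self.projections, template_feats, instance_feats):
            z = F.adaptive_avg_pool2d(proj(z), self.out_size)
            x = F.adaptive_avg_pool2d(proj(x), self.out_size)
            # Cosine correlation between template and instance at this level;
            # a stronger response marks this level as more reliable right now.
            scores.append(F.cosine_similarity(z.flatten(1), x.flatten(1), dim=1))
            resized.append(x)
        # Turn the per-level correlation scores into adaptive fusion weights.
        weights = torch.softmax(torch.stack(scores, dim=1), dim=1)  # (B, num_levels)
        fused = sum(w.view(-1, 1, 1, 1) * f for w, f in zip(weights.unbind(dim=1), resized))
        return fused


# Example: fuse conv3/conv4/conv5 features of a ResNet-like backbone.
fusion = AdaptiveFeatureFusion(in_channels=[512, 1024, 2048])
z_feats = [torch.randn(1, c, s, s) for c, s in [(512, 15), (1024, 8), (2048, 8)]]
x_feats = [torch.randn(1, c, s, s) for c, s in [(512, 31), (1024, 16), (2048, 16)]]
out = fusion(z_feats, x_feats)  # (1, 256, 22, 22)
```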
Abstract: Recent advanced trackers, which combine a discriminative classification component with dedicated bounding-box estimation, have achieved improved performance in the visual tracking community. The most essential factor in this development is the use of different Convolutional Neural Networks (CNNs), which significantly improves model capacity through offline-trained deep feature representations. Although powerful deep architectures emphasize semantic appearance through high-dimensional latent variables, effective feature adaptation during the online tracking stage has not yet been sufficiently considered. To this end, we argue for exploring hierarchical and complementary appearance descriptors from different convolutional layers to achieve online tracking adaptation. In this paper, we therefore propose an adaptive feature fusion mechanism that balances the detection granularities of shallow to deep convolutional layers. Specifically, the correlation between the template and the instance is employed to generate adaptive weights that improve saliency and discrimination. In addition, to account for temporal appearance variation, the projection matrix for the multi-channel inputs is jointly updated with the correlation classifier to further enhance robustness. Experimental results on four recent benchmarks, i.e., OTB-2015, VOT2018, LaSOT and TrackingNet, demonstrate the effectiveness and robustness of the proposed method, with superior performance compared to state-of-the-art approaches.
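The joint update of the projection matrix and the correlation classifier mentioned in the abstract could be sketched as a small gradient-based refinement over stored tracking samples. The class and function below, the MSE loss, and the hyper-parameters (filter size, step count, learning rate) are illustrative assumptions rather than the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ProjectionCorrelationHead(nn.Module):
    """Channel projection followed by a target-specific correlation filter."""

    def __init__(self, in_channels=256, proj_channels=64, filter_size=4):
        super().__init__()
        # Projection matrix for the multi-channel input, realised as a 1x1 conv.
        self.projection = nn.Conv2d(in_channels, proj_channels, kernel_size=1, bias=False)
        # Correlation classifier: one filter slid over the projected feature map.
        self.filter = nn.Parameter(1e-3 * torch.randn(1, proj_channels, filter_size, filter_size))

    def forward(self, feat):
        return F.conv2d(self.projection(feat), self.filter)


def joint_online_update(head, memory_feats, memory_labels, steps=5, lr=1e-2):
    """Jointly refine the projection and the correlation filter on stored frames.

    memory_feats:  (N, C, H, W) features collected from past frames.
    memory_labels: (N, 1, h, w) Gaussian score maps matching the response size.
    """
    optimizer = torch.optim.SGD(
        [head.filter, *head.projection.parameters()], lr=lr, momentum=0.9
    )
    for _ in range(steps):
        optimizer.zero_grad()
        loss = F.mse_loss(head(memory_feats), memory_labels)
        loss.backward()
        optimizer.step()
    return head
```

Updating both components together, rather than the filter alone, lets the low-dimensional projection track temporal appearance changes of the target as well.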
ISSN: 0031-3203, 1873-5142
DOI: 10.1016/j.patcog.2020.107679