SiamMaskAttn: inverted residual attention block fusing multi-scale feature information for multitask visual object tracking networks
Published in: | Signal, image and video processing, 2024-03, Vol.18 (2), p.1305-1316 |
---|---|
Main authors: | , |
Format: | Article |
Language: | eng |
Subjects: | |
Online access: | Full text |
Summary: | Multitask learning that combines visual object tracking with other computer vision tasks has received increasing attention from researchers. Among such methods, the SiamMask algorithm accomplishes both object tracking and object segmentation by using a Siamese backbone network and a three-branch regression head. The mask refinement branch is the core innovation of SiamMask; it hierarchically integrates the features of the search region with the tracking correlation score maps. However, SiamMask and its subsequent improved variants do not fully integrate the target semantic information contained in multi-scale features into the mask refinement branch. To address this problem, a module named the inverted residual attention block is proposed, which combines the inverted residual structure with a channel attention mechanism. The channel attention mechanism enhances the key information of the object and suppresses background noise by assigning weights to the feature channels output by different convolution kernels, thereby better handling the motion and deformation of the tracked object. Based on the proposed module and a spatial attention mechanism, a novel method for multi-scale fusion of search-region features and tracking correlation score maps is proposed. The spatial attention mechanism helps the network focus on the region where the object is located and reduces sensitivity to background interference, thus improving the accuracy and stability of tracking. Ablation experiments conducted with the same hardware and datasets show that the proposed improvements to the mask refinement branch are effective. Compared with the baseline SiamMask, the proposed method achieves comparable segmentation results on the DAVIS datasets at improved speed. The expected average overlap on VOT-2018 increases by 3.7%.
The total number of parameters is reduced by 6.6%, including a 53.2% reduction in the number of parameters in the mask refinement branch. |
---|---|
ISSN: | 1863-1703 1863-1711 |
DOI: | 10.1007/s11760-023-02827-1 |
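
The summary describes channel attention as assigning a weight to each feature channel so that informative channels are enhanced and background channels are suppressed. The record does not say which channel attention variant the article uses, so the following is only a minimal sketch that assumes a squeeze-and-excitation style design (global average pooling, two small fully connected layers, a sigmoid gate). The function name `channel_attention`, the weight matrices `w1` and `w2`, and the toy input are all hypothetical and chosen for illustration; the inverted residual structure itself is not modeled here.

```python
import math


def sigmoid(x):
    """Logistic function mapping any real value into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def channel_attention(feature_map, w1, w2):
    """Assumed squeeze-and-excitation style channel attention.

    feature_map: list of C channels, each an H x W nested list of floats.
    w1: hidden x C weight matrix (hypothetical, normally learned).
    w2: C x hidden weight matrix (hypothetical, normally learned).
    Returns the rescaled feature map and the per-channel weights.
    """
    # Squeeze: global average pooling gives one scalar per channel.
    pooled = [
        sum(sum(row) for row in channel) / (len(channel) * len(channel[0]))
        for channel in feature_map
    ]
    # Excitation: fully connected layer followed by ReLU ...
    hidden = [max(0.0, sum(w * p for w, p in zip(row, pooled))) for row in w1]
    # ... then a second fully connected layer followed by a sigmoid gate.
    weights = [sigmoid(sum(w * h for w, h in zip(row, hidden))) for row in w2]
    # Rescale: multiply every value of a channel by that channel's weight.
    scaled = [
        [[value * weight for value in row] for row in channel]
        for channel, weight in zip(feature_map, weights)
    ]
    return scaled, weights


# Toy input: channel 0 carries signal, channel 1 is empty.
features = [
    [[1.0, 1.0], [1.0, 1.0]],
    [[0.0, 0.0], [0.0, 0.0]],
]
w1 = [[1.0, 0.0]]      # one hidden unit that reads channel 0
w2 = [[2.0], [-2.0]]   # raises the gate for channel 0, lowers it for channel 1

scaled, weights = channel_attention(features, w1, w2)
print(weights)
```

With these hand-picked weights the gate for the first channel ends up higher than the gate for the second, which mirrors the enhance-and-suppress behavior described in the summary. In a trained network the weights would be learned rather than set by hand.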