Mutual-Guidance Transformer-Embedding Network for Video Salient Object Detection

Bibliographic Details
Published in: IEEE Signal Processing Letters, 2022, Vol. 29, pp. 1674-1678
Authors: Min, Dingyao; Zhang, Chao; Lu, Yukang; Fu, Keren; Zhao, Qijun
Format: Article
Language: English
Description
Abstract: Video salient object detection (VSOD) aims to locate the most attractive objects in video sequences by exploiting spatial and temporal cues. Previous methods mainly utilize convolutional neural networks (CNNs) to fuse or complement RGB and optical-flow cues via simple strategies. To take full advantage of CNNs and the recently emerged Transformers, this letter proposes a novel mutual-guidance Transformer-embedding network, called MGT-Net, whose mutual-guidance multi-head attention mechanism (MGMA) explores more sophisticated long-range cross-modal interactions. This mechanism is built into a new mutual-guidance Transformer (MGTrans) module that propagates long-range contextual dependencies in one modality based on information from the other. To the best of our knowledge, MGT-Net is the first VSOD model that embeds Transformers as modules into CNNs for improved performance. Prior to MGTrans, we also propose and deploy a feature purification module (FPM) to purify noisy backbone features. Experimental results on five benchmark datasets demonstrate the state-of-the-art performance of MGT-Net.
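The record gives no implementation details beyond the abstract, but as a rough illustration of what a mutual-guidance cross-modal attention between RGB and optical-flow features might look like, the following minimal PyTorch sketch lets each modality's queries attend over the other modality's keys and values. The class name, dimensions, and residual wiring here are illustrative assumptions, not the authors' actual MGMA design.

```python
import torch
import torch.nn as nn

class MutualGuidanceAttention(nn.Module):
    """Hypothetical sketch of mutual-guidance multi-head attention:
    each modality's queries attend over the other modality's keys and
    values, so long-range context in one stream is propagated under
    the guidance of the other. Sizes are illustrative, not the paper's."""
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.rgb_from_flow = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.flow_from_rgb = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, rgb_tokens, flow_tokens):
        # RGB queries are guided by optical-flow keys/values, and vice versa;
        # residual connections keep each stream's original features.
        rgb_out, _ = self.rgb_from_flow(rgb_tokens, flow_tokens, flow_tokens)
        flow_out, _ = self.flow_from_rgb(flow_tokens, rgb_tokens, rgb_tokens)
        return rgb_tokens + rgb_out, flow_tokens + flow_out

# Usage: tokens would be flattened spatial features from CNN backbones,
# e.g. shape (batch, H*W, dim) for each modality.
rgb = torch.randn(2, 196, 256)
flow = torch.randn(2, 196, 256)
rgb2, flow2 = MutualGuidanceAttention()(rgb, flow)
```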
ISSN: 1070-9908, 1558-2361
DOI: 10.1109/LSP.2022.3192753