TCNet: Co-Salient Object Detection via Parallel Interaction of Transformers and CNNs

Bibliographic details
Published in:	IEEE Transactions on Circuits and Systems for Video Technology, June 2023, Vol. 33 (6), pp. 2600-2615
Authors:	Ge, Yanliang; Zhang, Qiao; Xiang, Tian-Zhu; Zhang, Cong; Bi, Hongbo
Format:	Article
Language:	English
Keywords:	Artificial neural networks; attention mechanism; Co-salient object detection; Critical components; Decoding; Feature extraction; feature interaction; Modules; Object detection; Object recognition; Salience; saliency detection; Semantics; Task analysis; transformer; Transformers; Visualization

Description
Abstract:	The purpose of co-salient object detection (CoSOD) is to detect the salient objects that co-occur in a group of relevant images. CoSOD has benefited significantly from recent advances in convolutional neural networks (CNNs). However, CNNs are generally limited in modeling long-range feature dependencies, which are crucial for CoSOD. In the vision transformer, the self-attention mechanism captures global dependencies but unfortunately destroys local spatial details, which are also essential for CoSOD. To address these issues, we propose a dual network structure, called TCNet, which efficiently extracts both local information and global representations for co-saliency learning through the parallel interaction of Transformers and CNNs. Specifically, it contains three critical components: the mutual consensus module (MCM), the consensus complementary module (CCM), and the group consistent progressive decoder (GCPD). MCM captures the global consensus from the high-level features of the two branches, which then guides the subsequent integration of consensus cues from both branches at each level. Next, CCM is designed to effectively fuse the consensus of local information and global contexts from different levels of the two branches. Finally, GCPD maintains group feature consistency and predicts accurate co-saliency maps. The proposed TCNet is evaluated on five challenging CoSOD benchmark datasets using six widely used metrics, and the results show that it outperforms other state-of-the-art methods for co-salient object detection.
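The abstract describes MCM only as capturing a "global consensus" from high-level features that then guides later fusion. As a rough, non-authoritative illustration of what a group consensus can mean in co-saliency work, the minimal Python sketch below pools each image's feature map, averages the pooled vectors across the group, and scores every spatial position by its cosine similarity to that consensus. This is a generic technique, not TCNet's MCM; the functions `group_consensus` and `consensus_maps` and the flattened list-of-vectors feature layout are hypothetical assumptions made for illustration.

```python
import math
from typing import List

Vector = List[float]
# Hypothetical layout: one feature vector per spatial position (H*W flattened).
FeatureMap = List[Vector]


def mean_vector(vectors: List[Vector]) -> Vector:
    """Element-wise mean of equally sized vectors."""
    n = len(vectors)
    return [sum(components) / n for components in zip(*vectors)]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    if na == 0.0 or nb == 0.0:
        return 0.0
    return dot / (na * nb)


def group_consensus(group: List[FeatureMap]) -> Vector:
    """Hypothetical consensus: global average pooling per image, then mean over the group."""
    per_image = [mean_vector(fmap) for fmap in group]
    return mean_vector(per_image)


def consensus_maps(group: List[FeatureMap]) -> List[List[float]]:
    """Score each position of each image by its similarity to the group consensus."""
    consensus = group_consensus(group)
    # Negative similarities are clipped to 0 (ReLU-style), so positions
    # that point away from the consensus are suppressed.
    return [[max(0.0, cosine(v, consensus)) for v in fmap] for fmap in group]


if __name__ == "__main__":
    # Two toy "images", each with two positions and 2-D features.
    group = [
        [[1.0, 0.0], [0.0, 1.0]],
        [[1.0, 0.0], [0.0, -1.0]],
    ]
    print(group_consensus(group))  # [0.5, 0.0]
    print(consensus_maps(group))   # [[1.0, 0.0], [1.0, 0.0]]
```

In the toy group, only the feature shared by both images (`[1.0, 0.0]`) survives as the consensus, so the position carrying it receives the highest score in each image. A learned module such as the one the abstract describes would presumably replace this fixed pooling and similarity with trainable attention over both branches.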
ISSN:	1051-8215, 1558-2205
DOI:	10.1109/TCSVT.2022.3225865