CTIF-Net: A CNN-Transformer Iterative Fusion Network for Salient Object Detection

Bibliographic Details
Published in: IEEE Transactions on Circuits and Systems for Video Technology, 2024-05, Vol. 34 (5), pp. 3795-3805
Main Authors: Yuan, Junbin; Zhu, Aiqing; Xu, Qingzhen; Wattanachote, Kanoksak; Gong, Yongyi
Format: Article
Language: English
Description
Abstract: Capturing sufficient global context and rich spatial structure information is critical for dense prediction tasks. Convolutional Neural Networks (CNNs) are particularly adept at modeling fine-grained local features, while Transformers excel at modeling global context. The two architectures are therefore complementary, and designing a network that efficiently fuses them to fully leverage their respective strengths and achieve more accurate detection is a promising and worthwhile research direction. In this paper, we introduce a novel CNN-Transformer Iterative Fusion Network (CTIF-Net) for salient object detection. It combines CNN and Transformer features through a parallel dual-encoder structure and a feature iterative fusion module. First, CTIF-Net extracts features from the image with the CNN encoder and the Transformer encoder, respectively. Then, two feature convertors and a feature iterative fusion module combine and iteratively refine the two sets of features. Experimental results on multiple SOD datasets show that CTIF-Net outperforms 17 state-of-the-art methods, achieving better scores on mainstream evaluation metrics such as F-measure, S-measure, and MAE. Code can be found at https://github.com/danielfaster/CTIF-Net/ .
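
The sketch below illustrates the general idea described in the abstract: two parallel encoders (a CNN branch for local features and a Transformer branch for global context), feature convertors, and an iterative fusion step that refines the combined features. All module names, channel sizes, iteration counts, and the fusion rule are illustrative assumptions and are not taken from the authors' implementation; refer to the linked repository for the actual CTIF-Net code.

```python
# Minimal sketch of a parallel CNN-Transformer encoder with iterative
# feature fusion, loosely following the abstract's description.
# Every design choice here (dims, depths, fusion rule) is an assumption.
import torch
import torch.nn as nn


class ConvEncoder(nn.Module):
    """Toy CNN branch: captures fine-grained local features."""
    def __init__(self, dim=64):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv2d(3, dim, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(dim, dim, 3, stride=2, padding=1), nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.stem(x)  # (B, dim, H/4, W/4)


class TransformerEncoder(nn.Module):
    """Toy Transformer branch: captures global context over patch tokens."""
    def __init__(self, dim=64, patch=4, depth=2, heads=4):
        super().__init__()
        self.patch_embed = nn.Conv2d(3, dim, patch, stride=patch)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, x):
        tokens = self.patch_embed(x)             # (B, dim, H/4, W/4)
        b, c, h, w = tokens.shape
        seq = tokens.flatten(2).transpose(1, 2)  # (B, HW, dim)
        seq = self.encoder(seq)
        return seq.transpose(1, 2).reshape(b, c, h, w)


class IterativeFusion(nn.Module):
    """Assumed fusion rule: repeatedly mix the two feature maps and refine."""
    def __init__(self, dim=64, iterations=3):
        super().__init__()
        self.iterations = iterations
        self.convert_cnn = nn.Conv2d(dim, dim, 1)    # "feature convertors"
        self.convert_trans = nn.Conv2d(dim, dim, 1)
        self.refine = nn.Sequential(
            nn.Conv2d(2 * dim, dim, 3, padding=1), nn.ReLU(inplace=True))

    def forward(self, f_cnn, f_trans):
        f_cnn, f_trans = self.convert_cnn(f_cnn), self.convert_trans(f_trans)
        fused = f_cnn + f_trans
        for _ in range(self.iterations):
            fused = self.refine(torch.cat([fused, f_trans], dim=1)) + f_cnn
        return fused


class DualEncoderSOD(nn.Module):
    """Parallel dual-encoder saliency head (illustrative only)."""
    def __init__(self, dim=64):
        super().__init__()
        self.cnn = ConvEncoder(dim)
        self.transformer = TransformerEncoder(dim)
        self.fusion = IterativeFusion(dim)
        self.head = nn.Conv2d(dim, 1, 1)  # per-pixel saliency logits

    def forward(self, x):
        fused = self.fusion(self.cnn(x), self.transformer(x))
        return torch.sigmoid(self.head(fused))


if __name__ == "__main__":
    model = DualEncoderSOD()
    out = model(torch.randn(1, 3, 224, 224))
    print(out.shape)  # torch.Size([1, 1, 56, 56])
```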
ISSN: 1051-8215, 1558-2205
DOI: 10.1109/TCSVT.2023.3321190