CTIF-Net: A CNN-Transformer Iterative Fusion Network for Salient Object Detection
Capturing sufficient global context and rich spatial structure information is critical for dense prediction tasks. A Convolutional Neural Network (CNN) is particularly adept at modeling fine-grained local features, while a Transformer excels at modeling global context information. It is evident that CNN and Transformer exhibit complementary characteristics.
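As a concrete illustration of the CTIF-Net design summarized in the description field below (a parallel CNN/Transformer dual encoder, two feature convertors, and a feature iterative fusion module), here is a minimal PyTorch-style sketch. It is reconstructed from the abstract alone and is not the authors' implementation (their code is linked in the description); the module names, channel sizes, patch size, and number of fusion iterations are all hypothetical choices.

```python
# Minimal sketch (assumptions throughout): a parallel CNN/Transformer dual encoder
# with two feature convertors and an iterative fusion loop, loosely following the
# structure described in the abstract. Not the authors' code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ConvEncoder(nn.Module):
    """Stand-in CNN branch: captures fine-grained local features."""
    def __init__(self, channels=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, channels, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, stride=2, padding=1), nn.ReLU(inplace=True),
        )

    def forward(self, x):                      # (B, 3, H, W) -> (B, C, H/4, W/4)
        return self.net(x)


class TransformerBranch(nn.Module):
    """Stand-in Transformer branch: models global context over a patch grid."""
    def __init__(self, channels=64, patch=16, depth=2, heads=4):
        super().__init__()
        self.embed = nn.Conv2d(3, channels, kernel_size=patch, stride=patch)
        layer = nn.TransformerEncoderLayer(d_model=channels, nhead=heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, x):                      # (B, 3, H, W) -> (B, C, H/16, W/16)
        tokens = self.embed(x)                 # (B, C, h, w)
        b, c, h, w = tokens.shape
        seq = self.encoder(tokens.flatten(2).transpose(1, 2))   # (B, h*w, C)
        return seq.transpose(1, 2).reshape(b, c, h, w)


class CTIFSketch(nn.Module):
    """Parallel dual encoder + feature convertors + iterative fusion (hypothetical)."""
    def __init__(self, channels=64, iterations=3):
        super().__init__()
        self.cnn = ConvEncoder(channels)
        self.vit = TransformerBranch(channels)
        # "Feature convertors": project each branch into a shared feature space.
        self.to_shared_cnn = nn.Conv2d(channels, channels, 1)
        self.to_shared_vit = nn.Conv2d(channels, channels, 1)
        # One fusion block, applied repeatedly to refine the combined features.
        self.fuse = nn.Sequential(
            nn.Conv2d(2 * channels, channels, 3, padding=1), nn.ReLU(inplace=True))
        self.head = nn.Conv2d(channels, 1, 1)  # saliency logits
        self.iterations = iterations

    def forward(self, x):
        f_cnn = self.to_shared_cnn(self.cnn(x))        # local features
        f_vit = self.to_shared_vit(self.vit(x))        # global features
        f_vit = F.interpolate(f_vit, size=f_cnn.shape[-2:],
                              mode="bilinear", align_corners=False)
        fused = f_cnn
        for _ in range(self.iterations):               # iterative refinement
            fused = self.fuse(torch.cat([fused, f_vit], dim=1)) + f_cnn
        sal = self.head(fused)                         # (B, 1, H/4, W/4)
        return F.interpolate(sal, size=x.shape[-2:],
                             mode="bilinear", align_corners=False)


if __name__ == "__main__":
    model = CTIFSketch()
    out = model(torch.randn(1, 3, 224, 224))
    print(out.shape)                                   # torch.Size([1, 1, 224, 224])
```

The repeated `self.fuse` pass with a residual back to the CNN features is only one plausible reading of "iterative fusion"; the published CTIF-Net refines both feature sets through dedicated convertor and fusion modules, so consult the paper and the linked repository for the actual design.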
Saved in:
Published in: | IEEE Transactions on Circuits and Systems for Video Technology, 2024-05, Vol. 34 (5), p. 3795-3805 |
Main authors: | Yuan, Junbin; Zhu, Aiqing; Xu, Qingzhen; Wattanachote, Kanoksak; Gong, Yongyi |
Format: | Article |
Language: | English |
Subjects: | Artificial neural networks; CNN; Context; Decoding; Feature extraction; iterative fusion; Iterative methods; Modelling; Modules; Object recognition; Performance evaluation; Salience; salient object detection; Semantics; Task analysis; transformer; Transformers; Visualization |
Online access: | Order full text |
container_end_page | 3805 |
container_issue | 5 |
container_start_page | 3795 |
container_title | IEEE transactions on circuits and systems for video technology |
container_volume | 34 |
creator | Yuan, Junbin ; Zhu, Aiqing ; Xu, Qingzhen ; Wattanachote, Kanoksak ; Gong, Yongyi |
description | Capturing sufficient global context and rich spatial structure information is critical for dense prediction tasks. A Convolutional Neural Network (CNN) is particularly adept at modeling fine-grained local features, while a Transformer excels at modeling global context information. It is evident that CNN and Transformer exhibit complementary characteristics. Exploring the design of a network that efficiently fuses these two models to fully leverage their strengths and achieve more accurate detection is a promising and worthwhile research topic. In this paper, we introduce a novel CNN-Transformer Iterative Fusion Network (CTIF-Net) for salient object detection. It efficiently combines CNN and Transformer to achieve superior performance by using a parallel dual-encoder structure and a feature iterative fusion module. First, CTIF-Net extracts features from the image with the CNN and the Transformer, respectively. Second, two feature convertors and a feature iterative fusion module are employed to combine and iteratively refine the two sets of features. Experimental results on multiple SOD datasets show that CTIF-Net outperforms 17 state-of-the-art methods, achieving better scores on mainstream evaluation metrics such as F-measure, S-measure, and MAE. Code is available at https://github.com/danielfaster/CTIF-Net/ . |
doi_str_mv | 10.1109/TCSVT.2023.3321190 |
format | Article |
fulltext | fulltext_linktorsrc |
identifier | ISSN: 1051-8215 |
ispartof | IEEE transactions on circuits and systems for video technology, 2024-05, Vol.34 (5), p.3795-3805 |
issn | 1051-8215 1558-2205 |
language | eng |
recordid | cdi_ieee_primary_10268450 |
source | IEEE Electronic Library (IEL) |
subjects | Artificial neural networks; CNN; Context; Decoding; Feature extraction; iterative fusion; Iterative methods; Modelling; Modules; Object recognition; Performance evaluation; Salience; salient object detection; Semantics; Task analysis; transformer; Transformers; Visualization |
title | CTIF-Net: A CNN-Transformer Iterative Fusion Network for Salient Object Detection |
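The description field above reports results in F-measure, S-measure, and MAE. For reference, a hedged sketch of how the two simpler metrics are commonly computed for saliency maps follows (S-measure is more involved and omitted). The fixed threshold and the beta^2 = 0.3 weighting are common SOD conventions assumed here, not details taken from this record.

```python
# Hedged reference sketch of two common SOD metrics mentioned in the abstract:
# MAE and F-measure. beta^2 = 0.3 and a 0.5 threshold follow common SOD practice
# (assumptions, not taken from this record); S-measure is omitted.
import torch


def mae(pred: torch.Tensor, gt: torch.Tensor) -> float:
    """Mean absolute error between a [0, 1] saliency map and a binary ground truth."""
    return (pred - gt).abs().mean().item()


def f_measure(pred: torch.Tensor, gt: torch.Tensor,
              threshold: float = 0.5, beta2: float = 0.3) -> float:
    """Weighted F-measure at a single (assumed) binarization threshold."""
    binary = (pred >= threshold).float()
    tp = (binary * gt).sum()
    precision = tp / (binary.sum() + 1e-8)
    recall = tp / (gt.sum() + 1e-8)
    return ((1 + beta2) * precision * recall /
            (beta2 * precision + recall + 1e-8)).item()


pred = torch.rand(224, 224)                    # toy prediction map
gt = (torch.rand(224, 224) > 0.5).float()      # toy binary ground truth
print(mae(pred, gt), f_measure(pred, gt))
```

In practice, SOD papers often report the maximum or mean F-measure over a sweep of thresholds rather than a single fixed one.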