CTIF-Net: A CNN-Transformer Iterative Fusion Network for Salient Object Detection

Capturing sufficient global context and rich spatial structure information is critical for dense prediction tasks. Convolutional Neural Networks (CNNs) are particularly adept at modeling fine-grained local features, while Transformers excel at modeling global context. The two architectures are therefore complementary, and designing a network that efficiently fuses them to exploit their respective strengths and achieve more accurate detection is a promising research direction. In this paper, we introduce a novel CNN-Transformer Iterative Fusion Network (CTIF-Net) for salient object detection. It combines a CNN and a Transformer through a parallel dual-encoder structure and a feature iterative fusion module. First, CTIF-Net extracts features from the image with the CNN and the Transformer, respectively. Second, two feature convertors and the feature iterative fusion module combine and iteratively refine the two sets of features. Experimental results on multiple SOD datasets show that CTIF-Net outperforms 17 state-of-the-art methods on mainstream evaluation metrics such as F-measure, S-measure, and MAE. Code is available at https://github.com/danielfaster/CTIF-Net/ .

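The abstract outlines a parallel dual-encoder design: a CNN branch and a Transformer branch, two feature convertors that map each branch's output into a shared space, and a fusion module that refines the combined features over several iterations. The following PyTorch sketch is a minimal illustration of that structure, not the authors' implementation (which is at the GitHub link above); the class names, channel sizes, iteration count, and the assumption that both encoders return same-resolution NCHW feature maps are all hypothetical.

```python
import torch
import torch.nn as nn


class FeatureConvertor(nn.Module):
    """Hypothetical adapter that projects one branch's features into a
    shared channel space (name and design assumed, not from the paper)."""

    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.proj(x)


class IterativeFusion(nn.Module):
    """Toy iterative fusion: repeatedly re-mix the running estimate with the
    Transformer features. The real CTIF-Net module is more elaborate."""

    def __init__(self, ch: int, num_iters: int = 2):
        super().__init__()
        self.num_iters = num_iters
        self.mix = nn.Conv2d(2 * ch, ch, kernel_size=3, padding=1)

    def forward(self, f_cnn: torch.Tensor, f_trans: torch.Tensor) -> torch.Tensor:
        fused = f_cnn
        for _ in range(self.num_iters):
            # Each pass concatenates the current fused map with the
            # Transformer features and re-projects back to ch channels.
            fused = torch.relu(self.mix(torch.cat([fused, f_trans], dim=1)))
        return fused


class DualEncoderSketch(nn.Module):
    """Parallel dual-encoder skeleton: CNN and Transformer branches feed the
    convertors and the fusion module, then a 1-channel saliency head.
    Both encoders are assumed to return NCHW maps of the same spatial size."""

    def __init__(self, cnn_encoder: nn.Module, transformer_encoder: nn.Module,
                 cnn_ch: int, trans_ch: int, ch: int = 256):
        super().__init__()
        self.cnn_encoder = cnn_encoder                  # e.g. a ResNet trunk
        self.transformer_encoder = transformer_encoder  # e.g. a ViT/Swin trunk
        self.conv_c = FeatureConvertor(cnn_ch, ch)
        self.conv_t = FeatureConvertor(trans_ch, ch)
        self.fusion = IterativeFusion(ch)
        self.head = nn.Conv2d(ch, 1, kernel_size=1)     # saliency logits

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        f_cnn = self.conv_c(self.cnn_encoder(x))
        f_trans = self.conv_t(self.transformer_encoder(x))
        return self.head(self.fusion(f_cnn, f_trans))
```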

Bibliographic Details
Published in: IEEE Transactions on Circuits and Systems for Video Technology, 2024-05, Vol. 34 (5), pp. 3795-3805
Main authors: Yuan, Junbin; Zhu, Aiqing; Xu, Qingzhen; Wattanachote, Kanoksak; Gong, Yongyi
Format: Article
Language: English
Subjects: Artificial neural networks; CNN; Context; Decoding; Feature extraction; iterative fusion; Iterative methods; Modelling; Modules; Object recognition; Performance evaluation; Salience; salient object detection; Semantics; Task analysis; transformer; Transformers; Visualization
DOI: 10.1109/TCSVT.2023.3321190
ISSN: 1051-8215
EISSN: 1558-2205
Publisher: IEEE, New York