CTIF-Net: A CNN-Transformer Iterative Fusion Network for Salient Object Detection
Capturing sufficient global context and rich spatial structure information is critical for dense prediction tasks. A Convolutional Neural Network (CNN) is particularly adept at modeling fine-grained local features, while a Transformer excels at modeling global context information. It is evident that CNN and Transformer exhibit complementary characteristics.
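As a concrete illustration of the CTIF-Net design summarized in the description field below (a parallel CNN/Transformer dual encoder, two feature convertors, and a feature iterative fusion module), here is a minimal PyTorch-style sketch. It is reconstructed from the abstract alone and is not the authors' implementation (their code is linked in the description); the module names, channel sizes, patch size, and number of fusion iterations are all hypothetical choices.

```python
# Minimal sketch (assumptions throughout): a parallel CNN/Transformer dual encoder
# with two feature convertors and an iterative fusion loop, loosely following the
# structure described in the abstract. Not the authors' code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ConvEncoder(nn.Module):
    """Stand-in CNN branch: captures fine-grained local features."""
    def __init__(self, channels=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, channels, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, stride=2, padding=1), nn.ReLU(inplace=True),
        )

    def forward(self, x):                      # (B, 3, H, W) -> (B, C, H/4, W/4)
        return self.net(x)


class TransformerBranch(nn.Module):
    """Stand-in Transformer branch: models global context over a patch grid."""
    def __init__(self, channels=64, patch=16, depth=2, heads=4):
        super().__init__()
        self.embed = nn.Conv2d(3, channels, kernel_size=patch, stride=patch)
        layer = nn.TransformerEncoderLayer(d_model=channels, nhead=heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, x):                      # (B, 3, H, W) -> (B, C, H/16, W/16)
        tokens = self.embed(x)                 # (B, C, h, w)
        b, c, h, w = tokens.shape
        seq = self.encoder(tokens.flatten(2).transpose(1, 2))   # (B, h*w, C)
        return seq.transpose(1, 2).reshape(b, c, h, w)


class CTIFSketch(nn.Module):
    """Parallel dual encoder + feature convertors + iterative fusion (hypothetical)."""
    def __init__(self, channels=64, iterations=3):
        super().__init__()
        self.cnn = ConvEncoder(channels)
        self.vit = TransformerBranch(channels)
        # "Feature convertors": project each branch into a shared feature space.
        self.to_shared_cnn = nn.Conv2d(channels, channels, 1)
        self.to_shared_vit = nn.Conv2d(channels, channels, 1)
        # One fusion block, applied repeatedly to refine the combined features.
        self.fuse = nn.Sequential(
            nn.Conv2d(2 * channels, channels, 3, padding=1), nn.ReLU(inplace=True))
        self.head = nn.Conv2d(channels, 1, 1)  # saliency logits
        self.iterations = iterations

    def forward(self, x):
        f_cnn = self.to_shared_cnn(self.cnn(x))        # local features
        f_vit = self.to_shared_vit(self.vit(x))        # global features
        f_vit = F.interpolate(f_vit, size=f_cnn.shape[-2:],
                              mode="bilinear", align_corners=False)
        fused = f_cnn
        for _ in range(self.iterations):               # iterative refinement
            fused = self.fuse(torch.cat([fused, f_vit], dim=1)) + f_cnn
        sal = self.head(fused)                         # (B, 1, H/4, W/4)
        return F.interpolate(sal, size=x.shape[-2:],
                             mode="bilinear", align_corners=False)


if __name__ == "__main__":
    model = CTIFSketch()
    out = model(torch.randn(1, 3, 224, 224))
    print(out.shape)                                   # torch.Size([1, 1, 224, 224])
```

The repeated `self.fuse` pass with a residual back to the CNN features is only one plausible reading of "iterative fusion"; the published CTIF-Net refines both feature sets through dedicated convertor and fusion modules, so consult the paper and the linked repository for the actual design.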
Saved in:
Published in: | IEEE Transactions on Circuits and Systems for Video Technology, 2024-05, Vol. 34 (5), p. 3795-3805 |
Main authors: | Yuan, Junbin; Zhu, Aiqing; Xu, Qingzhen; Wattanachote, Kanoksak; Gong, Yongyi |
Format: | Article |
Language: | English |
Subjects: | Artificial neural networks; CNN; Context; Decoding; Feature extraction; iterative fusion; Iterative methods; Modelling; Modules; Object recognition; Performance evaluation; Salience; salient object detection; Semantics; Task analysis; transformer; Transformers; Visualization |
Online access: | Order full text |
container_end_page | 3805 |
container_issue | 5 |
container_start_page | 3795 |
container_title | IEEE transactions on circuits and systems for video technology |
container_volume | 34 |
creator | Yuan, Junbin ; Zhu, Aiqing ; Xu, Qingzhen ; Wattanachote, Kanoksak ; Gong, Yongyi |
description | Capturing sufficient global context and rich spatial structure information is critical for dense prediction tasks. A Convolutional Neural Network (CNN) is particularly adept at modeling fine-grained local features, while a Transformer excels at modeling global context information. It is evident that CNN and Transformer exhibit complementary characteristics. Exploring the design of a network that efficiently fuses these two models to fully leverage their strengths and achieve more accurate detection is a promising and worthwhile research topic. In this paper, we introduce a novel CNN-Transformer Iterative Fusion Network (CTIF-Net) for salient object detection. It efficiently combines CNN and Transformer to achieve superior performance by using a parallel dual-encoder structure and a feature iterative fusion module. First, CTIF-Net extracts features from the image with the CNN and the Transformer, respectively. Second, two feature convertors and a feature iterative fusion module are employed to combine and iteratively refine the two sets of features. Experimental results on multiple SOD datasets show that CTIF-Net outperforms 17 state-of-the-art methods, achieving better scores on mainstream evaluation metrics such as F-measure, S-measure, and MAE. Code is available at https://github.com/danielfaster/CTIF-Net/ . |
doi_str_mv | 10.1109/TCSVT.2023.3321190 |
format | Article |
fulltext | fulltext_linktorsrc |
identifier | ISSN: 1051-8215 |
ispartof | IEEE transactions on circuits and systems for video technology, 2024-05, Vol.34 (5), p.3795-3805 |
issn | 1051-8215 1558-2205 |
language | eng |
recordid | cdi_ieee_primary_10268450 |
source | IEEE Electronic Library (IEL) |
subjects | Artificial neural networks; CNN; Context; Decoding; Feature extraction; iterative fusion; Iterative methods; Modelling; Modules; Object recognition; Performance evaluation; Salience; salient object detection; Semantics; Task analysis; transformer; Transformers; Visualization |
title | CTIF-Net: A CNN-Transformer Iterative Fusion Network for Salient Object Detection |
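The description field above reports results in F-measure, S-measure, and MAE. For reference, a hedged sketch of how the two simpler metrics are commonly computed for saliency maps follows (S-measure is more involved and omitted). The fixed threshold and the beta^2 = 0.3 weighting are common SOD conventions assumed here, not details taken from this record.

```python
# Hedged reference sketch of two common SOD metrics mentioned in the abstract:
# MAE and F-measure. beta^2 = 0.3 and a 0.5 threshold follow common SOD practice
# (assumptions, not taken from this record); S-measure is omitted.
import torch


def mae(pred: torch.Tensor, gt: torch.Tensor) -> float:
    """Mean absolute error between a [0, 1] saliency map and a binary ground truth."""
    return (pred - gt).abs().mean().item()


def f_measure(pred: torch.Tensor, gt: torch.Tensor,
              threshold: float = 0.5, beta2: float = 0.3) -> float:
    """Weighted F-measure at a single (assumed) binarization threshold."""
    binary = (pred >= threshold).float()
    tp = (binary * gt).sum()
    precision = tp / (binary.sum() + 1e-8)
    recall = tp / (gt.sum() + 1e-8)
    return ((1 + beta2) * precision * recall /
            (beta2 * precision + recall + 1e-8)).item()


pred = torch.rand(224, 224)                    # toy prediction map
gt = (torch.rand(224, 224) > 0.5).float()      # toy binary ground truth
print(mae(pred, gt), f_measure(pred, gt))
```

In practice, SOD papers often report the maximum or mean F-measure over a sweep of thresholds rather than a single fixed one.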