Regular Constrained Multimodal Fusion for Image Captioning

More diverse and closer to human-like captions are of paramount importance in image captioning. Recent research has achieved significant advancements, with the majority adopting end-to-end encoder-decoder architectures that integrate specific feature-text processing. However, the homogeneity of thei...

Bibliographic Details
Published in:	IEEE transactions on circuits and systems for video technology 2024-11, Vol.34 (11), p.11900-11913
Main authors:	Wang, Liya; Chen, Haipeng; Liu, Yu; Lyu, Yingda
Format:	Article
Language:	eng
Subjects:	Cognition; Constraints; Data mining; Decoding; Effectiveness; Encoders-Decoders; Feature extraction; Homogeneity; Image caption; Modules; multimodal fusion; Reasoning; regular branch; Semantics; Source code; Training; transformer; Transformers; Visualization
Online access:	Order full text

container_end_page	11913
container_issue	11
container_start_page	11900
container_title	IEEE transactions on circuits and systems for video technology
container_volume	34
creator	Wang, Liya; Chen, Haipeng; Liu, Yu; Lyu, Yingda
description	Generating captions that are more diverse and closer to human descriptions is of paramount importance in image captioning. Recent research has made significant advances, with most methods adopting end-to-end encoder-decoder architectures that integrate specific feature-text processing. However, the homogeneity of their model structures, feature-text fusion that is either too simple or too complex, and the uniformity of training objectives have all, to some extent, limited the diversity and effectiveness of caption generation, and thus the potential applications of this task. In this paper, we therefore propose the Regular Constrained Multimodal Fusion (RCMF) method for image captioning, which better integrates information across and within modalities while approaching human-like fine-grained semantic perception and relationship reasoning. RCMF first processes images with a Swin-Transformer and then with an extended encoder containing a new intra-modal fusion module, which uses window-focused linear attention to capture features and leverages refined grid and global visual features. Combining these with text features, RCMF employs a cross-modal fusion module and a decoder to deeply model the interaction between text and image. In addition, RCMF introduces, for the first time, a regulatory modal fusion reasoning (MFR) branch, which surpasses the above architectures. Its MFR loss, combined with the cross-entropy loss, forms a new training objective that effectively mines fine-grained relationships between images and text and perceives the semantic information of images and their corresponding captions, thereby regulating the generated captions to be more diverse and human-like. Experimental results on the MS COCO 2014 dataset, particularly under the same experimental conditions, demonstrate the outstanding performance of our method, especially in terms of the METEOR, ROUGE-L, CIDEr, and SPICE metrics. Visualization results further confirm the effectiveness of RCMF. Source code: https://github.com/200084/RCMF-for-image-caption
doi_str_mv	10.1109/TCSVT.2024.3425513
format	Article
fullrecord	<record><control><sourceid>proquest_RIE</sourceid><recordid>TN_cdi_proquest_journals_3133497737</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><ieee_id>10589672</ieee_id><sourcerecordid>3133497737</sourcerecordid><originalsourceid>FETCH-LOGICAL-c927-e6bfcf9b9f731fed872769bb9bfb2d6c010b1ace926a61c40f34abbe3c35a6b93</originalsourceid><addsrcrecordid>eNpNkE1Lw0AQhhdRsFb_gHgIeE7d2c-sNwlWCxVBg9dld7NbUtKk7iYH_72p7cHTvAzvMwMPQreAFwBYPVTl51e1IJiwBWWEc6BnaAacFzkhmJ9PGXPICwL8El2ltMUYWMHkDD1--M3YmpiVfZeGaJrO19nb2A7Nrq9Nmy3H1PRdFvqYrXZm47PS7Idp03Sba3QRTJv8zWnOUbV8rsrXfP3-siqf1rlTROZe2OCCsipICsHXhSRSKGuVDZbUwmHAFozziggjwDEcKDPWeuooN8IqOkf3x7P72H-PPg1624-xmz5qCpQyJSWVU4scWy72KUUf9D42OxN_NGB9UKT_FOmDIn1SNEF3R6jx3v8DeKGEJPQXUvZjEg</addsrcrecordid><sourcetype>Aggregation Database</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>3133497737</pqid></control><display><type>article</type><title>Regular Constrained Multimodal Fusion for Image Captioning</title><source>IEEE Electronic Library (IEL)</source><creator>Wang, Liya ; Chen, Haipeng ; Liu, Yu ; Lyu, Yingda</creator><creatorcontrib>Wang, Liya ; Chen, Haipeng ; Liu, Yu ; Lyu, Yingda</creatorcontrib><description>More diverse and closer to human-like captions are of paramount importance in image captioning. Recent research has achieved significant advancements, with the majority adopting end-to-end encoder-decoder architectures that integrate specific feature-text processing. However, the homogeneity of their model structures, the simplicity or complexity of feature-text fusion, and the uniformity of training objectives have all to some extent affected the diversity and effectiveness of caption generation, thus limiting the potential applications of this task. 
Therefore, in this paper, we propose the Regular Constrained Multimodal Fusion (RCMF) method for image captioning to better integrate information across and within modalities, while also approaching human-like fine-grained semantic perception and relationship reasoning capabilities. Initially, our RCMF preprocesses images using a Swin-Transformer and then an extended encoder with a new intra-modal fusion module, utilizing window-focused linear attention to capture features and leveraging refined grid and global visual features. By combining text features, RCMF employs a cross-modal fusion module and decoder to deeply model the interaction between text and image. Additionally, RCMF first introduces a new additional regulatory modal fusion reasoning (MFR) branch, which surpasses the above architectures. Its MFR loss combined with cross-entropy loss forms a new training objective strategy, effectively mining fine-grained relationships between images and text, perceiving the semantic information of images and their corresponding captions, thereby regulating the generated captions to be more diverse and human-like. Experimental results based on the MS COCO 2014 dataset, particularly under the same experimental conditions, demonstrate the outstanding performance of our method, especially in terms of METEOR, ROUGE-L, CIDEr, and SPICE metrics. Visualization results further intuitively confirm the effectiveness of our RCMF method. 
Source code in https://github.com/200084/RCMF-for-image-caption .</description><identifier>ISSN: 1051-8215</identifier><identifier>EISSN: 1558-2205</identifier><identifier>DOI: 10.1109/TCSVT.2024.3425513</identifier><identifier>CODEN: ITCTEM</identifier><language>eng</language><publisher>New York: IEEE</publisher><subject>Cognition ; Constraints ; Data mining ; Decoding ; Effectiveness ; Encoders-Decoders ; Feature extraction ; Homogeneity ; Image caption ; Modules ; multimodal fusion ; Reasoning ; regular branch ; Semantics ; Source code ; Training ; transformer ; Transformers ; Visualization</subject><ispartof>IEEE transactions on circuits and systems for video technology, 2024-11, Vol.34 (11), p.11900-11913</ispartof><rights>Copyright The Institute of Electrical and Electronics Engineers, Inc. (IEEE) 2024</rights><lds50>peer_reviewed</lds50><woscitedreferencessubscribed>false</woscitedreferencessubscribed><cites>FETCH-LOGICAL-c927-e6bfcf9b9f731fed872769bb9bfb2d6c010b1ace926a61c40f34abbe3c35a6b93</cites><orcidid>0000-0002-9410-4120 ; 0009-0003-4382-0156 ; 0000-0003-3023-3027 ; 0000-0002-2037-6692</orcidid></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><linktohtml>$$Uhttps://ieeexplore.ieee.org/document/10589672$$EHTML$$P50$$Gieee$$H</linktohtml><link.rule.ids>314,776,780,792,27903,27904,54737</link.rule.ids><linktorsrc>$$Uhttps://ieeexplore.ieee.org/document/10589672$$EView_record_in_IEEE$$FView_record_in_$$GIEEE</linktorsrc></links><search><creatorcontrib>Wang, Liya</creatorcontrib><creatorcontrib>Chen, Haipeng</creatorcontrib><creatorcontrib>Liu, Yu</creatorcontrib><creatorcontrib>Lyu, Yingda</creatorcontrib><title>Regular Constrained Multimodal Fusion for Image Captioning</title><title>IEEE transactions on circuits and systems for video technology</title><addtitle>TCSVT</addtitle><description>More diverse and closer to human-like captions are of 
paramount importance in image captioning. Recent research has achieved significant advancements, with the majority adopting end-to-end encoder-decoder architectures that integrate specific feature-text processing. However, the homogeneity of their model structures, the simplicity or complexity of feature-text fusion, and the uniformity of training objectives have all to some extent affected the diversity and effectiveness of caption generation, thus limiting the potential applications of this task. Therefore, in this paper, we propose the Regular Constrained Multimodal Fusion (RCMF) method for image captioning to better integrate information across and within modalities, while also approaching human-like fine-grained semantic perception and relationship reasoning capabilities. Initially, our RCMF preprocesses images using a Swin-Transformer and then an extended encoder with a new intra-modal fusion module, utilizing window-focused linear attention to capture features and leveraging refined grid and global visual features. By combining text features, RCMF employs a cross-modal fusion module and decoder to deeply model the interaction between text and image. Additionally, RCMF first introduces a new additional regulatory modal fusion reasoning (MFR) branch, which surpasses the above architectures. Its MFR loss combined with cross-entropy loss forms a new training objective strategy, effectively mining fine-grained relationships between images and text, perceiving the semantic information of images and their corresponding captions, thereby regulating the generated captions to be more diverse and human-like. Experimental results based on the MS COCO 2014 dataset, particularly under the same experimental conditions, demonstrate the outstanding performance of our method, especially in terms of METEOR, ROUGE-L, CIDEr, and SPICE metrics. Visualization results further intuitively confirm the effectiveness of our RCMF method. 
Source code in https://github.com/200084/RCMF-for-image-caption .</description><subject>Cognition</subject><subject>Constraints</subject><subject>Data mining</subject><subject>Decoding</subject><subject>Effectiveness</subject><subject>Encoders-Decoders</subject><subject>Feature extraction</subject><subject>Homogeneity</subject><subject>Image caption</subject><subject>Modules</subject><subject>multimodal fusion</subject><subject>Reasoning</subject><subject>regular branch</subject><subject>Semantics</subject><subject>Source code</subject><subject>Training</subject><subject>transformer</subject><subject>Transformers</subject><subject>Visualization</subject><issn>1051-8215</issn><issn>1558-2205</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2024</creationdate><recordtype>article</recordtype><sourceid>RIE</sourceid><recordid>eNpNkE1Lw0AQhhdRsFb_gHgIeE7d2c-sNwlWCxVBg9dld7NbUtKk7iYH_72p7cHTvAzvMwMPQreAFwBYPVTl51e1IJiwBWWEc6BnaAacFzkhmJ9PGXPICwL8El2ltMUYWMHkDD1--M3YmpiVfZeGaJrO19nb2A7Nrq9Nmy3H1PRdFvqYrXZm47PS7Idp03Sba3QRTJv8zWnOUbV8rsrXfP3-siqf1rlTROZe2OCCsipICsHXhSRSKGuVDZbUwmHAFozziggjwDEcKDPWeuooN8IqOkf3x7P72H-PPg1624-xmz5qCpQyJSWVU4scWy72KUUf9D42OxN_NGB9UKT_FOmDIn1SNEF3R6jx3v8DeKGEJPQXUvZjEg</recordid><startdate>202411</startdate><enddate>202411</enddate><creator>Wang, Liya</creator><creator>Chen, Haipeng</creator><creator>Liu, Yu</creator><creator>Lyu, Yingda</creator><general>IEEE</general><general>The Institute of Electrical and Electronics Engineers, Inc. 
(IEEE)</general><scope>97E</scope><scope>RIA</scope><scope>RIE</scope><scope>AAYXX</scope><scope>CITATION</scope><scope>7SC</scope><scope>7SP</scope><scope>8FD</scope><scope>JQ2</scope><scope>L7M</scope><scope>L~C</scope><scope>L~D</scope><orcidid>https://orcid.org/0000-0002-9410-4120</orcidid><orcidid>https://orcid.org/0009-0003-4382-0156</orcidid><orcidid>https://orcid.org/0000-0003-3023-3027</orcidid><orcidid>https://orcid.org/0000-0002-2037-6692</orcidid></search><sort><creationdate>202411</creationdate><title>Regular Constrained Multimodal Fusion for Image Captioning</title><author>Wang, Liya ; Chen, Haipeng ; Liu, Yu ; Lyu, Yingda</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-c927-e6bfcf9b9f731fed872769bb9bfb2d6c010b1ace926a61c40f34abbe3c35a6b93</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2024</creationdate><topic>Cognition</topic><topic>Constraints</topic><topic>Data mining</topic><topic>Decoding</topic><topic>Effectiveness</topic><topic>Encoders-Decoders</topic><topic>Feature extraction</topic><topic>Homogeneity</topic><topic>Image caption</topic><topic>Modules</topic><topic>multimodal fusion</topic><topic>Reasoning</topic><topic>regular branch</topic><topic>Semantics</topic><topic>Source code</topic><topic>Training</topic><topic>transformer</topic><topic>Transformers</topic><topic>Visualization</topic><toplevel>peer_reviewed</toplevel><toplevel>online_resources</toplevel><creatorcontrib>Wang, Liya</creatorcontrib><creatorcontrib>Chen, Haipeng</creatorcontrib><creatorcontrib>Liu, Yu</creatorcontrib><creatorcontrib>Lyu, Yingda</creatorcontrib><collection>IEEE All-Society Periodicals Package (ASPP) 2005-present</collection><collection>IEEE All-Society Periodicals Package (ASPP) 1998-Present</collection><collection>IEEE Electronic Library (IEL)</collection><collection>CrossRef</collection><collection>Computer and Information Systems 
Abstracts</collection><collection>Electronics & Communications Abstracts</collection><collection>Technology Research Database</collection><collection>ProQuest Computer Science Collection</collection><collection>Advanced Technologies Database with Aerospace</collection><collection>Computer and Information Systems Abstracts Academic</collection><collection>Computer and Information Systems Abstracts Professional</collection><jtitle>IEEE transactions on circuits and systems for video technology</jtitle></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext_linktorsrc</fulltext></delivery><addata><au>Wang, Liya</au><au>Chen, Haipeng</au><au>Liu, Yu</au><au>Lyu, Yingda</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>Regular Constrained Multimodal Fusion for Image Captioning</atitle><jtitle>IEEE transactions on circuits and systems for video technology</jtitle><stitle>TCSVT</stitle><date>2024-11</date><risdate>2024</risdate><volume>34</volume><issue>11</issue><spage>11900</spage><epage>11913</epage><pages>11900-11913</pages><issn>1051-8215</issn><eissn>1558-2205</eissn><coden>ITCTEM</coden><abstract>More diverse and closer to human-like captions are of paramount importance in image captioning. Recent research has achieved significant advancements, with the majority adopting end-to-end encoder-decoder architectures that integrate specific feature-text processing. However, the homogeneity of their model structures, the simplicity or complexity of feature-text fusion, and the uniformity of training objectives have all to some extent affected the diversity and effectiveness of caption generation, thus limiting the potential applications of this task. 
Therefore, in this paper, we propose the Regular Constrained Multimodal Fusion (RCMF) method for image captioning to better integrate information across and within modalities, while also approaching human-like fine-grained semantic perception and relationship reasoning capabilities. Initially, our RCMF preprocesses images using a Swin-Transformer and then an extended encoder with a new intra-modal fusion module, utilizing window-focused linear attention to capture features and leveraging refined grid and global visual features. By combining text features, RCMF employs a cross-modal fusion module and decoder to deeply model the interaction between text and image. Additionally, RCMF first introduces a new additional regulatory modal fusion reasoning (MFR) branch, which surpasses the above architectures. Its MFR loss combined with cross-entropy loss forms a new training objective strategy, effectively mining fine-grained relationships between images and text, perceiving the semantic information of images and their corresponding captions, thereby regulating the generated captions to be more diverse and human-like. Experimental results based on the MS COCO 2014 dataset, particularly under the same experimental conditions, demonstrate the outstanding performance of our method, especially in terms of METEOR, ROUGE-L, CIDEr, and SPICE metrics. Visualization results further intuitively confirm the effectiveness of our RCMF method. Source code in https://github.com/200084/RCMF-for-image-caption .</abstract><cop>New York</cop><pub>IEEE</pub><doi>10.1109/TCSVT.2024.3425513</doi><tpages>14</tpages><orcidid>https://orcid.org/0000-0002-9410-4120</orcidid><orcidid>https://orcid.org/0009-0003-4382-0156</orcidid><orcidid>https://orcid.org/0000-0003-3023-3027</orcidid><orcidid>https://orcid.org/0000-0002-2037-6692</orcidid></addata></record>
fulltext	fulltext_linktorsrc
identifier	ISSN: 1051-8215
ispartof	IEEE transactions on circuits and systems for video technology, 2024-11, Vol.34 (11), p.11900-11913
issn	1051-8215 1558-2205
language	eng
recordid	cdi_proquest_journals_3133497737
source	IEEE Electronic Library (IEL)
subjects	Cognition; Constraints; Data mining; Decoding; Effectiveness; Encoders-Decoders; Feature extraction; Homogeneity; Image caption; Modules; multimodal fusion; Reasoning; regular branch; Semantics; Source code; Training; transformer; Transformers; Visualization
title	Regular Constrained Multimodal Fusion for Image Captioning
url	https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-22T00%3A17%3A36IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_RIE&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Regular%20Constrained%20Multimodal%20Fusion%20for%20Image%20Captioning&rft.jtitle=IEEE%20transactions%20on%20circuits%20and%20systems%20for%20video%20technology&rft.au=Wang,%20Liya&rft.date=2024-11&rft.volume=34&rft.issue=11&rft.spage=11900&rft.epage=11913&rft.pages=11900-11913&rft.issn=1051-8215&rft.eissn=1558-2205&rft.coden=ITCTEM&rft_id=info:doi/10.1109/TCSVT.2024.3425513&rft_dat=%3Cproquest_RIE%3E3133497737%3C/proquest_RIE%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=3133497737&rft_id=info:pmid/&rft_ieee_id=10589672&rfr_iscdi=true