A Text-Guided Generation and Refinement Model for Image Captioning

A high-quality image description requires not only logical, fluent language but also rich and accurate content. However, due to the semantic gap between vision and language, most existing image captioning approaches that directly learn the cross-modal mapping from vision to language struggle to meet these two requirements simultaneously. Inspired by the progressive learning mechanism, we follow the "generating + refining" route and propose a novel Text-Guided Generation and Refinement (TGGAR) model that uses a guide text to improve caption quality. The guide text is selected from the training set according to content similarity and is then used to explore salient objects and extend candidate words. Specifically, we follow the encoder-decoder architecture and design a Text-Guided Relation Encoder (TGRE) to learn a visual representation that is more consistent with human visual cognition. We further divide the decoder into two sub-modules: a Generator for primary sentence generation and a Refiner for sentence refinement. The Generator, consisting of a standard LSTM and a Gate on Attention (GOA) module, aims to generate the primary sentence logically and fluently. The Refiner contains a caption encoder module, an attention-based LSTM, and a GOA module, and iteratively modifies details in the primary caption to make it rich and accurate. Extensive experiments on the MS COCO captioning dataset demonstrate that our framework, with fewer parameters, remains comparable to transformer-based methods and achieves state-of-the-art performance compared with other relevant approaches.
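The two-stage decoder described in the abstract can be pictured with a short, self-contained sketch. The following PyTorch code is a minimal illustration of the generate-then-refine idea only: the class names (DraftGenerator, CaptionRefiner), the feature and hidden sizes, the greedy decoding loop, and the sigmoid gate standing in for the Gate on Attention (GOA) module are all illustrative assumptions, and the guide-text retrieval and the Text-Guided Relation Encoder are omitted. It is a sketch of the concept, not the authors' implementation.

```python
# Minimal sketch of a "generating + refining" caption decoder.
# Stage 1 drafts a caption left-to-right; stage 2 re-reads the whole draft
# plus the image and re-predicts every word for a few refinement passes.
import torch
import torch.nn as nn
import torch.nn.functional as F


class DraftGenerator(nn.Module):
    """Stage 1: an LSTM with gated attention that writes the primary (draft) caption."""

    def __init__(self, vocab_size, embed_dim=512, hidden_dim=512, feat_dim=2048):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTMCell(embed_dim + feat_dim, hidden_dim)
        self.attn = nn.Linear(hidden_dim + feat_dim, 1)   # additive-style attention scores
        self.gate = nn.Linear(hidden_dim, feat_dim)       # stand-in for the GOA gating step
        self.out = nn.Linear(hidden_dim + feat_dim, vocab_size)

    def forward(self, feats, max_len=16, bos_id=1):
        # feats: (batch, regions, feat_dim) region-level visual features
        b, r, _ = feats.shape
        h = feats.new_zeros(b, self.lstm.hidden_size)
        c = feats.new_zeros(b, self.lstm.hidden_size)
        word = feats.new_full((b,), bos_id, dtype=torch.long)
        tokens = []
        for _ in range(max_len):
            # attend over regions, then gate the attended vector with the current state
            scores = self.attn(torch.cat([h.unsqueeze(1).expand(-1, r, -1), feats], dim=-1))
            ctx = (F.softmax(scores, dim=1) * feats).sum(dim=1)
            ctx = torch.sigmoid(self.gate(h)) * ctx
            h, c = self.lstm(torch.cat([self.embed(word), ctx], dim=-1), (h, c))
            word = self.out(torch.cat([h, ctx], dim=-1)).argmax(dim=-1)  # greedy choice
            tokens.append(word)
        return torch.stack(tokens, dim=1)                  # (batch, max_len) draft caption


class CaptionRefiner(nn.Module):
    """Stage 2: encodes the draft caption and re-predicts each word, for a few passes."""

    def __init__(self, vocab_size, embed_dim=512, hidden_dim=512, feat_dim=2048):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.cap_enc = nn.LSTM(embed_dim, hidden_dim, batch_first=True)  # caption encoder
        self.lstm = nn.LSTMCell(hidden_dim + feat_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, feats, draft, n_iters=2):
        caption = draft
        for _ in range(n_iters):                           # iterative refinement passes
            enc, _ = self.cap_enc(self.embed(caption))     # (batch, T, hidden_dim)
            g = feats.mean(dim=1)                          # pooled visual context
            h = feats.new_zeros(feats.size(0), self.lstm.hidden_size)
            c = torch.zeros_like(h)
            new_tokens = []
            for t in range(caption.size(1)):
                h, c = self.lstm(torch.cat([enc[:, t], g], dim=-1), (h, c))
                new_tokens.append(self.out(h).argmax(dim=-1))
            caption = torch.stack(new_tokens, dim=1)       # rewritten caption, same length
        return caption


# Toy usage: 2 images, 36 region features each.
feats = torch.randn(2, 36, 2048)
draft = DraftGenerator(vocab_size=10000)(feats)
refined = CaptionRefiner(vocab_size=10000)(feats, draft)
```

Under this reading, the Refiner sees the entire draft sentence when rewriting any position, which is what lets it correct wording that a purely left-to-right generator commits to too early.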


Bibliographic Details
Published in: IEEE Transactions on Multimedia, 2023, Vol. 25, pp. 2966-2977
Authors: Wang, Depeng; Hu, Zhenzhen; Zhou, Yuanen; Hong, Richang; Wang, Meng
Format: Article
Language: English
Subjects:
Online access: Order full text
container_end_page 2977
container_issue
container_start_page 2966
container_title IEEE transactions on multimedia
container_volume 25
creator Wang, Depeng
Hu, Zhenzhen
Zhou, Yuanen
Hong, Richang
Wang, Meng
description A high-quality image description requires not only logical, fluent language but also rich and accurate content. However, due to the semantic gap between vision and language, most existing image captioning approaches that directly learn the cross-modal mapping from vision to language struggle to meet these two requirements simultaneously. Inspired by the progressive learning mechanism, we follow the "generating + refining" route and propose a novel Text-Guided Generation and Refinement (TGGAR) model that uses a guide text to improve caption quality. The guide text is selected from the training set according to content similarity and is then used to explore salient objects and extend candidate words. Specifically, we follow the encoder-decoder architecture and design a Text-Guided Relation Encoder (TGRE) to learn a visual representation that is more consistent with human visual cognition. We further divide the decoder into two sub-modules: a Generator for primary sentence generation and a Refiner for sentence refinement. The Generator, consisting of a standard LSTM and a Gate on Attention (GOA) module, aims to generate the primary sentence logically and fluently. The Refiner contains a caption encoder module, an attention-based LSTM, and a GOA module, and iteratively modifies details in the primary caption to make it rich and accurate. Extensive experiments on the MS COCO captioning dataset demonstrate that our framework, with fewer parameters, remains comparable to transformer-based methods and achieves state-of-the-art performance compared with other relevant approaches.
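The description above also states that the guide text is selected from the training set according to content similarity. As a purely illustrative sketch of such a retrieval step (cosine similarity over pooled image features is an assumption here; the record does not specify the paper's actual similarity measure or features), it might look like:

```python
# Hypothetical guide-text selection: for each query image, return the training caption whose
# image is most similar in pooled visual-feature space. Function name and the
# cosine-similarity criterion are illustrative assumptions, not the paper's specification.
import torch
import torch.nn.functional as F

def select_guide_text(query_feats, train_feats, train_captions):
    """query_feats: (Q, D) pooled query-image features; train_feats: (N, D);
    train_captions: list of N training captions."""
    q = F.normalize(query_feats, dim=-1)
    t = F.normalize(train_feats, dim=-1)
    nearest = (q @ t.t()).argmax(dim=-1)          # most similar training image per query
    return [train_captions[i] for i in nearest.tolist()]

# Toy usage: 2 query images against 5 training images with 512-d pooled features.
guides = select_guide_text(torch.randn(2, 512), torch.randn(5, 512),
                           [f"training caption {i}" for i in range(5)])
```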
doi_str_mv 10.1109/TMM.2022.3154149
format Article
identifier ISSN: 1520-9210
ispartof IEEE transactions on multimedia, 2023, Vol.25, p.2966-2977
issn 1520-9210
1941-0077
language eng
recordid cdi_proquest_journals_2847965769
source IEEE Electronic Library (IEL)
subjects Attention mechanism
Coders
Cognition
Decoding
generating and refining decoder
Generators
image captioning
Image quality
Modules
Refining
Salience
Semantics
Sports
Sports equipment
text-guided
Training
Vision
Visualization
title A Text-Guided Generation and Refinement Model for Image Captioning