A Text-Guided Generation and Refinement Model for Image Captioning

A high-quality image description requires not only logical, fluent language but also rich and accurate content. However, due to the semantic gap between vision and language, most existing image captioning approaches that directly learn the cross-modal mapping from vision to language struggle to meet these two requirements simultaneously. Inspired by the progressive learning mechanism, we follow the "generating + refining" route and propose a novel Text-Guided Generation and Refinement (TGGAR) model that uses a guide text to improve caption quality. The guide text is selected from the training set according to content similarity and is then used to explore salient objects and extend candidate words. Specifically, we follow the encoder-decoder architecture and design a Text-Guided Relation Encoder (TGRE) to learn a visual representation that is more consistent with human visual cognition. We further divide the decoder into two sub-modules: a Generator for primary sentence generation and a Refiner for sentence refinement. The Generator, consisting of a standard LSTM and a Gate on Attention (GOA) module, aims to generate the primary sentence logically and fluently. The Refiner contains a caption encoder module, an attention-based LSTM, and a GOA module, and iteratively modifies details in the primary caption to make it rich and accurate. Extensive experiments on the MS COCO captioning dataset demonstrate that our framework, with fewer parameters, remains comparable to transformer-based methods and achieves state-of-the-art performance compared with other relevant approaches.
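The two-stage decoder described in the abstract can be pictured with a short, self-contained sketch. The following PyTorch code is a minimal illustration of the generate-then-refine idea only: the class names (DraftGenerator, CaptionRefiner), the feature and hidden sizes, the greedy decoding loop, and the sigmoid gate standing in for the Gate on Attention (GOA) module are all illustrative assumptions, and the guide-text retrieval and the Text-Guided Relation Encoder are omitted. It is a sketch of the concept, not the authors' implementation.

```python
# Minimal sketch of a "generating + refining" caption decoder.
# Stage 1 drafts a caption left-to-right; stage 2 re-reads the whole draft
# plus the image and re-predicts every word for a few refinement passes.
import torch
import torch.nn as nn
import torch.nn.functional as F


class DraftGenerator(nn.Module):
    """Stage 1: an LSTM with gated attention that writes the primary (draft) caption."""

    def __init__(self, vocab_size, embed_dim=512, hidden_dim=512, feat_dim=2048):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTMCell(embed_dim + feat_dim, hidden_dim)
        self.attn = nn.Linear(hidden_dim + feat_dim, 1)   # additive-style attention scores
        self.gate = nn.Linear(hidden_dim, feat_dim)       # stand-in for the GOA gating step
        self.out = nn.Linear(hidden_dim + feat_dim, vocab_size)

    def forward(self, feats, max_len=16, bos_id=1):
        # feats: (batch, regions, feat_dim) region-level visual features
        b, r, _ = feats.shape
        h = feats.new_zeros(b, self.lstm.hidden_size)
        c = feats.new_zeros(b, self.lstm.hidden_size)
        word = feats.new_full((b,), bos_id, dtype=torch.long)
        tokens = []
        for _ in range(max_len):
            # attend over regions, then gate the attended vector with the current state
            scores = self.attn(torch.cat([h.unsqueeze(1).expand(-1, r, -1), feats], dim=-1))
            ctx = (F.softmax(scores, dim=1) * feats).sum(dim=1)
            ctx = torch.sigmoid(self.gate(h)) * ctx
            h, c = self.lstm(torch.cat([self.embed(word), ctx], dim=-1), (h, c))
            word = self.out(torch.cat([h, ctx], dim=-1)).argmax(dim=-1)  # greedy choice
            tokens.append(word)
        return torch.stack(tokens, dim=1)                  # (batch, max_len) draft caption


class CaptionRefiner(nn.Module):
    """Stage 2: encodes the draft caption and re-predicts each word, for a few passes."""

    def __init__(self, vocab_size, embed_dim=512, hidden_dim=512, feat_dim=2048):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.cap_enc = nn.LSTM(embed_dim, hidden_dim, batch_first=True)  # caption encoder
        self.lstm = nn.LSTMCell(hidden_dim + feat_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, feats, draft, n_iters=2):
        caption = draft
        for _ in range(n_iters):                           # iterative refinement passes
            enc, _ = self.cap_enc(self.embed(caption))     # (batch, T, hidden_dim)
            g = feats.mean(dim=1)                          # pooled visual context
            h = feats.new_zeros(feats.size(0), self.lstm.hidden_size)
            c = torch.zeros_like(h)
            new_tokens = []
            for t in range(caption.size(1)):
                h, c = self.lstm(torch.cat([enc[:, t], g], dim=-1), (h, c))
                new_tokens.append(self.out(h).argmax(dim=-1))
            caption = torch.stack(new_tokens, dim=1)       # rewritten caption, same length
        return caption


# Toy usage: 2 images, 36 region features each.
feats = torch.randn(2, 36, 2048)
draft = DraftGenerator(vocab_size=10000)(feats)
refined = CaptionRefiner(vocab_size=10000)(feats, draft)
```

Under this reading, the Refiner sees the entire draft sentence when rewriting any position, which is what lets it correct wording that a purely left-to-right generator commits to too early.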


Bibliographic Details
Published in: IEEE Transactions on Multimedia, 2023, Vol. 25, pp. 2966-2977
Authors: Wang, Depeng; Hu, Zhenzhen; Zhou, Yuanen; Hong, Richang; Wang, Meng
Format: Article
Language: English
Subjects:
Online access: Order full text
container_end_page 2977
container_issue
container_start_page 2966
container_title IEEE transactions on multimedia
container_volume 25
creator Wang, Depeng
Hu, Zhenzhen
Zhou, Yuanen
Hong, Richang
Wang, Meng
description A high-quality image description requires not only logical, fluent language but also rich and accurate content. However, due to the semantic gap between vision and language, most existing image captioning approaches that directly learn the cross-modal mapping from vision to language struggle to meet these two requirements simultaneously. Inspired by the progressive learning mechanism, we follow the "generating + refining" route and propose a novel Text-Guided Generation and Refinement (TGGAR) model that uses a guide text to improve caption quality. The guide text is selected from the training set according to content similarity and is then used to explore salient objects and extend candidate words. Specifically, we follow the encoder-decoder architecture and design a Text-Guided Relation Encoder (TGRE) to learn a visual representation that is more consistent with human visual cognition. We further divide the decoder into two sub-modules: a Generator for primary sentence generation and a Refiner for sentence refinement. The Generator, consisting of a standard LSTM and a Gate on Attention (GOA) module, aims to generate the primary sentence logically and fluently. The Refiner contains a caption encoder module, an attention-based LSTM, and a GOA module, and iteratively modifies details in the primary caption to make it rich and accurate. Extensive experiments on the MS COCO captioning dataset demonstrate that our framework, with fewer parameters, remains comparable to transformer-based methods and achieves state-of-the-art performance compared with other relevant approaches.
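The description above also states that the guide text is selected from the training set according to content similarity. As a purely illustrative sketch of such a retrieval step (cosine similarity over pooled image features is an assumption here; the record does not specify the paper's actual similarity measure or features), it might look like:

```python
# Hypothetical guide-text selection: for each query image, return the training caption whose
# image is most similar in pooled visual-feature space. Function name and the
# cosine-similarity criterion are illustrative assumptions, not the paper's specification.
import torch
import torch.nn.functional as F

def select_guide_text(query_feats, train_feats, train_captions):
    """query_feats: (Q, D) pooled query-image features; train_feats: (N, D);
    train_captions: list of N training captions."""
    q = F.normalize(query_feats, dim=-1)
    t = F.normalize(train_feats, dim=-1)
    nearest = (q @ t.t()).argmax(dim=-1)          # most similar training image per query
    return [train_captions[i] for i in nearest.tolist()]

# Toy usage: 2 query images against 5 training images with 512-d pooled features.
guides = select_guide_text(torch.randn(2, 512), torch.randn(5, 512),
                           [f"training caption {i}" for i in range(5)])
```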
doi_str_mv 10.1109/TMM.2022.3154149
format Article
identifier ISSN: 1520-9210
ispartof IEEE transactions on multimedia, 2023, Vol.25, p.2966-2977
issn 1520-9210
1941-0077
language eng
recordid cdi_proquest_journals_2847965769
source IEEE Electronic Library (IEL)
subjects Attention mechanism
Coders
Cognition
Decoding
generating and refining decoder
Generators
image captioning
Image quality
Modules
Refining
Salience
Semantics
Sports
Sports equipment
text-guided
Training
Vision
Visualization
title A Text-Guided Generation and Refinement Model for Image Captioning