A Text-Guided Generation and Refinement Model for Image Captioning
A high-quality image description requires not only the logic and fluency of language but also the richness and accuracy of content. However, due to the semantic gap between vision and language, most existing image captioning approaches that directly learn the cross-modal mapping from vision to language...
Saved in:
Published in: | IEEE transactions on multimedia 2023, Vol.25, p.2966-2977 |
---|---|
Main authors: | Wang, Depeng; Hu, Zhenzhen; Zhou, Yuanen; Hong, Richang; Wang, Meng |
Format: | Article |
Language: | eng |
Subjects: | |
Online access: | Order full text |
container_end_page | 2977 |
---|---|
container_issue | |
container_start_page | 2966 |
container_title | IEEE transactions on multimedia |
container_volume | 25 |
creator | Wang, Depeng; Hu, Zhenzhen; Zhou, Yuanen; Hong, Richang; Wang, Meng |
description | A high-quality image description requires not only the logic and fluency of language but also the richness and accuracy of content. However, due to the semantic gap between vision and language, most existing image captioning approaches that directly learn the cross-modal mapping from vision to language are difficult to meet these two requirements simultaneously. Inspired by the progressive learning mechanism, we trace the "generating + refining" route and propose a novel Text-Guided Generation and Refinement (dubbed as TGGAR) model with assistance from the guide text to improve the quality of captions. The guide text is selected from the training set according to content similarity, then utilized to explore salient objects and extend candidate words. Specifically, we follow the encoder-decoder architecture, and design a Text-Guided Relation Encoder (TGRE) to learn the visual representation that is more consistent with human visual cognition. Besides, we divide the decoder part into two sub-modules: a Generator for the primary sentence generation and a Refiner for the sentence refinement. Generator, consisting of a standard LSTM and a Gate on Attention (GOA) module, aims to generate the primary sentence logically and fluently. Refiner contains a caption encoder module, an attention-based LSTM and a GOA module, which iteratively modifies the details in the primary caption to make captions rich and accurate. Extensive experiments on the MSCOCO captioning dataset demonstrate our framework with fewer parameters remains comparable to transformer-based methods, and achieves state-of-the-art performance compared with other relevant approaches. |
doi_str_mv | 10.1109/TMM.2022.3154149 |
format | Article |
fulltext | fulltext_linktorsrc |
identifier | ISSN: 1520-9210 |
ispartof | IEEE transactions on multimedia, 2023, Vol.25, p.2966-2977 |
issn | 1520-9210 1941-0077 |
language | eng |
recordid | cdi_proquest_journals_2847965769 |
source | IEEE Electronic Library (IEL) |
subjects | Attention mechanism; Coders; Cognition; Decoding; generating and refining decoder; Generators; image captioning; Image quality; Modules; Refining; Salience; Semantics; Sports; Sports equipment; text-guided; Training; Vision; Visualization |
title | A Text-Guided Generation and Refinement Model for Image Captioning |
url | https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-02-07T21%3A54%3A30IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_RIE&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=A%20Text-Guided%20Generation%20and%20Refinement%20Model%20for%20Image%20Captioning&rft.jtitle=IEEE%20transactions%20on%20multimedia&rft.au=Wang,%20Depeng&rft.date=2023&rft.volume=25&rft.spage=2966&rft.epage=2977&rft.pages=2966-2977&rft.issn=1520-9210&rft.eissn=1941-0077&rft.coden=ITMUF8&rft_id=info:doi/10.1109/TMM.2022.3154149&rft_dat=%3Cproquest_RIE%3E2847965769%3C/proquest_RIE%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=2847965769&rft_id=info:pmid/&rft_ieee_id=9720933&rfr_iscdi=true |
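The abstract above describes a two-stage "generate + refine" decoder: a Generator (an LSTM plus a Gate on Attention module) drafts a primary caption, and a Refiner (a caption encoder, an attention-based LSTM, and a GOA module) iteratively rewrites it. Below is a minimal, hypothetical PyTorch-style sketch of that control flow only; every module name, dimension, the mean-pooled stand-in for visual attention, and the number of refinement passes are assumptions made for illustration and do not reflect the authors' actual implementation.

```python
# Hypothetical sketch of the "generate + refine" decoding flow described in the
# abstract. All names, shapes, and the simplified attention are assumptions.
import torch
import torch.nn as nn


class GateOnAttention(nn.Module):
    """Assumed GOA form: a sigmoid gate mixing attended context into the hidden state."""
    def __init__(self, dim):
        super().__init__()
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, hidden, context):
        g = torch.sigmoid(self.gate(torch.cat([hidden, context], dim=-1)))
        return g * context + (1 - g) * hidden


class CaptionDecoder(nn.Module):
    """One decoding pass: an LSTM cell plus GOA emits a token sequence from a
    pooled context vector (visual features, plus a caption encoding for the Refiner)."""
    def __init__(self, vocab_size, dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.lstm = nn.LSTMCell(2 * dim, dim)
        self.goa = GateOnAttention(dim)
        self.out = nn.Linear(dim, vocab_size)

    def forward(self, context, max_len=20):
        b, d = context.shape
        h = context.new_zeros(b, d)
        c = context.new_zeros(b, d)
        word = torch.zeros(b, dtype=torch.long, device=context.device)  # <bos> id assumed to be 0
        tokens = []
        for _ in range(max_len):
            h, c = self.lstm(torch.cat([self.embed(word), context], dim=-1), (h, c))
            word = self.out(self.goa(h, context)).argmax(dim=-1)  # greedy decoding for the sketch
            tokens.append(word)
        return torch.stack(tokens, dim=1)  # (batch, max_len) token ids


def generate_and_refine(visual_feats, generator, refiner, caption_encoder, n_refine=2):
    """Stage 1 drafts a primary caption; stage 2 re-reads it and rewrites it."""
    ctx = visual_feats.mean(dim=1)            # stand-in for attention over region features
    caption = generator(ctx)                  # primary caption (Generator)
    for _ in range(n_refine):                 # iterative refinement (Refiner)
        cap_ctx = caption_encoder(caption)    # encode the current caption
        caption = refiner(ctx + cap_ctx)      # rewrite conditioned on image and caption
    return caption


if __name__ == "__main__":
    vocab, dim = 1000, 512
    generator = CaptionDecoder(vocab, dim)
    refiner = CaptionDecoder(vocab, dim)
    # Toy caption encoder (mean of word embeddings), just to make the sketch runnable.
    emb = nn.Embedding(vocab, dim)
    caption_encoder = lambda ids: emb(ids).mean(dim=1)
    feats = torch.randn(2, 36, dim)           # e.g., 36 region features per image
    print(generate_and_refine(feats, generator, refiner, caption_encoder).shape)
```

In the paper the two stages also share a Text-Guided Relation Encoder and use a guide text retrieved from the training set; those components are omitted here, so this sketch only illustrates the draft-then-rewrite structure of the decoder.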