Concept Parser With Multimodal Graph Learning for Video Captioning

Published in: IEEE Transactions on Circuits and Systems for Video Technology, 2023-09, Vol. 33 (9), pp. 4484-4495
Authors: Wu, Bofeng; Liu, Buyu; Huang, Peng; Bao, Jun; Xi, Peng; Yu, Jun
Format: Article
Language: English
Subjects: Learning; Parsers; Transformers
Online access: Full text
Description: Conventional video captioning methods are either stage-wise or simple end-to-end. While the former may introduce additional noise when exploiting off-the-shelf models to provide extra information, the latter suffers from a lack of high-level cues. A more desirable framework should therefore capture multiple aspects of videos consistently. To this end, we present a concept-aware and task-specific model named CAT that accounts for both low-level visual and high-level concept cues and incorporates them effectively in an end-to-end manner. Specifically, low-level visual and high-level concept features are obtained from the video transformer and the concept parser of CAT, and a concept loss is introduced to regularize the learning of the concept parser with respect to generated pseudo ground truth. To combine multi-level features, a caption transformer is then introduced in CAT, taking visual and concept features as inputs and producing the caption as output. In particular, we make critical design choices in the caption transformer so that it learns to exploit these cues with a multimodal graph. This is achieved by a graph loss that enforces effective learning of intra- and inter-correlations between multi-level cues. Extensive experiments on three benchmark datasets demonstrate that CAT achieves improvements of 2.3 and 0.7 CIDEr points on MSVD and MSR-VTT, respectively, over the state-of-the-art method SwinBERT, and also achieves a competitive result on VATEX.
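The abstract describes fusing low-level visual features and high-level concept features in a caption transformer regularized by a graph loss. Below is a minimal, illustrative sketch of that fusion idea only; the class and tensor names (ConceptAwareFusion, visual_feats, concept_feats), the similarity-matrix form of the graph loss, and all dimensions are assumptions made for illustration and are not taken from the paper.

```python
# Illustrative sketch (not the authors' implementation): fuse visual and
# concept token features with a transformer encoder and regularize their
# pairwise similarity structure with a simple graph-style loss.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ConceptAwareFusion(nn.Module):
    def __init__(self, dim=512, n_heads=8, n_layers=2):
        super().__init__()
        # Type embeddings distinguish visual tokens from concept tokens.
        self.type_embed = nn.Embedding(2, dim)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, visual_feats, concept_feats):
        # visual_feats: (B, Nv, dim) low-level features, e.g. from a video transformer
        # concept_feats: (B, Nc, dim) high-level features, e.g. from a concept parser
        b, nv, _ = visual_feats.shape
        nc = concept_feats.shape[1]
        tokens = torch.cat([visual_feats, concept_feats], dim=1)
        types = torch.cat([torch.zeros(b, nv, dtype=torch.long),
                           torch.ones(b, nc, dtype=torch.long)], dim=1)
        tokens = tokens + self.type_embed(types)
        # Fused multi-level tokens that a caption decoder could attend to.
        return self.encoder(tokens)


def graph_loss(fused, target_adjacency):
    # A simple stand-in for a graph loss: push the cosine-similarity matrix of
    # the fused tokens toward a target adjacency encoding intra/inter correlations.
    normed = F.normalize(fused, dim=-1)
    sim = normed @ normed.transpose(1, 2)  # (B, N, N) predicted token graph
    return F.mse_loss(sim, target_adjacency)


if __name__ == "__main__":
    fusion = ConceptAwareFusion()
    visual = torch.randn(2, 16, 512)          # 16 visual tokens per clip
    concepts = torch.randn(2, 8, 512)         # 8 concept tokens per clip
    fused = fusion(visual, concepts)
    target = torch.eye(24).expand(2, 24, 24)  # placeholder target graph
    print(fused.shape, graph_loss(fused, target).item())
```

In this sketch the type embedding is what lets a single encoder treat the two feature streams as one multimodal graph of tokens, while the auxiliary loss shapes how strongly tokens within and across the two streams correlate; how the paper actually constructs its graph and target correlations is not reproduced here.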
DOI: 10.1109/TCSVT.2023.3277827
ISSN: 1051-8215
EISSN: 1558-2205
Source: IEEE Electronic Library (IEL)