Concept Parser With Multimodal Graph Learning for Video Captioning

Published in: IEEE Transactions on Circuits and Systems for Video Technology, 2023-09, Vol. 33 (9), pp. 4484-4495
Authors: Wu, Bofeng; Liu, Buyu; Huang, Peng; Bao, Jun; Xi, Peng; Yu, Jun
Format: Article
Language: English
Subjects: Learning; Parsers; Transformers
Online access: Full text
Description: Conventional video captioning methods are either stage-wise or simple end-to-end. While the former may introduce additional noise when exploiting off-the-shelf models to provide extra information, the latter suffers from a lack of high-level cues. A more desirable framework should therefore capture multiple aspects of videos consistently. To this end, we present a concept-aware and task-specific model named CAT that accounts for both low-level visual and high-level concept cues and incorporates them effectively in an end-to-end manner. Specifically, low-level visual and high-level concept features are obtained from the video transformer and the concept parser of CAT, and a concept loss is introduced to regularize the learning of the concept parser with respect to generated pseudo ground truth. To combine multi-level features, a caption transformer is then introduced in CAT, taking visual and concept features as inputs and producing the caption as output. In particular, we make critical design choices in the caption transformer so that it learns to exploit these cues with a multimodal graph. This is achieved by a graph loss that enforces effective learning of intra- and inter-correlations between multi-level cues. Extensive experiments on three benchmark datasets demonstrate that CAT achieves improvements of 2.3 and 0.7 CIDEr points on MSVD and MSR-VTT, respectively, over the state-of-the-art method SwinBERT, and also achieves a competitive result on VATEX.
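The abstract describes fusing low-level visual features and high-level concept features in a caption transformer regularized by a graph loss. Below is a minimal, illustrative sketch of that fusion idea only; the class and tensor names (ConceptAwareFusion, visual_feats, concept_feats), the similarity-matrix form of the graph loss, and all dimensions are assumptions made for illustration and are not taken from the paper.

```python
# Illustrative sketch (not the authors' implementation): fuse visual and
# concept token features with a transformer encoder and regularize their
# pairwise similarity structure with a simple graph-style loss.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ConceptAwareFusion(nn.Module):
    def __init__(self, dim=512, n_heads=8, n_layers=2):
        super().__init__()
        # Type embeddings distinguish visual tokens from concept tokens.
        self.type_embed = nn.Embedding(2, dim)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, visual_feats, concept_feats):
        # visual_feats: (B, Nv, dim) low-level features, e.g. from a video transformer
        # concept_feats: (B, Nc, dim) high-level features, e.g. from a concept parser
        b, nv, _ = visual_feats.shape
        nc = concept_feats.shape[1]
        tokens = torch.cat([visual_feats, concept_feats], dim=1)
        types = torch.cat([torch.zeros(b, nv, dtype=torch.long),
                           torch.ones(b, nc, dtype=torch.long)], dim=1)
        tokens = tokens + self.type_embed(types)
        # Fused multi-level tokens that a caption decoder could attend to.
        return self.encoder(tokens)


def graph_loss(fused, target_adjacency):
    # A simple stand-in for a graph loss: push the cosine-similarity matrix of
    # the fused tokens toward a target adjacency encoding intra/inter correlations.
    normed = F.normalize(fused, dim=-1)
    sim = normed @ normed.transpose(1, 2)  # (B, N, N) predicted token graph
    return F.mse_loss(sim, target_adjacency)


if __name__ == "__main__":
    fusion = ConceptAwareFusion()
    visual = torch.randn(2, 16, 512)          # 16 visual tokens per clip
    concepts = torch.randn(2, 8, 512)         # 8 concept tokens per clip
    fused = fusion(visual, concepts)
    target = torch.eye(24).expand(2, 24, 24)  # placeholder target graph
    print(fused.shape, graph_loss(fused, target).item())
```

In this sketch the type embedding is what lets a single encoder treat the two feature streams as one multimodal graph of tokens, while the auxiliary loss shapes how strongly tokens within and across the two streams correlate; how the paper actually constructs its graph and target correlations is not reproduced here.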
DOI: 10.1109/TCSVT.2023.3277827
ISSN: 1051-8215
EISSN: 1558-2205
Source: IEEE Electronic Library (IEL)