Concept Parser With Multimodal Graph Learning for Video Captioning
Conventional video captioning methods are either stage-wise or simple end-to-end. While the former might introduce additional noise when exploiting off-the-shelf models to provide extra information, the latter suffers from a lack of high-level cues. Therefore, a more desirable framework should be able to capture multiple aspects of videos consistently.
Saved in:
Published in: | IEEE transactions on circuits and systems for video technology 2023-09, Vol.33 (9), p.4484-4495 |
---|---|
Main authors: | Wu, Bofeng; Liu, Buyu; Huang, Peng; Bao, Jun; Xi, Peng; Yu, Jun |
Format: | Article |
Language: | eng |
Subjects: | Learning; Parsers; Transformers |
Online access: | Full text |
container_end_page | 4495 |
---|---|
container_issue | 9 |
container_start_page | 4484 |
container_title | IEEE transactions on circuits and systems for video technology |
container_volume | 33 |
creator | Wu, Bofeng; Liu, Buyu; Huang, Peng; Bao, Jun; Xi, Peng; Yu, Jun |
description | Conventional video captioning methods are either stage-wise or simple end-to-end. While the former might introduce additional noise when exploiting off-the-shelf models to provide extra information, the latter suffers from a lack of high-level cues. Therefore, a more desirable framework should be able to capture multiple aspects of videos consistently. To this end, we present a concept-aware and task-specific model named CAT that accounts for both low-level visual and high-level concept cues and incorporates them effectively in an end-to-end manner. Specifically, low-level visual and high-level concept features are obtained from the video transformer and the concept parser of CAT, and a concept loss is further introduced to regularize the learning of the concept parser w.r.t. generated pseudo ground truth. To combine multi-level features, a caption transformer is then introduced in CAT, where the visual and concept features are the inputs and the caption is the output. In particular, we make critical design choices in the caption transformer so that it learns to exploit these cues with a multimodal graph. This is achieved by a graph loss that enforces effective learning of intra- and inter-correlations between multi-level cues. Extensive experiments on three benchmark datasets demonstrate that CAT achieves 2.3 and 0.7 improvements in the CIDEr metric on MSVD and MSR-VTT compared to the state-of-the-art method SwinBERT, and also achieves a competitive result on VATEX. |
doi_str_mv | 10.1109/TCSVT.2023.3277827 |
format | Article |
fullrecord | Publisher: New York: The Institute of Electrical and Electronics Engineers, Inc. (IEEE); Rights: Copyright The Institute of Electrical and Electronics Engineers, Inc. (IEEE) 2023; Peer reviewed; ORCID: 0000-0003-1922-7283, 0000-0002-4539-4854, 0000-0002-5727-2790; Pages: 12 |
fulltext | fulltext |
identifier | ISSN: 1051-8215 |
ispartof | IEEE transactions on circuits and systems for video technology, 2023-09, Vol.33 (9), p.4484-4495 |
issn | 1051-8215; 1558-2205 |
language | eng |
recordid | cdi_proquest_journals_2861467853 |
source | IEEE Electronic Library (IEL) |
subjects | Learning; Parsers; Transformers |
title | Concept Parser With Multimodal Graph Learning for Video Captioning |
url | https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-02-08T01%3A16%3A04IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_cross&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Concept%20Parser%20With%20Multimodal%20Graph%20Learning%20for%20Video%20Captioning&rft.jtitle=IEEE%20transactions%20on%20circuits%20and%20systems%20for%20video%20technology&rft.au=Wu,%20Bofeng&rft.date=2023-09-01&rft.volume=33&rft.issue=9&rft.spage=4484&rft.epage=4495&rft.pages=4484-4495&rft.issn=1051-8215&rft.eissn=1558-2205&rft_id=info:doi/10.1109/TCSVT.2023.3277827&rft_dat=%3Cproquest_cross%3E2861467853%3C/proquest_cross%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=2861467853&rft_id=info:pmid/&rfr_iscdi=true |
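
The abstract above outlines the CAT pipeline: a video transformer supplies low-level visual features, a concept parser supplies high-level concept cues and is regularized by a concept loss against pseudo ground truth, and a caption transformer fuses both streams under a graph loss. The following is a minimal, hypothetical sketch of how such a two-stream setup with auxiliary losses could be wired up; the module names (ConceptParser, CaptionModel), the top-k concept selection, the dimensions, and the simplified loss forms are all illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of a CAT-style two-stream captioning setup:
# visual tokens + predicted concept tokens, fused by a transformer,
# trained with an auxiliary concept loss and a (simplified) graph loss.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ConceptParser(nn.Module):
    """Predicts concept probabilities from pooled visual features (assumed form)."""

    def __init__(self, vis_dim: int, num_concepts: int):
        super().__init__()
        self.head = nn.Linear(vis_dim, num_concepts)

    def forward(self, vis_feats: torch.Tensor) -> torch.Tensor:
        # vis_feats: (batch, frames, vis_dim) -> concept logits (batch, num_concepts)
        return self.head(vis_feats.mean(dim=1))


class CaptionModel(nn.Module):
    """Toy fusion of visual tokens and concept embeddings for captioning."""

    def __init__(self, vis_dim: int, num_concepts: int, vocab_size: int, d_model: int = 256):
        super().__init__()
        self.concept_parser = ConceptParser(vis_dim, num_concepts)
        self.vis_proj = nn.Linear(vis_dim, d_model)
        self.concept_emb = nn.Embedding(num_concepts, d_model)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True), num_layers=2
        )
        self.vocab_head = nn.Linear(d_model, vocab_size)

    def forward(self, vis_feats: torch.Tensor, top_k: int = 5):
        concept_logits = self.concept_parser(vis_feats)       # (B, C)
        top_ids = concept_logits.topk(top_k, dim=-1).indices  # (B, K) selected concepts
        vis_tokens = self.vis_proj(vis_feats)                  # (B, T, D)
        concept_tokens = self.concept_emb(top_ids)             # (B, K, D)
        fused = self.encoder(torch.cat([vis_tokens, concept_tokens], dim=1))
        word_logits = self.vocab_head(fused)                   # (B, T+K, vocab)
        return word_logits, concept_logits, vis_tokens, concept_tokens


def concept_loss(concept_logits, pseudo_labels):
    # Multi-label BCE against pseudo ground-truth concepts (assumed form).
    return F.binary_cross_entropy_with_logits(concept_logits, pseudo_labels)


def graph_loss(vis_tokens, concept_tokens):
    # Crude stand-in for the multimodal graph objective: encourage the visual
    # and concept token summaries to agree (assumed, not the paper's loss).
    v = F.normalize(vis_tokens.mean(dim=1), dim=-1)
    c = F.normalize(concept_tokens.mean(dim=1), dim=-1)
    return (1.0 - (v * c).sum(dim=-1)).mean()


if __name__ == "__main__":
    model = CaptionModel(vis_dim=512, num_concepts=100, vocab_size=1000)
    feats = torch.randn(2, 8, 512)                  # 2 clips, 8 frames of pooled features
    pseudo = torch.randint(0, 2, (2, 100)).float()  # pseudo concept labels
    logits, c_logits, v_tok, c_tok = model(feats)
    loss = concept_loss(c_logits, pseudo) + graph_loss(v_tok, c_tok)
    print(logits.shape, loss.item())
```

The actual model decodes captions autoregressively and builds an explicit multimodal graph over the fused cues; the sketch only shows how the two feature streams and the two auxiliary losses fit together.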