Semantic Enhanced Video Captioning with Multi-feature Fusion

Video captioning aims to automatically describe a video clip with informative sentences. At present, deep learning-based models have become the mainstream for this task and achieved competitive results on public datasets. Usually, these methods leverage different types of features to generate sentences, e.g., semantic information and 2D or 3D visual features. However, some methods only treat semantic information as a complement of visual representations and cannot fully exploit it; some of them ignore the relationship between different types of features. In addition, most of them select multiple frames of a video with an equally spaced sampling scheme, resulting in much redundant information. To address these issues, we present a novel video captioning framework, Semantic Enhanced video captioning with Multi-feature Fusion, SEMF for short. It optimizes the use of different types of features from three aspects. First, a semantic encoder is designed to enhance meaningful semantic features through a semantic dictionary to boost performance. Second, a discrete selection module pays attention to important features and obtains different contexts at different steps to reduce feature redundancy. Finally, a multi-feature fusion module uses a novel relation-aware attention mechanism to separate the common and complementary components of different features to provide more effective visual features for the next step. Moreover, the entire framework can be trained in an end-to-end manner. Extensive experiments are conducted on the Microsoft Research Video Description Corpus (MSVD) and MSR-Video to Text (MSR-VTT) datasets. The results demonstrate that SEMF achieves state-of-the-art results.
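The abstract above only outlines SEMF's modules at a high level, so the snippet below is a minimal PyTorch sketch of what a relation-aware fusion of two feature streams (e.g., 2D appearance and 3D motion features) might look like: a common component is estimated via cross-attention between the streams, and the residuals are treated as complementary parts before fusion. All class names, dimensions, and the common/complementary split used here are illustrative assumptions, not the authors' implementation.

# Illustrative sketch only: the paper's exact SEMF modules are not specified in
# this record, so the module name, dimensions, and fusion rule below are assumptions.
import torch
import torch.nn as nn

class RelationAwareFusion(nn.Module):
    """Toy fusion of two feature streams (e.g., 2D appearance and 3D motion)."""

    def __init__(self, dim_2d: int, dim_3d: int, dim_hidden: int):
        super().__init__()
        self.proj_2d = nn.Linear(dim_2d, dim_hidden)
        self.proj_3d = nn.Linear(dim_3d, dim_hidden)
        self.cross_attn = nn.MultiheadAttention(dim_hidden, num_heads=4, batch_first=True)
        self.fuse = nn.Linear(3 * dim_hidden, dim_hidden)

    def forward(self, feat_2d: torch.Tensor, feat_3d: torch.Tensor) -> torch.Tensor:
        # feat_2d: (batch, n_frames, dim_2d); feat_3d: (batch, n_clips, dim_3d)
        a = self.proj_2d(feat_2d)
        m = self.proj_3d(feat_3d)
        # "Common" component: what the 2D stream can attend to in the 3D stream.
        common, _ = self.cross_attn(query=a, key=m, value=m)
        # "Complementary" components: residual information unique to each stream.
        comp_2d = a - common
        comp_3d = m.mean(dim=1, keepdim=True).expand_as(a) - common
        # Concatenate and project back to the decoder's hidden size.
        return self.fuse(torch.cat([common, comp_2d, comp_3d], dim=-1))

if __name__ == "__main__":
    fusion = RelationAwareFusion(dim_2d=2048, dim_3d=1024, dim_hidden=512)
    out = fusion(torch.randn(2, 8, 2048), torch.randn(2, 4, 1024))
    print(out.shape)  # torch.Size([2, 8, 512])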

Bibliographic Details
Published in: ACM Transactions on Multimedia Computing, Communications, and Applications, 2023-11, Vol. 19 (6), p. 1-21
Main Authors: Niu, Tian-Zi; Dong, Shan-Shan; Chen, Zhen-Duo; Luo, Xin; Guo, Shanqing; Huang, Zi; Xu, Xin-Shun
Format: Article
Language: English
Subjects: Computing methodologies; Machine learning; Neural networks; Video summarization
Online Access: Full text
DOI: 10.1145/3588572
ISSN: 1551-6857
EISSN: 1551-6865
Publisher: ACM, New York, NY