Semantic Enhanced Video Captioning with Multi-feature Fusion

Video captioning aims to automatically describe a video clip with informative sentences. At present, deep learning-based models have become the mainstream for this task and achieved competitive results on public datasets. Usually, these methods leverage different types of features to generate sentences, e.g., semantic information and 2D or 3D visual features. However, some methods only treat semantic information as a complement of visual representations and cannot fully exploit it; some of them ignore the relationship between different types of features. In addition, most of them select multiple frames of a video with an equally spaced sampling scheme, resulting in much redundant information. To address these issues, we present a novel video captioning framework, Semantic Enhanced video captioning with Multi-feature Fusion, SEMF for short. It optimizes the use of different types of features from three aspects. First, a semantic encoder is designed to enhance meaningful semantic features through a semantic dictionary to boost performance. Second, a discrete selection module pays attention to important features and obtains different contexts at different steps to reduce feature redundancy. Finally, a multi-feature fusion module uses a novel relation-aware attention mechanism to separate the common and complementary components of different features to provide more effective visual features for the next step. Moreover, the entire framework can be trained in an end-to-end manner. Extensive experiments are conducted on the Microsoft Research Video Description Corpus (MSVD) and MSR-Video to Text (MSR-VTT) datasets. The results demonstrate that SEMF achieves state-of-the-art results.
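The abstract above only outlines SEMF's modules at a high level, so the snippet below is a minimal PyTorch sketch of what a relation-aware fusion of two feature streams (e.g., 2D appearance and 3D motion features) might look like: a common component is estimated via cross-attention between the streams, and the residuals are treated as complementary parts before fusion. All class names, dimensions, and the common/complementary split used here are illustrative assumptions, not the authors' implementation.

# Illustrative sketch only: the paper's exact SEMF modules are not specified in
# this record, so the module name, dimensions, and fusion rule below are assumptions.
import torch
import torch.nn as nn

class RelationAwareFusion(nn.Module):
    """Toy fusion of two feature streams (e.g., 2D appearance and 3D motion)."""

    def __init__(self, dim_2d: int, dim_3d: int, dim_hidden: int):
        super().__init__()
        self.proj_2d = nn.Linear(dim_2d, dim_hidden)
        self.proj_3d = nn.Linear(dim_3d, dim_hidden)
        self.cross_attn = nn.MultiheadAttention(dim_hidden, num_heads=4, batch_first=True)
        self.fuse = nn.Linear(3 * dim_hidden, dim_hidden)

    def forward(self, feat_2d: torch.Tensor, feat_3d: torch.Tensor) -> torch.Tensor:
        # feat_2d: (batch, n_frames, dim_2d); feat_3d: (batch, n_clips, dim_3d)
        a = self.proj_2d(feat_2d)
        m = self.proj_3d(feat_3d)
        # "Common" component: what the 2D stream can attend to in the 3D stream.
        common, _ = self.cross_attn(query=a, key=m, value=m)
        # "Complementary" components: residual information unique to each stream.
        comp_2d = a - common
        comp_3d = m.mean(dim=1, keepdim=True).expand_as(a) - common
        # Concatenate and project back to the decoder's hidden size.
        return self.fuse(torch.cat([common, comp_2d, comp_3d], dim=-1))

if __name__ == "__main__":
    fusion = RelationAwareFusion(dim_2d=2048, dim_3d=1024, dim_hidden=512)
    out = fusion(torch.randn(2, 8, 2048), torch.randn(2, 4, 1024))
    print(out.shape)  # torch.Size([2, 8, 512])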

Bibliographic Details
Published in: ACM Transactions on Multimedia Computing, Communications, and Applications, 2023-11, Vol. 19 (6), p. 1-21
Main Authors: Niu, Tian-Zi; Dong, Shan-Shan; Chen, Zhen-Duo; Luo, Xin; Guo, Shanqing; Huang, Zi; Xu, Xin-Shun
Format: Article
Language: English
Subjects: Computing methodologies; Machine learning; Neural networks; Video summarization
Online Access: Full text
DOI: 10.1145/3588572
ISSN: 1551-6857
EISSN: 1551-6865
Publisher: ACM, New York, NY