Enhancing the alignment between target words and corresponding frames for video captioning
• Visual tags are introduced to bridge the gap between vision and language.
• A textual-temporal attention model is devised and incorporated into the decoder to build exact alignment between target words and corresponding frames.
• Extensive experiments on two well-known datasets, i.e., MSVD and MSR-VTT, demonstrate that our proposed approach achieves remarkable improvements over the state-of-the-art methods.
Saved in:

Published in: | Pattern recognition, 2021-03, Vol. 111, p. 107702, Article 107702 |
---|---|
Main authors: | Tu, Yunbin; Zhou, Chang; Guo, Junjun; Gao, Shengxiang; Yu, Zhengtao |
Format: | Article |
Language: | eng |
Subjects: | Alignment; Textual-temporal attention; Video captioning; Visual tags |
Online access: | Full text |
container_end_page | |
---|---|
container_issue | |
container_start_page | 107702 |
container_title | Pattern recognition |
container_volume | 111 |
creator | Tu, Yunbin; Zhou, Chang; Guo, Junjun; Gao, Shengxiang; Yu, Zhengtao |
description | • Visual tags are introduced to bridge the gap between vision and language. • A textual-temporal attention model is devised and incorporated into the decoder to build exact alignment between target words and corresponding frames. • Extensive experiments on two well-known datasets, i.e., MSVD and MSR-VTT, demonstrate that our proposed approach achieves remarkable improvements over the state-of-the-art methods.
Video captioning aims to translate a sequence of video frames into a sequence of words using the encoder-decoder framework. Hence, it is critical to align these two different sequences. Most existing methods exploit a soft-attention (temporal attention) mechanism to align target words with corresponding frames, where their relevance depends only on the previously generated words (i.e., the language context). However, there is an inherent gap between vision and language, and most of the words in a caption are non-visual words (e.g., "a", "is", and "in"). Hence, guided by the language context alone, existing temporal attention-based methods cannot exactly align target words with corresponding frames. To address this problem, we first introduce pre-detected visual tags from the video to bridge the gap between vision and language: visual tags not only belong to the textual modality but also convey visual information. Then, we present a Textual-Temporal Attention Model (TTA) to exactly align the target words with corresponding frames. The experimental results show that our proposed method outperforms the state-of-the-art methods on two well-known datasets, i.e., MSVD and MSR-VTT. Our code is available at https://github.com/tuyunbin/Enhancing-the-Alignment-between-Target-Words-and-Corresponding-Frames-for-Video-Captioning |
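As a rough illustration of the mechanism described in the abstract, the sketch below shows how a single decoder step might weight video frames using a query built from both the language context (the decoder hidden state) and a pooled embedding of pre-detected visual tags. This is not the authors' released implementation (see the GitHub link above for that); the module name, dimensions, and the way the tag embedding is pooled are illustrative assumptions.

```python
import torch
import torch.nn as nn


class TextualTemporalAttention(nn.Module):
    """Weight video frames with a query built from the decoder state
    (language context) plus a pooled embedding of detected visual tags.
    Dropping the tag term reduces this to plain temporal attention,
    which depends on the language context alone."""

    def __init__(self, frame_dim, hidden_dim, tag_dim, attn_dim):
        super().__init__()
        self.w_frame = nn.Linear(frame_dim, attn_dim, bias=False)
        self.w_hidden = nn.Linear(hidden_dim, attn_dim, bias=False)
        self.w_tag = nn.Linear(tag_dim, attn_dim, bias=False)
        self.score = nn.Linear(attn_dim, 1, bias=False)

    def forward(self, frames, hidden, tag_emb):
        # frames:  (batch, num_frames, frame_dim)  per-frame encoder features
        # hidden:  (batch, hidden_dim)             decoder state at this word
        # tag_emb: (batch, tag_dim)                pooled visual-tag embedding
        query = self.w_hidden(hidden) + self.w_tag(tag_emb)           # (B, A)
        energies = self.score(torch.tanh(self.w_frame(frames)
                                         + query.unsqueeze(1)))       # (B, T, 1)
        weights = torch.softmax(energies.squeeze(-1), dim=1)          # (B, T)
        context = torch.bmm(weights.unsqueeze(1), frames).squeeze(1)  # (B, frame_dim)
        return context, weights


if __name__ == "__main__":
    # Toy shapes (assumed): 26 frames of 2048-d CNN features, a 512-d decoder
    # state, and a 300-d tag embedding (e.g., averaged vectors of detected tags).
    attn = TextualTemporalAttention(2048, 512, 300, 256)
    frames = torch.randn(2, 26, 2048)
    hidden = torch.randn(2, 512)
    tags = torch.randn(2, 300)
    context, weights = attn(frames, hidden, tags)
    print(context.shape, weights.shape)  # torch.Size([2, 2048]) torch.Size([2, 26])
```

In such a setup, the attended context vector would then be fed back into the decoder together with the language features to predict the next word, which is the role the paper assigns to its textual-temporal attention module inside the decoder.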
doi_str_mv | 10.1016/j.patcog.2020.107702 |
format | Article |
fulltext | fulltext |
identifier | ISSN: 0031-3203 |
ispartof | Pattern recognition, 2021-03, Vol.111, p.107702, Article 107702 |
issn | 0031-3203 1873-5142 |
language | eng |
recordid | cdi_webofscience_primary_000601159400011 |
source | Web of Science - Science Citation Index Expanded - 2021; Access via ScienceDirect (Elsevier) |
subjects | Alignment; Computer Science; Computer Science, Artificial Intelligence; Engineering; Engineering, Electrical & Electronic; Science & Technology; Technology; Textual-temporal attention; Video captioning; Visual tags |
title | Enhancing the alignment between target words and corresponding frames for video captioning |
url | https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2024-12-16T07%3A38%3A59IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-elsevier_webof&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Enhancing%20the%20alignment%20between%20target%20words%20and%20corresponding%20frames%20for%20video%20captioning&rft.jtitle=Pattern%20recognition&rft.au=Tu,%20Yunbin&rft.date=2021-03&rft.volume=111&rft.spage=107702&rft.pages=107702-&rft.artnum=107702&rft.issn=0031-3203&rft.eissn=1873-5142&rft_id=info:doi/10.1016/j.patcog.2020.107702&rft_dat=%3Celsevier_webof%3ES0031320320305057%3C/elsevier_webof%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_id=info:pmid/&rft_els_id=S0031320320305057&rfr_iscdi=true |