Enhancing the alignment between target words and corresponding frames for video captioning


Bibliographic details
Published in: Pattern recognition, 2021-03, Vol.111, p.107702, Article 107702
Main authors: Tu, Yunbin; Zhou, Chang; Guo, Junjun; Gao, Shengxiang; Yu, Zhengtao
Format: Article
Language: English
Subjects:
Online access: Full text
container_start_page 107702
container_title Pattern recognition
container_volume 111
creator Tu, Yunbin
Zhou, Chang
Guo, Junjun
Gao, Shengxiang
Yu, Zhengtao
description •Visual tags are introduced to bridge the gap between vision and language.
•A textual-temporal attention model is devised and incorporated into the decoder to build exact alignment between target words and corresponding frames.
•Extensive experiments on two well-known datasets, i.e., MSVD and MSR-VTT, demonstrate that our proposed approach achieves remarkable improvements over state-of-the-art methods.
Video captioning aims to translate a sequence of video frames into a sequence of words within the encoder-decoder framework. Hence, it is critical to align these two different sequences. Most existing methods exploit a soft-attention (temporal attention) mechanism to align target words with the corresponding frames, where their relevance depends solely on the previously generated words (i.e., the language context). However, there is an inherent gap between vision and language, and most of the words in a caption are non-visual words (e.g., “a”, “is”, and “in”). Hence, guided by the language context alone, existing temporal attention-based methods cannot exactly align target words with the corresponding frames. To address this problem, we first introduce pre-detected visual tags from the video to bridge the gap between vision and language; visual tags not only belong to the textual modality but also convey visual information. We then present a Textual-Temporal Attention Model (TTA) to exactly align the target words with the corresponding frames. The experimental results show that our proposed method outperforms the state-of-the-art methods on two well-known datasets, i.e., MSVD and MSR-VTT. Our code is available at https://github.com/tuyunbin/Enhancing-the-Alignment-between-Target-Words-and-Corresponding-Frames-for-Video-Captioning
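The abstract describes replacing purely language-context-driven soft attention with a textual-temporal attention that also consults pre-detected visual tags when weighting frames. The PyTorch sketch below illustrates that idea only; the class names, layer sizes, and the concatenation-based fusion of the tag context with the decoder state are illustrative assumptions, not the authors' exact TTA formulation (their implementation is linked in the abstract above).

```python
# Minimal sketch of temporal vs. textual-temporal attention (assumptions noted above).
import torch
import torch.nn as nn
import torch.nn.functional as F


class TemporalAttention(nn.Module):
    """Standard soft attention: frame weights depend only on the decoder state (language context)."""

    def __init__(self, frame_dim, hidden_dim, attn_dim=256):
        super().__init__()
        self.w_frame = nn.Linear(frame_dim, attn_dim)
        self.w_hidden = nn.Linear(hidden_dim, attn_dim)
        self.v = nn.Linear(attn_dim, 1)

    def forward(self, frames, hidden):
        # frames: (batch, n_frames, frame_dim); hidden: (batch, hidden_dim)
        scores = self.v(torch.tanh(self.w_frame(frames) + self.w_hidden(hidden).unsqueeze(1)))
        alpha = F.softmax(scores, dim=1)            # (batch, n_frames, 1)
        return (alpha * frames).sum(dim=1), alpha   # attended frame context


class TextualTemporalAttention(nn.Module):
    """Augments temporal attention with a textual context built from pre-detected
    visual tags, so frame weights are not driven by the language context alone."""

    def __init__(self, frame_dim, tag_dim, hidden_dim, attn_dim=256):
        super().__init__()
        # Textual attention over tag embeddings, guided by the decoder state.
        self.w_tag = nn.Linear(tag_dim, attn_dim)
        self.w_hidden_t = nn.Linear(hidden_dim, attn_dim)
        self.v_t = nn.Linear(attn_dim, 1)
        # Temporal attention over frames, guided by decoder state + tag context.
        self.w_frame = nn.Linear(frame_dim, attn_dim)
        self.w_query = nn.Linear(hidden_dim + tag_dim, attn_dim)
        self.v_f = nn.Linear(attn_dim, 1)

    def forward(self, frames, tags, hidden):
        # frames: (batch, n_frames, frame_dim); tags: (batch, n_tags, tag_dim)
        # 1) Attend over visual tags to obtain a textual context vector.
        tag_scores = self.v_t(torch.tanh(self.w_tag(tags) + self.w_hidden_t(hidden).unsqueeze(1)))
        beta = F.softmax(tag_scores, dim=1)
        tag_ctx = (beta * tags).sum(dim=1)          # (batch, tag_dim)
        # 2) Use decoder state + tag context as the query for frame attention.
        query = torch.cat([hidden, tag_ctx], dim=-1)
        frame_scores = self.v_f(torch.tanh(self.w_frame(frames) + self.w_query(query).unsqueeze(1)))
        alpha = F.softmax(frame_scores, dim=1)
        frame_ctx = (alpha * frames).sum(dim=1)     # (batch, frame_dim)
        return frame_ctx, tag_ctx, alpha


if __name__ == "__main__":
    B, T, K = 2, 26, 10                             # batch, frames, tags (illustrative sizes)
    frames = torch.randn(B, T, 2048)                # e.g. CNN frame features
    tags = torch.randn(B, K, 300)                   # e.g. tag word embeddings
    hidden = torch.randn(B, 512)                    # decoder hidden state
    tta = TextualTemporalAttention(2048, 300, 512)
    frame_ctx, tag_ctx, alpha = tta(frames, tags, hidden)
    print(frame_ctx.shape, tag_ctx.shape, alpha.shape)
```

The design point the abstract makes is visible in the query construction: temporal attention scores frames from the decoder state alone, while the textual-temporal variant folds a tag-derived context into the query, so even non-visual target words can be aligned with frames through textual cues.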
doi_str_mv 10.1016/j.patcog.2020.107702
format Article
fulltext fulltext
identifier ISSN: 0031-3203
ispartof Pattern recognition, 2021-03, Vol.111, p.107702, Article 107702
issn 0031-3203
1873-5142
language eng
recordid cdi_webofscience_primary_000601159400011
source Web of Science - Science Citation Index Expanded - 2021; Access via ScienceDirect (Elsevier)
subjects Alignment
Computer Science
Computer Science, Artificial Intelligence
Engineering
Engineering, Electrical & Electronic
Science & Technology
Technology
Textual-temporal attention
Video captioning
Visual tags
title Enhancing the alignment between target words and corresponding frames for video captioning
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2024-12-16T07%3A38%3A59IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-elsevier_webof&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Enhancing%20the%20alignment%20between%20target%20words%20and%20corresponding%20frames%20for%20video%20captioning&rft.jtitle=Pattern%20recognition&rft.au=Tu,%20Yunbin&rft.date=2021-03&rft.volume=111&rft.spage=107702&rft.pages=107702-&rft.artnum=107702&rft.issn=0031-3203&rft.eissn=1873-5142&rft_id=info:doi/10.1016/j.patcog.2020.107702&rft_dat=%3Celsevier_webof%3ES0031320320305057%3C/elsevier_webof%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_id=info:pmid/&rft_els_id=S0031320320305057&rfr_iscdi=true