Enhancing the alignment between target words and corresponding frames for video captioning
• Visual tags are introduced to bridge the gap between vision and language.
• A textual-temporal attention model is devised and incorporated into the decoder to build exact alignment between target words and corresponding frames.
• Extensive experiments on two well-known datasets, i.e., MSVD and MSR-VTT, demonstrate that our proposed approach achieves remarkable improvements over the state-of-the-art methods.
Saved in:

Published in: | Pattern recognition, 2021-03, Vol. 111, p. 107702, Article 107702 |
---|---|
Main authors: | Tu, Yunbin; Zhou, Chang; Guo, Junjun; Gao, Shengxiang; Yu, Zhengtao |
Format: | Article |
Language: | eng |
Subjects: | Alignment; Textual-temporal attention; Video captioning; Visual tags |
Online access: | Full text |
container_end_page | |
---|---|
container_issue | |
container_start_page | 107702 |
container_title | Pattern recognition |
container_volume | 111 |
creator | Tu, Yunbin; Zhou, Chang; Guo, Junjun; Gao, Shengxiang; Yu, Zhengtao |
description | • Visual tags are introduced to bridge the gap between vision and language. • A textual-temporal attention model is devised and incorporated into the decoder to build exact alignment between target words and corresponding frames. • Extensive experiments on two well-known datasets, i.e., MSVD and MSR-VTT, demonstrate that our proposed approach achieves remarkable improvements over the state-of-the-art methods.
Video captioning aims to translate a sequence of video frames into a sequence of words using the encoder-decoder framework. Hence, it is critical to align these two different sequences. Most existing methods exploit a soft-attention (temporal attention) mechanism to align target words with corresponding frames, where their relevance depends only on the previously generated words (i.e., the language context). However, there is an inherent gap between vision and language, and most of the words in a caption are non-visual words (e.g., "a", "is", and "in"). Hence, guided by the language context alone, existing temporal attention-based methods cannot exactly align target words with corresponding frames. To address this problem, we first introduce pre-detected visual tags from the video to bridge the gap between vision and language: visual tags not only belong to the textual modality but also convey visual information. Then, we present a Textual-Temporal Attention Model (TTA) to exactly align the target words with corresponding frames. The experimental results show that our proposed method outperforms the state-of-the-art methods on two well-known datasets, i.e., MSVD and MSR-VTT. Our code is available at https://github.com/tuyunbin/Enhancing-the-Alignment-between-Target-Words-and-Corresponding-Frames-for-Video-Captioning |
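As a rough illustration of the mechanism described in the abstract, the sketch below shows how a single decoder step might weight video frames using a query built from both the language context (the decoder hidden state) and a pooled embedding of pre-detected visual tags. This is not the authors' released implementation (see the GitHub link above for that); the module name, dimensions, and the way the tag embedding is pooled are illustrative assumptions.

```python
import torch
import torch.nn as nn


class TextualTemporalAttention(nn.Module):
    """Weight video frames with a query built from the decoder state
    (language context) plus a pooled embedding of detected visual tags.
    Dropping the tag term reduces this to plain temporal attention,
    which depends on the language context alone."""

    def __init__(self, frame_dim, hidden_dim, tag_dim, attn_dim):
        super().__init__()
        self.w_frame = nn.Linear(frame_dim, attn_dim, bias=False)
        self.w_hidden = nn.Linear(hidden_dim, attn_dim, bias=False)
        self.w_tag = nn.Linear(tag_dim, attn_dim, bias=False)
        self.score = nn.Linear(attn_dim, 1, bias=False)

    def forward(self, frames, hidden, tag_emb):
        # frames:  (batch, num_frames, frame_dim)  per-frame encoder features
        # hidden:  (batch, hidden_dim)             decoder state at this word
        # tag_emb: (batch, tag_dim)                pooled visual-tag embedding
        query = self.w_hidden(hidden) + self.w_tag(tag_emb)           # (B, A)
        energies = self.score(torch.tanh(self.w_frame(frames)
                                         + query.unsqueeze(1)))       # (B, T, 1)
        weights = torch.softmax(energies.squeeze(-1), dim=1)          # (B, T)
        context = torch.bmm(weights.unsqueeze(1), frames).squeeze(1)  # (B, frame_dim)
        return context, weights


if __name__ == "__main__":
    # Toy shapes (assumed): 26 frames of 2048-d CNN features, a 512-d decoder
    # state, and a 300-d tag embedding (e.g., averaged vectors of detected tags).
    attn = TextualTemporalAttention(2048, 512, 300, 256)
    frames = torch.randn(2, 26, 2048)
    hidden = torch.randn(2, 512)
    tags = torch.randn(2, 300)
    context, weights = attn(frames, hidden, tags)
    print(context.shape, weights.shape)  # torch.Size([2, 2048]) torch.Size([2, 26])
```

In such a setup, the attended context vector would then be fed back into the decoder together with the language features to predict the next word, which is the role the paper assigns to its textual-temporal attention module inside the decoder.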
doi_str_mv | 10.1016/j.patcog.2020.107702 |
format | Article |
fulltext | fulltext |
identifier | ISSN: 0031-3203 |
ispartof | Pattern recognition, 2021-03, Vol.111, p.107702, Article 107702 |
issn | 0031-3203 1873-5142 |
language | eng |
recordid | cdi_webofscience_primary_000601159400011 |
source | Web of Science - Science Citation Index Expanded - 2021; Access via ScienceDirect (Elsevier) |
subjects | Alignment; Computer Science; Computer Science, Artificial Intelligence; Engineering; Engineering, Electrical & Electronic; Science & Technology; Technology; Textual-temporal attention; Video captioning; Visual tags |
title | Enhancing the alignment between target words and corresponding frames for video captioning |
url | https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2024-12-16T07%3A38%3A59IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-elsevier_webof&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Enhancing%20the%20alignment%20between%20target%20words%20and%20corresponding%20frames%20for%20video%20captioning&rft.jtitle=Pattern%20recognition&rft.au=Tu,%20Yunbin&rft.date=2021-03&rft.volume=111&rft.spage=107702&rft.pages=107702-&rft.artnum=107702&rft.issn=0031-3203&rft.eissn=1873-5142&rft_id=info:doi/10.1016/j.patcog.2020.107702&rft_dat=%3Celsevier_webof%3ES0031320320305057%3C/elsevier_webof%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_id=info:pmid/&rft_els_id=S0031320320305057&rfr_iscdi=true |