Capturing Temporal Structures for Video Captioning by Spatio-temporal Contexts and Channel Attention Mechanism

To generate natural language descriptions for videos, there has been tremendous interest in developing deep neural networks that integrate temporal structures of different categories. Considering the spatial and temporal domains inherent in video frames, we contend that the video dynamics and the spatio-temporal contexts are both important for captioning, and that they correspond to two different temporal structures. However, while the video dynamics have been well investigated, the spatio-temporal contexts have not received sufficient attention. In this paper, we take both structures into account and propose a novel recurrent convolution model for captioning. First, for a comprehensive and detailed representation, we propose to aggregate the local and global spatio-temporal contexts in the recurrent convolution networks. Second, to capture much subtler temporal dynamics, a channel attention mechanism is introduced, which helps to reveal how the frame feature maps are involved in the captioning process. Finally, a qualitative comparison with several variants of our model demonstrates the effectiveness of incorporating these two structures. Moreover, experiments on the YouTube2Text dataset show that the proposed method achieves performance competitive with other state-of-the-art methods.

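The record reproduces only the high-level abstract above; the paper's exact formulation of the channel attention mechanism is not included here. Purely as an illustration of the general idea (re-weighting the channels of a frame feature map with the caption decoder's hidden state), a minimal PyTorch sketch might look like the following. The module name, layer sizes, and the sigmoid gating are assumptions for illustration, not the authors' implementation.

```python
# Minimal sketch of channel attention for video captioning.
# Illustration only: module name, sizes, and the sigmoid gate are
# assumptions, not the formulation of Guo, Li, and Fang (2017).
import torch
import torch.nn as nn


class ChannelAttention(nn.Module):
    """Re-weight the channels of a frame feature map using the caption
    decoder's hidden state, emphasizing channels relevant to the next word."""

    def __init__(self, num_channels: int, hidden_size: int):
        super().__init__()
        # One gate value per feature-map channel, predicted from the decoder state.
        self.gate = nn.Linear(hidden_size, num_channels)

    def forward(self, feature_map: torch.Tensor, hidden: torch.Tensor) -> torch.Tensor:
        # feature_map: (batch, channels, height, width) CNN activations of one frame
        # hidden:      (batch, hidden_size) current decoder hidden state
        weights = torch.sigmoid(self.gate(hidden))      # (batch, channels)
        weights = weights.unsqueeze(-1).unsqueeze(-1)   # (batch, channels, 1, 1)
        return feature_map * weights                    # channel-wise re-weighting


# Hypothetical usage:
attn = ChannelAttention(num_channels=512, hidden_size=256)
frames = torch.randn(2, 512, 7, 7)   # batch of frame feature maps
state = torch.randn(2, 256)          # decoder hidden state
weighted = attn(frames, state)       # same shape as `frames`
```

Gating each channel separately lets the decoder emphasize the feature-map channels that matter for the word currently being generated, which is the role the abstract assigns to the channel attention mechanism.
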
Bibliographic Details
Published in: Neural processing letters, 2017-08, Vol. 46 (1), p. 313-328
Main Authors: Guo, Dashan; Li, Wei; Fang, Xiangzhong
Format: Article
Language: English
Online Access: Full text
DOI: 10.1007/s11063-017-9591-9
Publisher: Springer US (New York)
ISSN: 1370-4621
EISSN: 1573-773X
Source: Springer Nature - Complete Springer Journals; ProQuest Central UK/Ireland; ProQuest Central
Subjects:
Artificial Intelligence
Artificial neural networks
Back propagation
Classification
Complex Systems
Computational Intelligence
Computer Science
Convolution
Deep learning
Dynamic structural analysis
Feature maps
Investigations
Neural networks
Semantics
URL: https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-20T02%3A39%3A36IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_cross&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Capturing%20Temporal%20Structures%20for%20Video%20Captioning%20by%20Spatio-temporal%20Contexts%20and%20Channel%20Attention%20Mechanism&rft.jtitle=Neural%20processing%20letters&rft.au=Guo,%20Dashan&rft.date=2017-08-01&rft.volume=46&rft.issue=1&rft.spage=313&rft.epage=328&rft.pages=313-328&rft.issn=1370-4621&rft.eissn=1573-773X&rft_id=info:doi/10.1007/s11063-017-9591-9&rft_dat=%3Cproquest_cross%3E2918338631%3C/proquest_cross%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=2918338631&rft_id=info:pmid/&rfr_iscdi=true