Capturing Temporal Structures for Video Captioning by Spatio-temporal Contexts and Channel Attention Mechanism

To generate a natural language description for videos, there has been tremendous interest in developing deep neural networks with the integration of temporal structures in different categories. Considering the spatial and temporal domains inherent in video frames, we contend that the video dynamics...

Ausführliche Beschreibung

Gespeichert in:

Bibliographische Detailangaben
Veröffentlicht in:	Neural processing letters 2017-08, Vol.46 (1), p.313-328
Hauptverfasser:	Guo, Dashan, Li, Wei, Fang, Xiangzhong
Format:	Artikel
Sprache:	eng
Schlagworte:	Artificial Intelligence Artificial neural networks Back propagation Classification Complex Systems Computational Intelligence Computer Science Convolution Deep learning Dynamic structural analysis Feature maps Investigations Neural networks Semantics
Online-Zugang:	Volltext
Tags:	Tag hinzufügen Keine Tags, Fügen Sie den ersten Tag hinzu!

Beschreibung
Zusammenfassung:	To generate a natural language description for videos, there has been tremendous interest in developing deep neural networks with the integration of temporal structures in different categories. Considering the spatial and temporal domains inherent in video frames, we contend that the video dynamics and the spatio-temporal contexts are both important for captioning, which correspond to two different temporal structures. However, while the video dynamics is well investigated, the spatio-temporal contexts have not been given sufficient attention. In this paper, we take both structures into account and propose a novel recurrent convolution model for captioning. Firstly, for a comprehensive and detailed representation, we propose to aggregate the local and global spatio-temporal contexts in the recurrent convolution networks. Secondly, to capture much subtler temporal dynamics, the channel attention mechanism is introduced and it helps to understand the involvement of the frame feature maps with the captioning process. Finally, a qualitative comparison with several variants of our model demonstrates the effectiveness of incorporating these two structures. Moreover, experiments on YouTube2Text dataset have shown that the proposed method achieves competitive performance to other state-of-the-art methods.
ISSN:	1370-4621 1573-773X
DOI:	10.1007/s11063-017-9591-9