Video Saliency Prediction Using Spatiotemporal Residual Attentive Networks

Bibliographic Details
Published in: IEEE Transactions on Image Processing, 2020-01, Vol. 29, pp. 1113-1126
Main Authors: Lai, Qiuxia; Wang, Wenguan; Sun, Hanqiu; Shen, Jianbing
Format: Article
Language: English
Description
Summary: This paper proposes a novel residual attentive learning network architecture for predicting dynamic eye-fixation maps. The proposed model addresses two essential issues: effective spatiotemporal feature integration and multi-scale saliency learning. For the first, appearance and motion streams are tightly coupled via dense residual cross connections, which integrate appearance information with multi-layer, comprehensive motion features in a residual and dense manner. Unlike traditional two-stream models that learn appearance and motion features separately, this design allows early, multi-path information exchange between the two domains, leading to a unified and powerful spatiotemporal learning architecture. For the second, we propose a composite attention mechanism that learns multi-scale local attentions and global attention priors end-to-end, and use it to enhance the fused spatiotemporal features by emphasizing important features at multiple scales. A lightweight convolutional Gated Recurrent Unit (convGRU), which copes well with small training sets, models long-term temporal characteristics. Extensive experiments on four benchmark datasets clearly demonstrate the advantage of the proposed video saliency model over its competitors and the effectiveness of each component of the network. Our code and all results will be available at https://github.com/ashleylqx/STRA-Net.
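As a rough illustration of the lightweight recurrent unit mentioned in the summary, the sketch below shows a minimal convolutional GRU cell in PyTorch. The class name, kernel size, and channel arguments are assumptions chosen for illustration; they are not taken from the authors' STRA-Net implementation, which is available at the repository linked above.

```python
import torch
import torch.nn as nn

class ConvGRUCell(nn.Module):
    """Minimal convolutional GRU cell: convolutions replace the dense gates of a standard GRU."""
    def __init__(self, in_channels, hidden_channels, kernel_size=3):
        super().__init__()
        padding = kernel_size // 2
        # One convolution produces both the update (z) and reset (r) gates.
        self.gates = nn.Conv2d(in_channels + hidden_channels,
                               2 * hidden_channels, kernel_size, padding=padding)
        # Convolution for the candidate hidden state.
        self.candidate = nn.Conv2d(in_channels + hidden_channels,
                                   hidden_channels, kernel_size, padding=padding)
        self.hidden_channels = hidden_channels

    def forward(self, x, h_prev=None):
        # x: (B, C_in, H, W) feature map of the current frame; h_prev: previous hidden state.
        if h_prev is None:
            h_prev = torch.zeros(x.size(0), self.hidden_channels,
                                 x.size(2), x.size(3), device=x.device)
        z, r = torch.sigmoid(self.gates(torch.cat([x, h_prev], dim=1))).chunk(2, dim=1)
        h_tilde = torch.tanh(self.candidate(torch.cat([x, r * h_prev], dim=1)))
        # Convex combination of previous state and candidate, as in a standard GRU.
        return (1 - z) * h_prev + z * h_tilde
```

Applied frame by frame to the fused spatiotemporal features (e.g., `h = cell(feat_t, h)` in a loop over frames), such a cell accumulates long-term temporal context while keeping the parameter count far below that of a fully connected recurrent layer.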
ISSN: 1057-7149, 1941-0042
DOI: 10.1109/TIP.2019.2936112