Learning motion representation for real-time spatio-temporal action localization

Detailed description

Bibliographic details
Published in: Pattern Recognition, 2020-07, Vol. 103, p. 107312, Article 107312
Main authors: Zhang, Dejun, He, Linchao, Tu, Zhigang, Zhang, Shifu, Han, Fei, Yang, Boxiong
Format: Article
Language: English
Online access: Full text
Abstract:
•Proposing a novel method to localize human actions in videos spatio-temporally by integrating an optical flow subnet. The new architecture performs action localization and optical flow estimation jointly in an end-to-end manner.
•The interaction between the action detector and the flow subnet enables the detector to learn parameters from appearance and motion simultaneously, and guides the flow subnet to compute task-specific optical flow.
•Exploiting an effective fusion method to fuse appearance and optical flow deep features in a multi-scale fashion. The multi-scale temporal and spatial features are combined interactively to model a more discriminative spatio-temporal action representation.
•The presented method achieves real-time computation for the first time while using both RGB appearance and optical flow. It outperforms the state-of-the-art method [1] by 1.3% in accuracy.

Current deep learning based spatio-temporal action localization methods that use motion information (predominantly optical flow) obtain state-of-the-art performance. However, because the optical flow is pre-computed, these methods face two problems: computational efficiency is low, and the whole network is not end-to-end trainable. We propose a novel spatio-temporal action localization approach with an integrated optical flow sub-network to address these two issues. Specifically, our flow subnet estimates optical flow efficiently and accurately by using multiple consecutive RGB frames rather than two adjacent frames in a deep network; simultaneously, action localization is implemented in the same network, interacting with the flow computation end-to-end. To increase speed, we exploit a neural network based feature fusion method in a pyramid hierarchical manner. It fuses spatial and temporal features at different granularities via a combination function (i.e., concatenation) and point-wise convolution to obtain multi-scale spatio-temporal action features. Experimental results on three publicly available datasets (UCF101-24, JHMDB and AVA) show that, with both RGB appearance and optical flow cues, the proposed method achieves state-of-the-art performance in both efficiency and accuracy. Noticeably, it yields a significant improvement in efficiency: compared to the currently most efficient method, it is 1.9 times faster and 1.3% more accurate in video-mAP on UCF101-24. Our proposed method reaches real-time computation.
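The abstract states that the flow subnet consumes multiple consecutive RGB frames rather than a single adjacent pair. A common way to feed K frames into a convolutional subnet is to stack them along the channel axis; the sketch below (PyTorch) illustrates only that input packing, with an assumed K = 5 and a placeholder convolutional stem, not the paper's actual architecture.

```python
# Minimal sketch of feeding K consecutive RGB frames to a flow subnet by
# channel stacking. K and the stem layer are illustrative assumptions.
import torch
import torch.nn as nn

K = 5  # assumed number of consecutive frames
frames = torch.randn(1, K, 3, 224, 224)           # (batch, time, RGB, H, W)
stacked = frames.flatten(start_dim=1, end_dim=2)  # (batch, K*3, H, W)

# Placeholder first layer of a hypothetical flow subnet.
stem = nn.Conv2d(K * 3, 64, kernel_size=7, stride=2, padding=3)
features = stem(stacked)
print(features.shape)  # torch.Size([1, 64, 112, 112])
```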
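The fusion step is described concretely enough to sketch: at each pyramid level, concatenate the appearance (RGB) and motion (flow) feature maps along the channel axis, then mix them with a point-wise (1x1) convolution. The module and sizes below (FusionBlock, the channel counts, the three levels) are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of multi-scale appearance/motion fusion via channel
# concatenation followed by a point-wise (1x1) convolution per pyramid level.
import torch
import torch.nn as nn

class FusionBlock(nn.Module):
    """Fuse one pyramid level: concat along channels, then 1x1 conv."""
    def __init__(self, rgb_channels: int, flow_channels: int, out_channels: int):
        super().__init__()
        # Point-wise convolution mixes the concatenated features per location.
        self.pointwise = nn.Conv2d(rgb_channels + flow_channels,
                                   out_channels, kernel_size=1)
        self.act = nn.ReLU(inplace=True)

    def forward(self, rgb_feat, flow_feat):
        fused = torch.cat([rgb_feat, flow_feat], dim=1)  # channel concatenation
        return self.act(self.pointwise(fused))

# Toy usage: three assumed pyramid levels with halving spatial resolution.
if __name__ == "__main__":
    levels = [(64, 64, 56), (128, 128, 28), (256, 256, 14)]  # (C_rgb, C_flow, H=W)
    blocks = [FusionBlock(cr, cf, cr) for cr, cf, _ in levels]
    for (cr, cf, hw), block in zip(levels, blocks):
        rgb = torch.randn(1, cr, hw, hw)   # appearance features at this scale
        flow = torch.randn(1, cf, hw, hw)  # motion features at this scale
        print(block(rgb, flow).shape)      # torch.Size([1, C_rgb, hw, hw])
```

A 1x1 convolution is the cheapest way to learn a weighted combination of the two streams at every spatial position, which fits the abstract's emphasis on keeping the fusion fast.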
ISSN: 0031-3203, 1873-5142
DOI: 10.1016/j.patcog.2020.107312