Learning motion representation for real-time spatio-temporal action localization
Published in: | Pattern Recognition, 2020-07, Vol. 103, p. 107312, Article 107312 |
Main authors: | , , , , , |
Format: | Article |
Language: | eng |
Subjects: | |
Online access: | Full text |
Summary:
•A novel method to localize human actions in videos spatio-temporally by integrating an optical flow subnet. The new architecture performs action localization and optical flow estimation jointly, in an end-to-end manner (a minimal sketch follows these highlights).
•The interaction between the action detector and the flow subnet enables the detector to learn parameters from appearance and motion simultaneously, and guides the flow subnet to compute task-specific optical flow.
•An effective fusion method combines appearance and optical-flow deep features in a multi-scale fashion. The multi-scale temporal and spatial features are fused interactively to model a more discriminative spatio-temporal action representation.
•The presented method achieves real-time computation for the first time while using both RGB appearance and optical flow, and it outperforms the state-of-the-art method [1] by 1.3% in accuracy.
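The first two highlights describe an action detector and a flow subnet trained jointly in a single network. The minimal PyTorch-style sketch below illustrates the idea only; the module names (FlowSubnet, JointModel), the layer choices, and the frame count are hypothetical assumptions, not the architecture from the paper.

```python
import torch
import torch.nn as nn

class FlowSubnet(nn.Module):
    """Hypothetical flow subnet: maps K consecutive RGB frames
    (stacked along the channel axis) to K-1 two-channel flow fields."""
    def __init__(self, num_frames=5):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3 * num_frames, 64, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 2 * (num_frames - 1), 3, padding=1),
        )

    def forward(self, frames):           # frames: (B, 3*K, H, W)
        return self.net(frames)          # flow:   (B, 2*(K-1), H, W)

class JointModel(nn.Module):
    """Detector and flow subnet share one computation graph, so the
    detection loss back-propagates into the flow estimate, yielding a
    task-specific optical flow."""
    def __init__(self, num_frames=5, num_classes=24):
        super().__init__()
        self.flow_subnet = FlowSubnet(num_frames)
        self.appearance = nn.Conv2d(3 * num_frames, 128, 3, padding=1)
        self.motion = nn.Conv2d(2 * (num_frames - 1), 128, 3, padding=1)
        self.head = nn.Conv2d(256, num_classes + 4, 1)  # class scores + box offsets

    def forward(self, frames):
        flow = self.flow_subnet(frames)  # estimated in-network, no pre-computation
        feat = torch.cat([self.appearance(frames), self.motion(flow)], dim=1)
        return self.head(feat), flow
```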
Current deep-learning-based spatio-temporal action localization methods that use motion information (predominantly optical flow) achieve state-of-the-art performance. However, because the optical flow is pre-computed, these methods face two problems: computational efficiency is low, and the whole network is not end-to-end trainable. We propose a novel spatio-temporal action localization approach with an integrated optical flow sub-network to address these two issues. Specifically, our flow subnet estimates optical flow efficiently and accurately by using multiple consecutive RGB frames, rather than two adjacent frames, in a deep network; simultaneously, action localization is performed in the same network, interacting with the flow computation end-to-end. To increase speed, we exploit a neural-network-based feature fusion method in a pyramid hierarchical manner. It fuses spatial and temporal features at different granularities via a combination function (i.e., concatenation) and point-wise convolution to obtain multi-scale spatio-temporal action features. Experimental results on three publicly available datasets (UCF101-24, JHMDB, and AVA) show that, with both RGB appearance and optical flow cues, the proposed method achieves state-of-the-art performance in both efficiency and accuracy. Noticeably, it yields a significant improvement in efficiency: compared to the currently most efficient method, it is 1.9 times faster and 1.3% more accurate in video-mAP on UCF101-24. Our proposed method reaches real-time computation.
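The fusion step the abstract describes, concatenating spatial and temporal features at each pyramid level and mixing them with a point-wise (1x1) convolution, can be sketched as below. The class name PyramidFusion, the channel widths, and the number of pyramid levels are illustrative assumptions, not values from the paper.

```python
import torch
import torch.nn as nn

class PyramidFusion(nn.Module):
    """Hypothetical multi-scale fusion: at each pyramid level, concatenate
    appearance (spatial) and flow (temporal) features, then mix them with
    a point-wise 1x1 convolution."""
    def __init__(self, channels=(64, 128, 256)):
        super().__init__()
        self.fuse = nn.ModuleList(
            nn.Conv2d(2 * c, c, kernel_size=1) for c in channels
        )

    def forward(self, spatial_feats, temporal_feats):
        # Each argument: one tensor per scale, shaped (B, C_l, H_l, W_l).
        return [conv(torch.cat([s, t], dim=1))
                for conv, s, t in zip(self.fuse, spatial_feats, temporal_feats)]

# Usage with dummy three-level feature pyramids:
fusion = PyramidFusion()
spatial = [torch.randn(1, c, 32 // 2 ** i, 32 // 2 ** i)
           for i, c in enumerate((64, 128, 256))]
temporal = [torch.randn_like(s) for s in spatial]
print([f.shape for f in fusion(spatial, temporal)])
```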
ISSN: | 0031-3203, 1873-5142 |
DOI: | 10.1016/j.patcog.2020.107312 |