Multi-Stream and Enhanced Spatial-Temporal Graph Convolution Network for Skeleton-Based Action Recognition
| Published in: | IEEE Access, 2020, Vol. 8, pp. 97757-97770 |
|---|---|
| Authors: | , , , , |
| Format: | Article |
| Language: | English |
| Online access: | Full text |
| Abstract: | In skeleton-based human action recognition, spatial-temporal graph convolution networks (ST-GCNs) have recently achieved remarkable performance. However, how to extract more discriminative spatial and temporal features remains an open problem. The temporal graph convolution of traditional ST-GCNs uses only one fixed kernel, which cannot cover all the important stages of each action execution. Moreover, the spatial and temporal graph convolution layers (GCLs) are serially connected, which mixes information from different domains and limits the feature extraction capability. In addition, existing methods model input features such as joints, bones, and their motions, but more input features are needed for better performance. To this end, this article proposes a novel multi-stream and enhanced spatial-temporal graph convolution network (MS-ESTGCN). In each basic block of MS-ESTGCN, densely connected temporal GCLs with different kernel sizes are employed to aggregate more temporal features. To eliminate the adverse impact of information mixing, an additional spatial GCL branch is added to the block so that the spatial features are enhanced. Furthermore, we extend the input features with the relative positions of joints and bones. Consequently, six data modalities in total (joints, bones, their motions, and their relative positions) can be fed into the network independently in a six-stream paradigm. The proposed method is evaluated on two large-scale datasets: NTU-RGB+D and Kinetics-Skeleton. The experimental results show that our method using only two data modalities delivers state-of-the-art performance, and our methods using four and six data modalities exceed other methods by a significant margin. |
|---|---|
ISSN: | 2169-3536 |
DOI: | 10.1109/ACCESS.2020.2996779 |
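The six input modalities named in the abstract (joints, bones, their motions, and relative positions) can be sketched from a raw joint sequence. The snippet below is a minimal illustration, not the paper's implementation: it assumes the skeleton is stored as a `(T, V, C)` array, that bones are defined by a parent-joint list, and that "relative position" means each joint's (or bone's) offset from a reference index — all of which are assumptions beyond what the abstract states.

```python
import numpy as np

def build_modalities(joints, parents, ref=0):
    """Derive the remaining five modalities from a joint sequence.

    joints:  (T, V, C) array — V joints with C coordinates over T frames.
    parents: length-V list giving each joint's parent index (root points to itself).
    ref:     reference index used for the relative-position modalities (assumed).
    """
    bones = joints - joints[:, parents, :]        # bone vectors: child minus parent
    joint_motion = np.diff(joints, axis=0)        # frame-to-frame joint displacement
    bone_motion = np.diff(bones, axis=0)          # frame-to-frame bone displacement
    joint_rel = joints - joints[:, ref:ref + 1, :]  # joints relative to reference joint
    bone_rel = bones - bones[:, ref:ref + 1, :]     # bones relative to reference bone
    return bones, joint_motion, bone_motion, joint_rel, bone_rel
```

In a multi-stream setup of the kind the abstract describes, each modality would be fed to its own network stream and the stream outputs fused for the final prediction.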