Symmetrical Enhanced Fusion Network for Skeleton-Based Action Recognition



Bibliographic Details
Published in: IEEE Transactions on Circuits and Systems for Video Technology, 2021-11, Vol. 31 (11), pp. 4394-4408
Authors: Kong, Jun; Deng, Haoyang; Jiang, Min
Format: Article
Language: English
Description
Abstract: A novel method for skeleton-based action recognition that fuses multi-level spatial features and multi-level temporal features is proposed in this article. Recently, the Graph Convolutional Network (GCN) has attracted the attention of many researchers and achieved strong performance in skeleton-based action recognition. However, most existing methods focus on changing the architecture of a single-stream network and rely on simple strategies such as average fusion to combine different forms of skeleton data. In this article, we shift the focus to the problem of insufficient interaction between different forms of features, which prevents networks from fully capturing the informative content of skeleton data. To tackle this problem, we propose a multi-stream network called the Symmetrical Enhanced Fusion Network (SEFN). The network is composed of a spatial stream, a temporal stream, and a fusion stream. The spatial stream extracts spatial features from skeleton data with a GCN. The temporal stream extracts temporal features from skeleton data with the help of the embedded Motion Sequence Calculation Algorithm. The fusion stream provides an early fusion method and extra fusion information for the whole network: it gathers multi-level features from the two feature extractors and fuses them with the proposed Multi-perspective Attention Fusion Module (MPAFM). The MPAFM enables different forms of data to enhance each other and strengthens feature extraction. Finally, we generalize the skeleton data from joint data to bone data and evaluate our network on three large-scale benchmarks: NTU-RGBD, NTU-RGBD 120, and Kinetics-Skeleton. Experimental results demonstrate that our method achieves competitive performance.
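
The abstract mentions deriving motion (temporal) and bone representations from joint coordinates. Below is a minimal sketch, assuming the convention common in skeleton-based action recognition that motion is the frame-to-frame difference of joint positions and a bone is the vector from a joint's parent to the joint itself; the paper's Motion Sequence Calculation Algorithm and bone definition may differ in detail. The function names, toy skeleton, and array shapes are illustrative assumptions, not taken from the paper.

    # Hypothetical sketch, not the authors' code.
    import numpy as np

    def joints_to_motion(joints):
        # joints: array of shape (T, V, C) - frames, joints, coordinates.
        # motion[t] = joints[t] - joints[t-1]; the first frame is zero-padded
        # so the output keeps the same shape as the input.
        motion = np.zeros_like(joints)
        motion[1:] = joints[1:] - joints[:-1]
        return motion

    def joints_to_bones(joints, parent):
        # parent: list mapping each joint index to its parent joint index
        # (the root maps to itself). A bone is the vector from the parent
        # joint to the child joint.
        return joints - joints[:, parent, :]

    # Example with random data and a toy three-joint chain (joint 0 is the root).
    T, V, C = 64, 3, 3
    joints = np.random.randn(T, V, C).astype(np.float32)
    parent = [0, 0, 1]
    motion = joints_to_motion(joints)        # shape (64, 3, 3)
    bones = joints_to_bones(joints, parent)  # shape (64, 3, 3)

In practice such motion and bone arrays would be fed to the temporal and bone streams in place of, or alongside, the raw joint coordinates.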
ISSN: 1051-8215, 1558-2205
DOI: 10.1109/TCSVT.2021.3050807