Tohjm-Trained Multiscale Spatial Temporal Graph Convolutional Neural Network for Semi-Supervised Skeletal Action Recognition

Bibliographic Details
Published in: Electronics (Basel) 2022-11, Vol. 11 (21), p. 3498
Authors: Gou, Ruru; Yang, Wenzhu; Luo, Zifei; Yuan, Yunfeng; Li, Andong
Format: Article
Language: English
Online access: Full text
Abstract: In recent years, spatial-temporal graph convolutional networks (ST-GCNs) have played an increasingly important role in skeleton-based human action recognition. However, most ST-GCN-based approaches still have three major limitations: (1) They use only a single joint scale to extract action features, or process joint and skeletal information separately, so action features cannot be extracted dynamically through the mutual directivity between scales. (2) They treat the contributions of all joints equally during training, neglecting the fact that some joints whose loss is difficult to reduce are critical joints in network training. (3) They rely heavily on large amounts of labeled data, which are costly to obtain. To address these problems, we propose a Tohjm-trained multiscale spatial-temporal graph convolutional neural network for semi-supervised action recognition, which consists of three parts: an encoder, a decoder, and a classifier. The encoder's core is a correlated joint–bone–body-part fusion spatial-temporal graph convolutional network that allows the network to learn more stable action features across coarse and fine scales. The decoder uses a self-supervised training method with a motion prediction head, which enables the network to extract action features from unlabeled data and thus achieve semi-supervised learning. The network is also capable of fully supervised learning using the encoder, decoder, and classifier together. Our proposed time-level online hard joint mining (Tohjm) strategy is applied during decoder training, which allows the network to focus on hard-to-train joints and improves overall performance. Experimental results on the NTU RGB+D and Kinetics-Skeleton datasets show that the improved model achieves good performance for semi-supervised action recognition and is also applicable to the fully supervised setting.
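
The record gives no implementation details, but the time-level online hard joint mining strategy described above resembles per-joint online hard example mining. The following minimal PyTorch sketch shows one plausible form of such a loss; the function name tohjm_loss, the tensor layout (batch, channels, frames, joints), and the hyperparameter k are illustrative assumptions, not details taken from the paper.

```python
import torch

def tohjm_loss(pred: torch.Tensor, target: torch.Tensor, k: int = 5) -> torch.Tensor:
    """Average motion-prediction error over the k hardest joints per frame.

    pred, target: (N, C, T, V) tensors of predicted / ground-truth joint
    coordinates (batch, coordinate channels, frames, joints).
    k: number of hard joints kept per frame (an assumed hyperparameter).
    """
    # Per-joint, per-frame squared error, summed over coordinate channels:
    # result has shape (N, T, V).
    err = ((pred - target) ** 2).sum(dim=1)
    # Keep only the k joints with the largest error in each frame, so the
    # gradient concentrates on joints whose loss is hardest to reduce.
    hard_err, _ = err.topk(k, dim=-1)
    return hard_err.mean()
```

In the semi-supervised setup sketched in the abstract, a loss of this shape would train the decoder's motion prediction head on unlabeled sequences, while a standard cross-entropy loss on the classifier output would cover the labeled portion.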
ISSN: 2079-9292
DOI: 10.3390/electronics11213498