STSD: spatial–temporal semantic decomposition transformer for skeleton-based action recognition

Bibliographic Details
Published in: Multimedia Systems, 2024-02, Vol. 30 (1), Article 43
Authors: Cui, Hu; Hayama, Tessai
Format: Article
Language: English
Online access: Full text
Description
Abstract: Skeleton-based human action recognition has attracted widespread interest because skeleton data are highly robust to changes in lighting, camera viewpoint, and complex backgrounds. Recent studies have proposed transformer-based methods for encoding the latent information underlying 3D skeleton sequences. These methods model the relationships between joints without any predefined graph structure, relying on the self-attention mechanism, and have proven effective. However, they ignore two challenging issues: the use of human-body-related semantics and of dynamic semantic information. In this work, we propose a novel spatial–temporal semantic decomposition transformer network (STSD-TR) that models dependencies between joints using body-part semantics and sub-action semantics. In STSD-TR, a body-parts semantic decomposition module (BPSD) extracts body-part semantic information from the 3D coordinates of the joints; a temporal-local spatial–temporal attention module (TL-STA) then captures the relationships of joints across several consecutive frames, which can be understood as local sub-action semantics. Finally, a global spatial–temporal module (GST) aggregates the temporal-local features into a global spatial–temporal representation. Moreover, we design a BodyParts-Mix strategy that mixes body parts from two people in a unique manner and further boosts performance. Compared with state-of-the-art methods, our method achieves competitive performance on two large-scale datasets.
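
The abstract's central mechanism, TL-STA, computes attention over all joints within a short window of consecutive frames rather than over the full sequence. Below is a minimal sketch of that windowed idea in PyTorch; the window size, feature dimension, and module structure are illustrative assumptions, not the paper's exact design.

```python
# Minimal sketch of temporal-local spatial-temporal attention: joints in a
# short window of consecutive frames attend to one another, modeling a local
# "sub-action". Window size and dimensions are assumed for illustration.
import torch
import torch.nn as nn

class TemporalLocalAttention(nn.Module):
    def __init__(self, dim: int, num_heads: int = 4, window: int = 4):
        super().__init__()
        self.window = window  # frames per local window (assumed value)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, joints, dim); frames assumed divisible by window
        b, t, v, c = x.shape
        w = self.window
        # Fold non-overlapping windows of w frames into the batch axis and
        # flatten (w frames x v joints) into one token sequence, so attention
        # connects every joint to every joint across the window.
        tokens = x.reshape(b, t // w, w * v, c).reshape(b * (t // w), w * v, c)
        out, _ = self.attn(tokens, tokens, tokens)
        return out.reshape(b, t // w, w, v, c).reshape(b, t, v, c)

# Example: 2 clips, 16 frames, 25 joints, 64-dim joint features.
x = torch.randn(2, 16, 25, 64)
y = TemporalLocalAttention(dim=64)(x)  # -> shape (2, 16, 25, 64)
```

In the paper's pipeline, a global module (GST) would then aggregate these per-window features into a sequence-level representation; the sketch covers only the local stage.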
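The BodyParts-Mix strategy is described only at a high level; a plausible reading is a CutMix-style augmentation that splices whole body parts from a second person's skeleton into the first. The sketch below assumes the 25-joint NTU RGB+D layout, a uniform per-part coin flip, and proportional label mixing, all of which are assumptions rather than the paper's exact mixing rule.

```python
# Sketch of a BodyParts-Mix-style augmentation: replace the joints of
# randomly chosen body parts in one skeleton sequence with the same parts
# from another person's sequence. Part grouping and mixing rule are assumed.
import numpy as np

# Assumed NTU RGB+D 25-joint indices grouped into five body parts.
BODY_PARTS = {
    "torso":     [0, 1, 2, 3, 20],
    "left_arm":  [4, 5, 6, 7, 21, 22],
    "right_arm": [8, 9, 10, 11, 23, 24],
    "left_leg":  [12, 13, 14, 15],
    "right_leg": [16, 17, 18, 19],
}

def body_parts_mix(seq_a, seq_b, rng=np.random):
    """Mix two skeleton sequences of shape (frames, 25, 3) part by part.

    Returns the mixed sequence and the fraction of joints taken from seq_b,
    which could be used to mix the two labels CutMix-style (an assumption).
    """
    mixed = seq_a.copy()
    swapped = 0
    for joints in BODY_PARTS.values():
        if rng.rand() < 0.5:  # each part independently sourced from seq_b
            mixed[:, joints, :] = seq_b[:, joints, :]
            swapped += len(joints)
    return mixed, swapped / 25.0
```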
ISSN: 0942-4962 (print), 1432-1882 (electronic)
DOI: 10.1007/s00530-023-01251-2