Cross-Model Cross-Stream Learning for Self-Supervised Human Action Recognition

Bibliographic details
Published in:	IEEE Transactions on Human-Machine Systems, 2024-12, Vol. 54 (6), pp. 743-752
Main authors:	Liu, Mengyuan; Liu, Hong; Guo, Tianyu
Format:	Article
Language:	English
Subjects:	Contrastive learning; Data mining; Feature extraction; Human activity recognition; Human-machine systems; Multistream; Representation learning; Self-supervised learning; Skeleton; Skeleton-based action recognition; Spatiotemporal phenomena

Description
Abstract:	Owing to their instance-level discriminative ability, contrastive learning methods, including MoCo and SimCLR, have been adapted from their original image representation learning task to self-supervised skeleton-based action recognition. These methods usually use multiple data streams (i.e., joint, motion, and bone) for ensemble learning, but how to construct a discriminative feature space within a single stream and how to effectively aggregate the information from multiple streams remain open problems. To this end, this article first applies a new contrastive learning method called bootstrap your own latent (BYOL) to learn from skeleton data, and then formulates SkeletonBYOL as a simple yet effective baseline for self-supervised skeleton-based action recognition. Inspired by SkeletonBYOL, this article further presents a cross-model and cross-stream (CMCS) framework, which combines cross-model adversarial learning (CMAL) and cross-stream collaborative learning (CSCL). Specifically, CMAL learns single-stream representations with a cross-model adversarial loss to obtain more discriminative features. To aggregate multistream information and let the streams interact, CSCL generates similarity pseudolabels from ensemble learning and uses them as supervision to guide feature generation for the individual streams. Extensive experiments on three datasets verify that CMAL and CSCL are complementary and that the proposed method achieves better results than state-of-the-art methods under various evaluation protocols.
ISSN:	2168-2291 2168-2305
DOI:	10.1109/THMS.2024.3467334
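
Illustrative note: the abstract names bootstrap your own latent (BYOL) as the basis of SkeletonBYOL. As a rough orientation only, the sketch below shows the generic BYOL objective as it is commonly described: a regression loss between the L2-normalized online prediction and target projection, plus an exponential-moving-average (EMA) update of the target parameters. This record does not describe how SkeletonBYOL or CMCS implement these steps, so the function names, the plain-list vector representation, and the default decay value tau=0.99 are all assumptions made for illustration.

```python
import math


def l2_normalize(v):
    """Scale a vector (list of floats) to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def byol_loss(online_prediction, target_projection):
    """Generic BYOL regression loss: 2 - 2 * cosine similarity.

    Hypothetical helper; equals the squared distance between the two
    unit-normalized vectors (0 when aligned, 4 when opposite).
    """
    p = l2_normalize(online_prediction)
    z = l2_normalize(target_projection)
    cosine = sum(a * b for a, b in zip(p, z))
    return 2.0 - 2.0 * cosine


def ema_update(target_params, online_params, tau=0.99):
    """Move target parameters toward the online ones (EMA).

    tau is an assumed illustrative decay rate, not a value from the article.
    """
    return [tau * t + (1.0 - tau) * o
            for t, o in zip(target_params, online_params)]


if __name__ == "__main__":
    # Two views of the same sample should yield a small loss once aligned.
    print(byol_loss([1.0, 0.0], [2.0, 0.0]))
    print(ema_update([1.0, 1.0], [0.0, 2.0], tau=0.5))
```

In this generic formulation only the online network receives gradients; the target network changes solely through the EMA step, which is what lets BYOL train without negative pairs.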