Spatial-Temporal Decoupling Contrastive Learning for Skeleton-based Human Action Recognition
Main authors: | , , |
---|---|
Format: | Article |
Language: | eng |
Subjects: | |
Online access: | Order full text |
Summary: | Skeleton-based action recognition is a central task in human-computer
interaction. However, most previous methods suffer from two issues: (i)
semantic ambiguity arising from the mixing of spatial and temporal information;
and (ii) failure to explicitly exploit the latent data distributions (i.e.,
the intra-class variations and inter-class relations), which leads to
sub-optimal skeleton encoders. To mitigate these issues, we propose a
spatial-temporal decoupling contrastive learning (STD-CL) framework to obtain
discriminative and semantically distinct representations from the sequences.
The framework can be incorporated into various existing skeleton encoders and
can be removed at test time. Specifically, we decouple the global features into
spatial-specific and temporal-specific features to reduce the spatial-temporal
coupling of features. Furthermore, to explicitly exploit the latent data
distributions, we apply contrastive learning to the attentive features, which
models the cross-sequence semantic relations by pulling together the features
from positive pairs and pushing apart those from negative pairs. Extensive
experiments show that STD-CL with four different skeleton encoders (HCN, 2S-AGCN,
CTR-GCN, and Hyperformer) achieves solid improvements on the NTU60, NTU120, and
NW-UCLA benchmarks. The code will be released soon. |
---|---|
DOI: | 10.48550/arxiv.2312.15144 |
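
The two ideas named in the summary, decoupling a global feature into spatial-specific and temporal-specific parts and contrasting features across sequences, can be illustrated with a minimal sketch. This is not the authors' implementation: the record does not say how the decoupling or the loss is defined, so the sketch assumes average pooling over frames and joints and a generic supervised InfoNCE-style loss. The names `decouple`, `supervised_contrastive_loss`, and the `temperature` parameter are hypothetical.

```python
import math


def decouple(features):
    """Split a global feature of shape [C][T][V] (channels, frames, joints).

    Assumption: the spatial-specific feature is the mean over frames and the
    temporal-specific feature is the mean over joints. Both are flattened.
    """
    C, T, V = len(features), len(features[0]), len(features[0][0])
    spatial = [sum(features[c][t][v] for t in range(T)) / T
               for c in range(C) for v in range(V)]
    temporal = [sum(features[c][t][v] for v in range(V)) / V
                for c in range(C) for t in range(T)]
    return spatial, temporal


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def supervised_contrastive_loss(embeddings, labels, temperature=0.1):
    """Generic supervised contrastive loss (an assumed stand-in).

    Samples sharing a label are positives and are pulled together; all other
    samples in the batch act as negatives and are pushed apart.
    """
    z = [l2_normalize(e) for e in embeddings]
    n = len(z)
    total, count = 0.0, 0
    for i in range(n):
        positives = [j for j in range(n) if j != i and labels[j] == labels[i]]
        if not positives:
            continue  # an anchor without positives contributes nothing
        logits = {j: sum(a * b for a, b in zip(z[i], z[j])) / temperature
                  for j in range(n) if j != i}
        peak = max(logits.values())  # subtract the max for numerical stability
        log_denom = peak + math.log(sum(math.exp(v - peak)
                                        for v in logits.values()))
        total += -sum(logits[j] - log_denom for j in positives) / len(positives)
        count += 1
    return total / count if count else 0.0


# Usage: one channel, two frames, two joints.
spatial, temporal = decouple([[[1.0, 2.0], [3.0, 4.0]]])
```

In a training setup along these lines, the loss would be computed separately on the spatial and temporal features and added to the encoder's classification loss. Because it only shapes the learned representation, the contrastive branch can be dropped at test time, as the summary states.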