Skeleton2vec: A Self-supervised Learning Framework with Contextualized Target Representations for Skeleton Sequence
Format: | Article |
Language: | English |
Abstract: | Self-supervised pre-training paradigms have been extensively explored in the
field of skeleton-based action recognition. In particular, methods based on
masked prediction have pushed the performance of pre-training to a new height.
However, these methods take low-level features, such as raw joint coordinates
or temporal motion, as prediction targets for the masked regions, which is
suboptimal. In this paper, we show that using high-level contextualized
features as prediction targets can achieve superior performance. Specifically,
we propose Skeleton2vec, a simple and efficient self-supervised 3D action
representation learning framework, which utilizes a transformer-based teacher
encoder taking unmasked training samples as input to create latent
contextualized representations as prediction targets. Benefiting from the
self-attention mechanism, the latent representations generated by the teacher
encoder can incorporate the global context of the entire training sample,
leading to a richer training task. Additionally, considering the high temporal
correlations in skeleton sequences, we propose a motion-aware tube masking
strategy which divides the skeleton sequence into several tubes and performs
persistent masking within each tube based on motion priors, thus forcing the
model to build long-range spatio-temporal connections and focus on
regions with richer action semantics. Extensive experiments on NTU-60, NTU-120, and
PKU-MMD datasets demonstrate that our proposed Skeleton2vec outperforms
previous methods and achieves state-of-the-art results. |
DOI: | 10.48550/arxiv.2401.00921 |
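The abstract describes a teacher-student scheme: a transformer teacher encodes the unmasked sequence to produce latent, contextualized targets, and a student that sees the masked sequence regresses those targets at the masked positions, with the teacher tracking the student (data2vec-style). The snippet below is a minimal PyTorch sketch of that idea only; the module name `SkeletonEncoder`, the EMA decay, the mask ratio, and the plain MSE objective are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch of masked prediction with contextualized targets:
# a frozen EMA teacher encodes the full sequence, the student encodes the
# masked sequence and regresses the teacher's latents at masked positions.
# All names and hyperparameters are illustrative assumptions.
import copy

import torch
import torch.nn as nn
import torch.nn.functional as F


class SkeletonEncoder(nn.Module):
    """Toy transformer encoder over skeleton tokens (e.g. one token per joint per frame)."""

    def __init__(self, in_dim=3, dim=256, depth=4, heads=4):
        super().__init__()
        self.embed = nn.Linear(in_dim, dim)
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, depth)

    def forward(self, x):                      # x: (B, N, in_dim)
        return self.blocks(self.embed(x))      # (B, N, dim)


@torch.no_grad()
def ema_update(teacher, student, decay=0.999):
    """Track the student with an exponential moving average; the teacher itself is never trained."""
    for t, s in zip(teacher.parameters(), student.parameters()):
        t.mul_(decay).add_(s, alpha=1.0 - decay)


def training_step(student, teacher, x, mask, mask_token):
    """x: (B, N, C) skeleton tokens, mask: (B, N) bool (True = masked for the student)."""
    # The teacher sees the unmasked sequence, so its self-attention mixes the
    # global context of the whole sample into every latent target.
    with torch.no_grad():
        targets = teacher(x)

    # The student sees the corrupted sequence: masked tokens are replaced.
    x_masked = torch.where(mask.unsqueeze(-1), mask_token, x)
    preds = student(x_masked)

    # Regress the teacher's contextualized latents at the masked positions.
    return F.mse_loss(preds[mask], targets[mask])


if __name__ == "__main__":
    student = SkeletonEncoder()
    teacher = copy.deepcopy(student).requires_grad_(False)

    x = torch.randn(2, 16 * 25, 3)             # 2 clips, 16 frames x 25 joints, xyz
    mask = torch.rand(2, 16 * 25) < 0.6        # toy random mask for illustration
    mask_token = torch.zeros(1, 1, 3)          # learnable in practice

    loss = training_step(student, teacher, x, mask, mask_token)
    loss.backward()
    ema_update(teacher, student)
```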
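The abstract also mentions a motion-aware tube masking strategy: the sequence is divided into temporal tubes, and within each tube the same joints stay masked across all frames, with the choice of joints biased toward regions of high motion. The sketch below is one plausible reading of that strategy under stated assumptions; the tube length, mask ratio, and softmax temperature are chosen for illustration and are not the paper's exact procedure.

```python
# Sketch of motion-aware tube masking: split frames into temporal tubes,
# estimate per-joint motion inside each tube, and persistently mask the same
# joints across every frame of that tube, favoring high-motion joints.
import torch


def motion_aware_tube_mask(seq, tube_len=8, mask_ratio=0.5, temperature=1.0):
    """seq: (T, J, C) joint coordinates. Returns a (T, J) boolean mask."""
    T, J, _ = seq.shape
    mask = torch.zeros(T, J, dtype=torch.bool)

    # Per-frame motion magnitude: how far each joint moves between frames.
    motion = (seq[1:] - seq[:-1]).norm(dim=-1)        # (T-1, J)
    motion = torch.cat([motion[:1], motion], dim=0)   # pad to (T, J)

    n_mask = int(mask_ratio * J)
    for start in range(0, T, tube_len):
        end = min(start + tube_len, T)
        # Motion prior for this tube: joints that move more are more likely
        # to be masked, steering the model toward action-rich regions.
        prior = motion[start:end].mean(dim=0) / temperature   # (J,)
        probs = torch.softmax(prior, dim=0)
        joints = torch.multinomial(probs, n_mask, replacement=False)
        # Persistent masking: the same joints stay masked for the whole tube,
        # forcing the model to rely on long-range spatio-temporal context.
        mask[start:end, joints] = True
    return mask


if __name__ == "__main__":
    clip = torch.randn(64, 25, 3)                 # 64 frames, 25 joints, xyz
    tube_mask = motion_aware_tube_mask(clip, tube_len=8, mask_ratio=0.5)
    print(tube_mask.float().mean())               # close to the requested ratio
```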