An asymmetrical-structure auto-encoder for unsupervised representation learning of skeleton sequences

Bibliographic details
Published in: Computer Vision and Image Understanding, 2022-09, Vol. 222, p. 103491, Article 103491
Main authors: Zhou, Jiaxin; Komuro, Takashi
Format: Article
Language: English
Description
Abstract: In this paper, we propose a novel framework for unsupervised representation learning that uses a structure-asymmetrical auto-encoder, in which a 2D-CNN-based encoder learns separable spatiotemporal representations in a low-dimensional feature space under the supervision of salient skeleton motion cues. This study addresses the problem of learning action representations of skeleton sequences. Through the proposed unsupervised training, the network captures not only correlations between adjacent joints but also long-term motion dependencies, so that similar movements gather around the same cluster while different movements gather around distinct clusters. Our method is unsupervised and does not rely on annotations to associate skeleton sequences with actions. Experimental results showed the effectiveness of the proposed representation learning and improvements over skeleton-based generative learning methods. When the proposed network was fine-tuned with partially labeled data, our results still outperformed some fully supervised methods.
• This paper proposes an asymmetrical-structure auto-encoder network that uses a 2D-CNN as the encoder and an RNN as the decoder to learn action features of skeleton sequences.
• Salient skeleton motion cues are proposed to represent motion features and serve as the label for network training.
• Experimental results on NTU RGB+D 60 show that the proposed method outperformed prior unsupervised state-of-the-art methods.
ISSN: 1077-3142, 1090-235X
DOI: 10.1016/j.cviu.2022.103491
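
Illustrative note: the abstract refers to "salient skeleton motion cues" as the training target but does not define them. The sketch below is only one plausible reading, offered as an assumption and not as the authors' method: it treats a skeleton sequence as a list of frames, each holding per-joint [x, y, z] coordinates, computes frame-to-frame joint displacements, and ranks joints by accumulated squared displacement. The function names (temporal_differences, joint_motion_energy, most_salient_joints) are hypothetical.

```python
from typing import List

Frame = List[List[float]]  # one frame: J joints, each given as [x, y, z]


def temporal_differences(seq: List[Frame]) -> List[Frame]:
    """Per-joint displacement between consecutive frames (yields T-1 frames)."""
    return [
        [[c1 - c0 for c0, c1 in zip(j0, j1)] for j0, j1 in zip(f0, f1)]
        for f0, f1 in zip(seq, seq[1:])
    ]


def joint_motion_energy(seq: List[Frame]) -> List[float]:
    """Sum of squared displacements per joint over the whole sequence.

    Assumption: "salience" is approximated by how much a joint moves.
    The paper may define its motion cues differently.
    """
    energy = [0.0] * len(seq[0])
    for frame in temporal_differences(seq):
        for j, disp in enumerate(frame):
            energy[j] += sum(c * c for c in disp)
    return energy


def most_salient_joints(seq: List[Frame], k: int) -> List[int]:
    """Indices of the k joints with the largest motion energy, largest first."""
    energy = joint_motion_energy(seq)
    return sorted(range(len(energy)), key=lambda j: energy[j], reverse=True)[:k]
```

For a toy sequence of three frames in which joint 0 is static, joint 1 moves one unit along x per frame, and joint 2 moves two units along y per frame, the energies are 0, 2, and 8, so joints 2 and 1 would be ranked as the most salient under this assumed definition.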