Exploring Latent Sign Language Representations with Isolated Signs, Sentences and In-the-Wild Data

Bibliographic Details
Main Authors:	Malmberg, Fredrik; Klezovich, Anna; Mesch, Johanna; Beskow, Jonas
Format:	Conference Proceeding
Language:	English
Subjects:	Pose Codebook Representation Learning sign language data VQ-VAE

Description
Summary:	Unsupervised representation learning offers a promising way of utilising large unannotated sign language resources found on the Internet. In this paper, a representation learning model, VQ-VAE, is trained to learn a codebook of motion primitives from sign language data. For training, we use isolated signs and sentences from a sign language dictionary. Three models are trained: one on isolated signs, one on sentences, and one mixed model. We test these models by comparing how well they are able to reconstruct held-out data from the dictionary, as well as an in-the-wild dataset based on sign language videos from YouTube. These data are characterized by less formal and more expressive signing than the dictionary items. Results show that the isolated sign model yields considerably higher reconstruction loss for the YouTube dataset, while the sentence model performs best on this data. Further, an analysis of codebook usage reveals that the sets of codes used by isolated signs and by sentences differ significantly. In order to further understand the different characteristics of the datasets, we carry out an analysis of the velocity profiles, which reveals that signing data in-the-wild has a much higher average velocity than dictionary signs and sentences. We believe these differences also explain the large differences in reconstruction loss observed.
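
The three quantities the summary refers to (codebook lookup, codebook usage, and average velocity) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the array shapes, the use of squared Euclidean distance for the lookup, and the definition of velocity as frame-to-frame keypoint displacement are all assumptions made for illustration.

```python
import numpy as np

def quantize(latents, codebook):
    """Assign each latent vector to its nearest codebook entry.

    latents:  (T, D) encoder outputs, one per time step (assumed shape)
    codebook: (K, D) learned code vectors (assumed shape)
    """
    # Squared Euclidean distance between every latent and every code: (T, K)
    dists = ((latents[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    indices = dists.argmin(axis=1)
    return indices, codebook[indices]

def code_usage(indices, num_codes):
    """Relative frequency with which each code index is selected."""
    counts = np.bincount(indices, minlength=num_codes)
    return counts / counts.sum()

def mean_velocity(poses, fps=1.0):
    """Average keypoint speed of a pose sequence.

    poses: (T, J, C) array of T frames, J keypoints, C coordinates
    (assumed layout); fps converts per-frame displacement to per-second.
    """
    step = np.diff(poses, axis=0)                 # (T-1, J, C) displacements
    speed = np.linalg.norm(step, axis=-1) * fps   # (T-1, J) speeds
    return float(speed.mean())
```

Comparing the `code_usage` vectors of two datasets, or their `mean_velocity` values, gives a rough picture of the kind of differences the summary describes, under the assumptions stated above.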