Towards the extraction of robust sign embeddings for low resource sign language recognition
Main authors: | , , , , |
---|---|
Format: | Article |
Language: | eng |
Subjects: | |
Online access: | Order full text |
Abstract: | Isolated Sign Language Recognition (SLR) has mostly been applied to datasets containing signs executed slowly and clearly by a limited group of signers. In real-world scenarios, however, we are met with challenging visual conditions, coarticulated signing, small datasets, and the need for signer-independent models. To tackle this difficult problem, we require a robust feature extractor to process the sign language videos. One could expect human pose estimators to be ideal candidates. However, due to a domain mismatch with their training sets and challenging poses in sign language, they lack robustness on sign language data, and image-based models often still outperform keypoint-based models. Furthermore, whereas the common practice of transfer learning with image-based models yields even higher accuracy, keypoint-based models are typically trained from scratch on every SLR dataset. These factors limit their usefulness for SLR. From the existing literature, it is also not clear which, if any, pose estimator performs best for SLR. We compare the three most popular pose estimators for SLR: OpenPose, MMPose, and MediaPipe. We show that through keypoint normalization, missing keypoint imputation, and learning a pose embedding, we can obtain significantly better results and enable transfer learning. We show that keypoint-based embeddings contain cross-lingual features: they can transfer between sign languages and achieve competitive performance even when fine-tuning only the classifier layer of an SLR model on a target sign language. We furthermore achieve better performance with fine-tuned transferred embeddings than with models trained only on the target sign language. The embeddings can also be learned in a multilingual fashion. The application of these embeddings could prove particularly useful for low resource sign languages in the future. |
---|---|
DOI: | 10.48550/arxiv.2306.17558 |
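
The abstract attributes much of the improvement to keypoint normalization and missing keypoint imputation applied before learning a pose embedding. The record does not reproduce the paper's exact procedure, so the following Python sketch only illustrates the general idea under assumed conventions: keypoints arrive as a (frames, joints, 3) array of (x, y, confidence), low-confidence points are treated as missing and linearly interpolated over time, and each pose is re-centred and re-scaled using the shoulders (COCO-style indices 5 and 6 are assumed placeholders).

```python
import numpy as np

def normalize_and_impute(keypoints, conf_threshold=0.3):
    """Illustrative preprocessing for keypoints of shape (frames, joints, 3),
    where the last axis is (x, y, confidence). Not the paper's exact method."""
    kp = keypoints[..., :2].astype(float)
    conf = keypoints[..., 2]

    # Treat low-confidence detections as missing.
    kp[conf < conf_threshold] = np.nan

    # Impute missing keypoints by linear interpolation over time, per joint and coordinate.
    frames = np.arange(kp.shape[0])
    for j in range(kp.shape[1]):
        for c in range(2):
            series = kp[:, j, c]
            valid = ~np.isnan(series)
            if valid.any():
                kp[:, j, c] = np.interp(frames, frames[valid], series[valid])
            else:
                kp[:, j, c] = 0.0  # joint never detected in this clip

    # Normalize: shoulder midpoint becomes the origin, shoulder width the unit length.
    # Assumed COCO-style indices: 5 = left shoulder, 6 = right shoulder.
    shoulders = kp[:, [5, 6], :]                                        # (frames, 2, 2)
    center = shoulders.mean(axis=1, keepdims=True)                      # (frames, 1, 2)
    width = np.linalg.norm(shoulders[:, 0] - shoulders[:, 1], axis=-1)  # (frames,)
    width = np.maximum(width, 1e-6)                                     # avoid division by zero
    return (kp - center) / width[:, None, None]
```

For OpenPose, MMPose, or MediaPipe output, the joint indices, confidence fields, and missing-value markers differ; the values above are placeholders for illustration only.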
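
The abstract also reports competitive transfer when only the classifier layer of an SLR model is fine-tuned on a target sign language. A minimal PyTorch sketch of that freezing pattern is shown below; the model structure, layer names, class counts, and checkpoint path are hypothetical stand-ins, not the paper's actual architecture.

```python
import torch
import torch.nn as nn

class KeypointSLRModel(nn.Module):
    """Hypothetical keypoint-based SLR model: a pose-embedding backbone and a classifier head."""
    def __init__(self, num_joints=27, embed_dim=256, num_classes=100):
        super().__init__()
        # Per-frame pose embedding over flattened (x, y) keypoints; stands in for the learned embedding.
        self.pose_embedding = nn.Sequential(
            nn.Linear(2 * num_joints, embed_dim),
            nn.ReLU(),
            nn.Linear(embed_dim, embed_dim),
        )
        self.classifier = nn.Linear(embed_dim, num_classes)

    def forward(self, x):                         # x: (batch, frames, 2 * num_joints)
        emb = self.pose_embedding(x)              # (batch, frames, embed_dim)
        return self.classifier(emb.mean(dim=1))   # average over time, then classify

# Cross-lingual transfer: reuse an embedding trained on a source sign language,
# freeze it, and train only the classifier on the target sign language.
model = KeypointSLRModel(num_classes=250)
# model.pose_embedding.load_state_dict(torch.load("source_embedding.pt"))  # assumed checkpoint
for param in model.pose_embedding.parameters():
    param.requires_grad = False
optimizer = torch.optim.Adam(model.classifier.parameters(), lr=1e-3)
```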