SignDiff: Diffusion Models for American Sign Language Production
Format: Article
Language: English
Abstract: In this paper, we propose a dual-condition diffusion pre-training model named
SignDiff that generates videos of human sign language signers from skeleton poses.
SignDiff includes a novel Frame Reinforcement Network called FR-Net, similar to
dense human pose estimation work, which strengthens the correspondence between
text lexical symbols and sign language dense pose frames and reduces the
occurrence of multiple fingers in the diffusion model's output. In addition, we propose
a new method for American Sign Language Production (ASLP) that generates
ASL skeletal pose videos from text input, integrating two improved modules
and a new loss function to raise the accuracy and quality of the generated sign language
skeletal poses and to improve the model's ability to train on large-scale
data. We propose the first baseline for ASL production and report BLEU-4 scores of
17.19 and 12.85 on the How2Sign dev/test sets. We also evaluated our model
on the previous mainstream dataset PHOENIX14T, where our method achieved SOTA
results. In addition, our image quality exceeds all previous results by 10
percentage points in terms of SSIM.
DOI: 10.48550/arxiv.2308.16082
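The abstract describes a reverse-diffusion model whose denoiser is conditioned on two signals at once (text symbols and dense pose frames). The sketch below is not SignDiff itself: the network, embeddings, and fusion rule are hypothetical stand-ins, and it only illustrates the generic DDPM-style reverse step with a dual condition, using NumPy.

```python
import numpy as np

rng = np.random.default_rng(0)

def denoiser(x_t, text_emb, pose_emb, t):
    """Toy stand-in for a dual-condition denoising network.

    The real model is learned; here a fixed linear map of the two
    conditions (hypothetical shapes and fusion rule) fakes a noise estimate.
    """
    cond = 0.5 * text_emb + 0.5 * pose_emb           # fuse both conditions
    return 0.1 * x_t + 0.05 * cond * (t / 10.0)      # placeholder noise prediction

def ddpm_step(x_t, t, text_emb, pose_emb, betas):
    """One standard DDPM reverse-diffusion update, conditioned on text and pose."""
    beta_t = betas[t]
    alpha_t = 1.0 - beta_t
    alpha_bar = np.prod(1.0 - betas[: t + 1])
    eps = denoiser(x_t, text_emb, pose_emb, t)
    mean = (x_t - beta_t / np.sqrt(1.0 - alpha_bar) * eps) / np.sqrt(alpha_t)
    if t == 0:
        return mean                                   # final step: no added noise
    return mean + np.sqrt(beta_t) * rng.standard_normal(x_t.shape)

# Tiny demo: denoise a 4-dimensional "frame" over 10 steps.
betas = np.linspace(1e-4, 0.02, 10)
x = rng.standard_normal(4)            # start from pure noise
text_emb = np.ones(4)                 # hypothetical text embedding
pose_emb = 0.5 * np.ones(4)           # hypothetical dense-pose embedding
for t in reversed(range(10)):
    x = ddpm_step(x, t, text_emb, pose_emb, betas)
print(x.shape)  # (4,)
```

The point of the sketch is only the control flow: every reverse step sees both conditioning signals, which is what lets the model tie generated frames to both the text and the pose sequence.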