Speech2Lip: High-fidelity Speech to Lip Generation by Learning from a Short Video
Format: Article
Language: English
Online access: Order full text
Abstract: Synthesizing realistic videos according to a given speech is still an open
challenge. Previous works have been plagued by issues such as inaccurate lip
shape generation and poor image quality. The key reason is that only motions
and appearances on limited facial areas (e.g., lip area) are mainly driven by
the input speech. Therefore, directly learning a mapping function from speech
to the entire head image is prone to ambiguity, particularly when using a short
video for training. We thus propose a decomposition-synthesis-composition
framework named Speech to Lip (Speech2Lip) that disentangles speech-sensitive
and speech-insensitive motion/appearance to facilitate effective learning from
limited training data, resulting in the generation of natural-looking videos.
First, given a fixed head pose (i.e., canonical space), we present a
speech-driven implicit model for lip image generation which concentrates on
learning speech-sensitive motion and appearance. Next, to model the major
speech-insensitive motion (i.e., head movement), we introduce a geometry-aware
mutual explicit mapping (GAMEM) module that establishes geometric mappings
between different head poses. This allows us to paste lip images generated in
the canonical space onto head images with arbitrary poses and synthesize
talking videos with natural head movements. In addition, a Blend-Net and a
contrastive sync loss are introduced to enhance the overall synthesis
performance. Quantitative and qualitative results on three benchmarks
demonstrate that our model can be trained on a video of just a few minutes in
length and achieve state-of-the-art performance in both visual quality and
speech-visual synchronization. Code: https://github.com/CVMI-Lab/Speech2Lip.
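The abstract describes a three-stage decomposition-synthesis-composition flow: generate speech-sensitive lip images in a fixed canonical space, map them onto the head image at its actual pose via the GAMEM module, and blend the result with a Blend-Net. The sketch below illustrates that flow only at the interface level; the class name `Speech2LipPipeline` and the module signatures are assumptions for illustration, not the released implementation.

```python
# Sketch of the decomposition-synthesis-composition pipeline described in the
# abstract. Module names (lip_generator, gamem_warper, blend_net) and their
# interfaces are illustrative assumptions, not the authors' code.
import torch.nn as nn


class Speech2LipPipeline(nn.Module):
    """Compose a talking-head frame from speech features and a target-pose head image."""

    def __init__(self, lip_generator: nn.Module, gamem_warper: nn.Module, blend_net: nn.Module):
        super().__init__()
        self.lip_generator = lip_generator  # speech-driven implicit model (canonical space)
        self.gamem_warper = gamem_warper    # geometry-aware mutual explicit mapping (GAMEM)
        self.blend_net = blend_net          # blends the warped lip region into the head image

    def forward(self, speech_feat, head_img, head_pose):
        # 1) Synthesis: generate the speech-sensitive lip image in a fixed canonical pose.
        canonical_lip = self.lip_generator(speech_feat)                # (B, 3, H, W)

        # 2) Composition: map canonical-space lip pixels onto the head image at an
        #    arbitrary pose via the geometric mapping, then blend for a seamless frame.
        warped_lip, mask = self.gamem_warper(canonical_lip, head_pose)
        frame = self.blend_net(warped_lip, head_img, mask)
        return frame
```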
DOI: 10.48550/arxiv.2309.04814
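The contrastive sync loss is not specified in the abstract. A common way to realize such a loss is an InfoNCE-style objective that pulls temporally aligned audio and lip-region embeddings together and pushes mismatched batch pairs apart. The following is a minimal sketch under that assumption; the embedding shapes and temperature are illustrative, and the paper's exact formulation may differ.

```python
# Generic InfoNCE-style audio-visual contrastive loss, sketched as one plausible
# form of a "contrastive sync loss". Shapes and temperature are assumptions.
import torch
import torch.nn.functional as F


def contrastive_sync_loss(audio_emb: torch.Tensor,
                          lip_emb: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    """audio_emb, lip_emb: (B, D) embeddings of temporally aligned audio/lip windows.

    In-sync (diagonal) pairs act as positives; all other pairings in the batch
    act as negatives, encouraging synced audio-lip pairs to have high similarity.
    """
    audio_emb = F.normalize(audio_emb, dim=-1)
    lip_emb = F.normalize(lip_emb, dim=-1)

    logits = audio_emb @ lip_emb.t() / temperature     # (B, B) cosine similarities
    targets = torch.arange(audio_emb.size(0), device=audio_emb.device)

    # Symmetric cross-entropy: audio-to-lip and lip-to-audio retrieval.
    loss_a2l = F.cross_entropy(logits, targets)
    loss_l2a = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_a2l + loss_l2a)
```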