ObamaNet: Photo-realistic lip-sync from text
Saved in:
| Main authors: | , , , , |
|---|---|
| Format: | Article |
| Language: | English |
| Subjects: | |
| Online access: | Order full text |
| Summary: | We present ObamaNet, the first architecture that generates both audio and synchronized photo-realistic lip-sync videos from any new text. Contrary to other published lip-sync approaches, ours is composed only of fully trainable neural modules and does not rely on any traditional computer-graphics methods. More precisely, we use three main modules: a text-to-speech network based on Char2Wav, a time-delayed LSTM to generate mouth keypoints synced to the audio, and a network based on Pix2Pix to generate the video frames conditioned on the keypoints. |
| DOI: | 10.48550/arxiv.1801.01442 |
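
The summary describes three neural modules; the middle one, a time-delayed LSTM mapping audio features to mouth keypoints, is the glue between the Char2Wav speech synthesizer and the Pix2Pix frame generator. Below is a minimal PyTorch sketch of how such a module could look. It is not the authors' implementation: the feature size, landmark count, delay length, and hidden width (`AUDIO_DIM`, `N_KEYPOINTS`, `DELAY`, `hidden`) are illustrative assumptions.

```python
# Sketch (not the paper's code) of a time-delayed LSTM that predicts mouth
# keypoints from per-frame audio features. All dimensions are assumptions.
import torch
import torch.nn as nn

AUDIO_DIM = 26      # assumed per-frame audio feature size (e.g. MFCCs)
N_KEYPOINTS = 20    # assumed number of mouth landmarks (x, y pairs -> 40 outputs)
DELAY = 5           # look-ahead in frames: keypoints at t see audio up to t + DELAY

class TimeDelayedLSTM(nn.Module):
    """Predicts mouth keypoints from audio with a fixed look-ahead delay."""
    def __init__(self, audio_dim=AUDIO_DIM, hidden=128, n_keypoints=N_KEYPOINTS):
        super().__init__()
        self.lstm = nn.LSTM(audio_dim, hidden, batch_first=True)
        self.proj = nn.Linear(hidden, 2 * n_keypoints)  # (x, y) per landmark

    def forward(self, audio_feats):
        # audio_feats: (batch, T, audio_dim)
        h, _ = self.lstm(audio_feats)            # (batch, T, hidden)
        keypoints = self.proj(h)                 # (batch, T, 2 * n_keypoints)
        # Shift the output sequence by DELAY frames, so the keypoints emitted
        # for video frame t were computed after seeing audio up to t + DELAY.
        return keypoints[:, DELAY:, :]

# Usage: 100 audio frames in, 95 keypoint frames out (DELAY frames consumed).
model = TimeDelayedLSTM()
audio = torch.randn(1, 100, AUDIO_DIM)
print(model(audio).shape)  # torch.Size([1, 95, 40])
```

The delay lets the prediction for frame t condition on a short window of future audio, which helps the mouth anticipate coarticulation; shifting the output sequence by `DELAY` steps is one simple way to realize that look-ahead.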