Wav2Pix: Speech-conditioned Face Generation using Generative Adversarial Networks
Main authors: | , , , , , , , , , |
---|---|
Format: | Article |
Language: | eng |
Subjects: | |
Online access: | Order full text |
Summary: | Speech is a rich biometric signal that contains information about the
identity, gender and emotional state of the speaker. In this work, we explore
its potential to generate face images of a speaker by conditioning a Generative
Adversarial Network (GAN) with raw speech input. We propose a deep neural
network that is trained from scratch in an end-to-end fashion, generating a
face directly from the raw speech waveform without any additional identity
information (e.g., a reference image or one-hot encoding). Our model is trained in
a self-supervised manner by exploiting the audio and visual signals naturally
aligned in videos. To train from video data, we present a
novel dataset collected for this work, with high-quality videos of YouTubers
with notable expressiveness in both the speech and visual signals. |
---|---|
DOI: | 10.48550/arxiv.1903.10195 |