SpecDiff-GAN: A Spectrally-Shaped Noise Diffusion GAN for Speech and Music Synthesis
IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), April 2024, Seoul, South Korea
Saved in:
Main authors: | |
Format: | Article |
Language: | eng |
Subjects: | |
Online access: | Order full text |
Abstract: | Generative adversarial network (GAN) models can synthesize high-quality audio signals while ensuring fast sample generation. However, they are difficult to train and are prone to several issues, including mode collapse and divergence. In this paper, we introduce SpecDiff-GAN, a neural vocoder based on HiFi-GAN, which was initially devised for speech synthesis from mel spectrograms. In our model, training stability is enhanced by means of a forward diffusion process that injects noise from a Gaussian distribution into both real and fake samples before inputting them to the discriminator. We further improve the model by exploiting a spectrally-shaped noise distribution with the aim of making the discriminator's task more challenging. We then show the merits of our proposed model for speech and music synthesis on several datasets. Our experiments confirm that our model compares favorably with several baselines in audio quality and efficiency. |
DOI: | 10.48550/arxiv.2402.01753 |
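
The corruption step described in the abstract can be sketched roughly as follows. This is a minimal illustration under assumptions, not the authors' implementation: the STFT-based shaping, the magnitude envelope `mag_env`, the noise level `alpha_bar`, and all function names are hypothetical placeholders for the paper's actual spectrally-shaped noise diffusion.

```python
import torch

def spectrally_shaped_noise(x, mag_env, n_fft=1024, hop=256):
    """Sample white Gaussian noise and shape its spectrum with a
    per-bin magnitude envelope `mag_env` (shape: [n_fft // 2 + 1, 1])
    via STFT filtering. `x` is a (batch, time) waveform tensor."""
    win = torch.hann_window(n_fft, device=x.device)
    noise = torch.randn_like(x)
    spec = torch.stft(noise, n_fft, hop_length=hop, window=win,
                      return_complex=True)
    # Scale each frequency bin by the (assumed) target envelope,
    # then return to the time domain at the original length.
    return torch.istft(spec * mag_env, n_fft, hop_length=hop,
                       window=win, length=x.shape[-1])

def diffuse(x, alpha_bar, mag_env):
    """Forward-diffusion-style corruption: mix the clean waveform with
    spectrally shaped noise at level `alpha_bar` (0 = pure noise)."""
    eps = spectrally_shaped_noise(x, mag_env)
    return alpha_bar ** 0.5 * x + (1.0 - alpha_bar) ** 0.5 * eps

# Per the abstract, real and fake samples are corrupted the same way
# before the discriminator sees them (names here are illustrative):
# d_real = discriminator(diffuse(x_real, alpha_bar_t, mag_env))
# d_fake = discriminator(diffuse(x_fake.detach(), alpha_bar_t, mag_env))
```

Corrupting both inputs identically keeps the discriminator's decision boundary from collapsing onto easy artifacts, and shaping the noise spectrum (rather than using white noise) is what the abstract credits with making the discriminator's task harder.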