iSTFTNet: Fast and Lightweight Mel-Spectrogram Vocoder Incorporating Inverse Short-Time Fourier Transform
Main authors: | , , , |
---|---|
Format: | Article |
Language: | eng |
Subjects: | |
Summary: | In recent text-to-speech synthesis and voice conversion systems, a
mel-spectrogram is commonly applied as an intermediate representation, and the
necessity for a mel-spectrogram vocoder is increasing. A mel-spectrogram
vocoder must solve three inverse problems: recovery of the original-scale
magnitude spectrogram, phase reconstruction, and frequency-to-time conversion.
A typical convolutional mel-spectrogram vocoder solves these problems jointly
and implicitly using a convolutional neural network, including temporal
upsampling layers, when directly calculating a raw waveform. Such an approach
allows skipping redundant processes during waveform synthesis (e.g., the direct
reconstruction of high-dimensional original-scale spectrograms). By contrast,
the approach solves all problems in a black box and cannot effectively employ
the time-frequency structures existing in a mel-spectrogram. We thus propose
iSTFTNet, which replaces some output-side layers of the mel-spectrogram vocoder
with the inverse short-time Fourier transform (iSTFT) after sufficiently
reducing the frequency dimension using upsampling layers, reducing the
computational cost from black-box modeling and avoiding redundant estimations
of high-dimensional spectrograms. During our experiments, we applied our ideas
to three HiFi-GAN variants and made the models faster and more lightweight with
a reasonable speech quality. Audio samples are available at
https://www.kecl.ntt.co.jp/people/kaneko.takuhiro/projects/istftnet/. |
---|---|
DOI: | 10.48550/arxiv.2203.02395 |
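
The summary describes handing the final frequency-to-time conversion to an explicit iSTFT once the frequency dimension is small. The following is a minimal sketch of that last step only, in pure Python, and not the authors' implementation. The FFT size of 16, the hop of 4, and the helper names `hann`, `stft`, and `istft` are assumptions chosen for illustration. The magnitude and phase are taken from an analysis of a known sine wave, standing in for what a network's output layer would predict.

```python
import cmath
import math


def hann(n):
    """Periodic Hann window of length n."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * i / n) for i in range(n)]


def stft(signal, n_fft, hop):
    """One-sided STFT via a naive DFT; returns frames of n_fft // 2 + 1 complex bins."""
    window = hann(n_fft)
    frames = []
    for start in range(0, len(signal) - n_fft + 1, hop):
        seg = [signal[start + t] * window[t] for t in range(n_fft)]
        frames.append([
            sum(seg[t] * cmath.exp(-2j * math.pi * k * t / n_fft) for t in range(n_fft))
            for k in range(n_fft // 2 + 1)
        ])
    return frames


def istft(mag, phase, n_fft, hop):
    """Inverse STFT from magnitude and phase using windowed overlap-add."""
    window = hann(n_fft)
    length = n_fft + hop * (len(mag) - 1)
    out = [0.0] * length
    norm = [0.0] * length
    for m in range(len(mag)):
        # Combine magnitude and phase into complex bins.
        half = [mag[m][k] * cmath.exp(1j * phase[m][k]) for k in range(n_fft // 2 + 1)]
        # Rebuild the full spectrum of a real signal by Hermitian symmetry.
        full = half + [half[n_fft - k].conjugate() for k in range(n_fft // 2 + 1, n_fft)]
        for t in range(n_fft):
            x = sum(full[k] * cmath.exp(2j * math.pi * k * t / n_fft)
                    for k in range(n_fft)).real / n_fft
            out[m * hop + t] += x * window[t]   # synthesis window, then overlap-add
            norm[m * hop + t] += window[t] ** 2  # squared-window normalizer
    # Samples where the window sum is zero (here only sample 0) are set to 0.
    return [o / w if w > 1e-8 else 0.0 for o, w in zip(out, norm)]


# Assumed illustrative sizes: a small FFT means only 9 frequency bins per frame.
N_FFT, HOP = 16, 4
signal = [math.sin(2.0 * math.pi * 3.0 * n / 64.0) for n in range(64)]

frames = stft(signal, N_FFT, HOP)
mag = [[abs(c) for c in frame] for frame in frames]            # stand-in for predicted magnitude
phase = [[cmath.phase(c) for c in frame] for frame in frames]  # stand-in for predicted phase

recon = istft(mag, phase, N_FFT, HOP)
max_err = max(abs(a - b) for a, b in zip(signal[N_FFT:-N_FFT], recon[N_FFT:-N_FFT]))
print("frames:", len(frames), "bins per frame:", len(frames[0]), "max interior error:", max_err)
```

With these assumed sizes, each frame carries only `N_FFT // 2 + 1` bins, which illustrates why estimating a spectrogram becomes cheap once the frequency dimension has been reduced. A practical implementation would use an FFT-based routine from a numerical library instead of the naive DFT shown here.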