A Mel Spectrogram Enhancement Paradigm Based on CWT in Speech Synthesis
Main authors: | , , |
---|---|
Format: | Article |
Language: | eng |
Subjects: | |
Online access: | Order full text |
Summary: | Acoustic features play an important role in improving the quality of the
synthesised speech. Currently, the Mel spectrogram is a widely employed
acoustic feature in most acoustic models. However, due to the fine-grained loss
caused by its Fourier transform process, the clarity of speech synthesised by
Mel spectrogram is compromised in mutant signals. In order to obtain a more
detailed Mel spectrogram, we propose a Mel spectrogram enhancement paradigm
based on the continuous wavelet transform (CWT). This paradigm introduces an
additional task: a more detailed wavelet spectrogram, which, like the
post-processing network, takes as input the Mel spectrogram output by the
decoder. We choose Tacotron2 and FastSpeech2 for experimental validation in
order to test autoregressive (AR) and non-autoregressive (NAR) speech systems,
respectively. The experimental results demonstrate that the speech synthesised
using the model with the Mel spectrogram enhancement paradigm exhibits higher
MOS, with improvements of 0.14 and 0.09 over the baseline models,
respectively. These findings provide some validation for the universality of
the enhancement paradigm, as they demonstrate the success of the paradigm in
different architectures. |
---|---|
DOI: | 10.48550/arxiv.2406.12164 |
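
For readers unfamiliar with the continuous wavelet transform named in the summary, the following is a minimal sketch of a Morlet-wavelet scalogram in plain Python. It is not the authors' implementation and does not reproduce their enhancement paradigm or any Tacotron2/FastSpeech2 component; the function names (`morlet`, `cwt`), the choice of the Morlet wavelet, the parameter `w0`, the test signal, and the sampling rate are all assumptions made for illustration. The test signal switches abruptly from 50 Hz to 200 Hz, the kind of sudden change for which a time-localised analysis can be informative.

```python
import cmath
import math


def morlet(t, w0=6.0):
    """Complex Morlet mother wavelet (small admissibility correction omitted)."""
    return math.pi ** -0.25 * cmath.exp(1j * w0 * t) * math.exp(-0.5 * t * t)


def cwt(signal, scales, dt=1.0, w0=6.0):
    """Return the CWT magnitude as a list of rows, one row per scale.

    Direct evaluation of
        W(s, n) = sum_k x[k] * conj(psi((k - n) * dt / s)) * dt / sqrt(s)
    with samples outside the signal treated as zero.
    """
    n_samples = len(signal)
    rows = []
    for s in scales:
        # Truncate the wavelet to +-4 scale widths, where the Gaussian
        # envelope has decayed to a negligible value.
        half = int(math.ceil(4.0 * s / dt))
        offsets = range(-half, half + 1)
        kernel = [morlet(m * dt / s, w0).conjugate() * dt / math.sqrt(s)
                  for m in offsets]
        row = []
        for n in range(n_samples):
            acc = 0j
            for weight, m in zip(kernel, offsets):
                k = n + m
                if 0 <= k < n_samples:
                    acc += signal[k] * weight
            row.append(abs(acc))
        rows.append(row)
    return rows


# Illustrative input (an assumption, not data from the paper):
# a tone that jumps from 50 Hz to 200 Hz halfway through.
fs = 1000.0            # sampling rate in Hz
dt = 1.0 / fs
signal = [math.sin(2 * math.pi * (50.0 if k < 200 else 200.0) * k * dt)
          for k in range(400)]

# For the Morlet wavelet the centre frequency is roughly w0 / (2 * pi * s),
# so these two scales are tuned to about 50 Hz and 200 Hz.
w0 = 6.0
scales = [w0 / (2 * math.pi * f) for f in (50.0, 200.0)]

scalogram = cwt(signal, scales, dt=dt, w0=w0)
```

In this sketch, `scalogram[0]` (the scale tuned near 50 Hz) is large in the first half of the signal and `scalogram[1]` (tuned near 200 Hz) is large in the second half. A practical system would more likely use an FFT-based or library implementation over many scales; the direct sum is used here only to keep the definition visible.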