A multi-task learning speech synthesis optimization method based on CWT: a case study of Tacotron2

Bibliographic details
Published in: EURASIP Journal on Advances in Signal Processing, 2024-12, Vol. 2024 (1), p. 4-14, Article 4
Main authors: Hu, Guoqiang; Ruan, Zhuofan; Guo, Wenqiu; Quan, Yujuan
Format: Article
Language: English
Description
Abstract: Text-to-speech (TTS) synthesis plays an essential role in human-computer interaction. Most current TTS acoustic models use only the Mel spectrum as the intermediate feature for converting text to speech. However, Mel spectrograms can be ambiguous in some respects, because the Fourier transform used to compute them has limited ability to capture abrupt signal changes. To improve the clarity of synthesized speech, this study proposes a multi-task learning optimization method and evaluates it on the Tacotron2 speech synthesis system. The method adds an auxiliary task based on wavelet spectrograms. The continuous wavelet transform (CWT) is widely used in applications such as speech enhancement and speech recognition, primarily because it can adapt its time-frequency resolution and performs well on non-stationary signals. Through theoretical and experimental analysis, the study shows that introducing the wavelet spectrogram as an auxiliary task improves the clarity of Tacotron2 synthesized speech: a feature extraction network is added, and wavelet-spectrogram features are extracted from the Mel spectrum output generated by the decoder. In the experiments, speech synthesized by the multi-task model achieves a Mean Opinion Score 0.17 higher than that of the baseline model. Furthermore, after analyzing why the CWT-based multi-task learning method succeeds in Tacotron2 and why multi-task learning is effective, the study conjectures that the proposed method could also improve other acoustic models.
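The two ideas in the abstract, the scale-dependent time resolution of the CWT and an auxiliary wavelet-spectrogram task, can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation. The Morlet wavelet, the parameter w0, the L1 loss form, and the names cwt_magnitude, multitask_loss, and aux_weight are all assumptions made for illustration; the abstract does not specify any of them.

```python
import cmath
import math


def morlet(t, w0=6.0):
    """Complex Morlet mother wavelet (normalisation constant omitted; assumed choice)."""
    return cmath.exp(1j * w0 * t) * math.exp(-0.5 * t * t)


def cwt_magnitude(signal, scales, w0=6.0):
    """Naive CWT: W(s, b) = (1/sqrt(s)) * sum_n x[n] * conj(psi((n - b) / s)).

    Returns a list of rows, one per scale, holding |W(s, b)| for every shift b.
    """
    n = len(signal)
    rows = []
    for s in scales:
        half = int(math.ceil(4 * s))  # Gaussian envelope is negligible beyond 4 scales
        offsets = range(-half, half + 1)
        kernel = [morlet(k / s, w0).conjugate() / math.sqrt(s) for k in offsets]
        row = []
        for b in range(n):
            acc = 0j
            for weight, k in zip(kernel, offsets):
                idx = b + k
                if 0 <= idx < n:  # treat samples outside the signal as zero
                    acc += signal[idx] * weight
            row.append(abs(acc))
        rows.append(row)
    return rows


def l1(pred, target):
    """Mean absolute error between two equally shaped 2-D lists."""
    diffs = [abs(p - t) for pr, tr in zip(pred, target) for p, t in zip(pr, tr)]
    return sum(diffs) / len(diffs)


def multitask_loss(mel_pred, mel_target, wav_pred, wav_target, aux_weight=0.5):
    """Hypothetical combined objective: main Mel loss plus a weighted auxiliary loss."""
    return l1(mel_pred, mel_target) + aux_weight * l1(wav_pred, wav_target)


def half_max_width(row):
    """Number of time steps whose response is at least half of the row's peak."""
    peak = max(row)
    return sum(1 for v in row if v >= 0.5 * peak)


# Toy transient: a single click at sample 64 in an otherwise silent signal.
click = [0.0] * 128
click[64] = 1.0
small_scale_row, large_scale_row = cwt_magnitude(click, scales=[2.0, 8.0])
```

For this toy click, both rows peak at sample 64, and the response at the smaller scale is narrower in time than at the larger scale. That is the adaptive time-frequency resolution the abstract refers to. In a real system the wavelet-spectrogram prediction would come from a learned feature extraction network, which this sketch does not model.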
ISSN:1687-6180
1687-6172
DOI:10.1186/s13634-023-01096-x