High-Quality Text-to-Speech Implementation via Active Shallow Diffusion Mechanism

Denoising diffusion probabilistic models (DDPMs) have proven to be useful in text-to-speech (TTS) tasks; however, it has been a challenge for traditional diffusion models to carry out real-time processing because of the need for hundreds of sampling steps during the iteration. In this work, a two-st...

Ausführliche Beschreibung

Gespeichert in:
Bibliographische Detailangaben
Veröffentlicht in:Sensors (Basel, Switzerland) Switzerland), 2025-01, Vol.25 (3), p.833
Hauptverfasser: Deng, Junlin, Hou, Ruihan, Deng, Yan, Long, Yongqiu, Wu, Ning
Format: Artikel
Sprache:eng
Online-Zugang:Volltext
Tags: Tag hinzufügen
Keine Tags, Fügen Sie den ersten Tag hinzu!
Beschreibung
Zusammenfassung:Denoising diffusion probabilistic models (DDPMs) have proven to be useful in text-to-speech (TTS) tasks; however, it has been a challenge for traditional diffusion models to carry out real-time processing because of the need for hundreds of sampling steps during the iteration. In this work, a two-stage fast inference and efficient diffusion-based acoustic model of TTS, the Cascaded MixGAN-TTS (CMG-TTS), is proposed to address this problem. An active shallow diffusion mechanism is adopted to divide the CMG-TTS training process into two stages. Specifically, a basic acoustic model in the first stage is trained to provide valuable a priori knowledge for the second stage, and for the underlying acoustic modeling, a mixture combination mechanism-based linguistic encoder is introduced to work with pitch and energy predictors. In the following stage of processing, a post-net is used to optimize the mel-spectrogram reconstruction performance. The CMG-TTS is evaluated on datasets such as the AISHELL3 and LJSpeech, and the experiments show that the CMG-TTS achieves satisfactory results in both subjective and objective evaluation metrics with only one denoising step. Compared to other TTS models based on diffusion modeling, the CMG-TTS obtains a leading score in the real time factor (RTF), and both stages of the CMG-TTS are effective in the ablation studies.
ISSN:1424-8220
1424-8220
DOI:10.3390/s25030833