Noise and acoustic modeling with waveform generator in text-to-speech and neutral speech conversion
This article focuses on developing a system for high-quality synthesized and converted speech by addressing three fundamental principles. Although the noise-like component in the state-of-the-art parametric vocoders (for example, STRAIGHT) is often not accurate enough, a novel analytical approach fo...
Gespeichert in:
Veröffentlicht in: | Multimedia tools and applications 2021, Vol.80 (2), p.1969-1994 |
---|---|
Hauptverfasser: | , , |
Format: | Artikel |
Sprache: | eng |
Schlagworte: | |
Online-Zugang: | Volltext |
Tags: |
Tag hinzufügen
Keine Tags, Fügen Sie den ersten Tag hinzu!
|
Zusammenfassung: | This article focuses on developing a system for high-quality synthesized and converted speech by addressing three fundamental principles. Although the noise-like component in the state-of-the-art parametric vocoders (for example, STRAIGHT) is often not accurate enough, a novel analytical approach for modeling unvoiced excitations using a temporal envelope is proposed. Discrete All Pole, Frequency Domain Linear Prediction, Low Pass Filter, and True envelopes are firstly studied and applied to the noise excitation signal in our continuous vocoder. Second, we build a deep learning model based text–to–speech (TTS) which converts written text into human-like speech with a feed-forward and several sequence-to-sequence models (long short-term memory, gated recurrent unit, and hybrid model). Third, a new voice conversion system is proposed using a continuous fundamental frequency to provide accurate time-aligned voiced segments. The results have been evaluated in terms of objective measures and subjective listening tests. Experimental results showed that the proposed models achieved the highest speaker similarity and better quality compared with the other conventional methods. |
---|---|
ISSN: | 1380-7501 1573-7721 |
DOI: | 10.1007/s11042-020-09783-9 |