HierTTS: Expressive End-to-End Text-to-Waveform Using a Multi-Scale Hierarchical Variational Auto-Encoder

Bibliographic details
Published in:	Applied Sciences 2023-01, Vol.13 (2), p.868
Main authors:	Shang, Zengqiang; Shi, Peiyang; Zhang, Pengyuan; Wang, Li; Zhao, Guangying
Format:	Article
Language:	English
Subjects:	Acoustics; end-to-end; expressive TTS; hierarchical VAE; Information processing; Linguistics; multi-scale; Normal distribution; Phonetics; Phonology; Prosody; Speech; Speech recognition; Text-to-speech; Variables; Waveforms
Online access:	Full text

Description
Abstract:	End-to-end text-to-speech (TTS) models that directly generate waveforms from text are gaining popularity. However, existing end-to-end models are still not natural enough in their prosodic expressiveness. Additionally, previous studies on improving the expressiveness of TTS have mainly focused on acoustic models. There is a lack of research on enhancing expressiveness in an end-to-end framework. Therefore, we propose HierTTS, a highly expressive end-to-end text-to-waveform generation model. It deeply couples the hierarchical properties of speech with hierarchical variational auto-encoders and models multi-scale latent variables at the frame, phone, subword, word, and sentence levels. The hierarchical encoder encodes the speech signal from fine-grained features into coarse-grained latent variables. In contrast, the hierarchical decoder generates fine-grained features conditioned on the coarse-grained latent variables. We propose a staged KL-weighted annealing strategy to prevent hierarchical posterior collapse. Furthermore, we employ a hierarchical text encoder to extract linguistic information at different levels and apply it to both the encoder and the decoder. Experiments show that our model performs closer to natural speech in prosodic expressiveness and has better generative diversity.
ISSN:	2076-3417
DOI:	10.3390/app13020868
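
Illustration:	The abstract names a staged KL-weighted annealing strategy but does not describe the schedule. The sketch below shows one possible reading, not the authors' method: each latent level gets its own KL weight, and the weights are ramped linearly from 0 to 1 one stage after another. The level ordering, the linear ramp, the stage length, and the function names `staged_kl_weights` and `weighted_kl` are all assumptions made for illustration.

```python
# Hypothetical sketch of a staged KL-weight schedule for a hierarchical VAE.
# Assumptions (not taken from the record): levels are annealed one at a time,
# coarsest first, with a linear ramp of fixed length per stage.

LEVELS = ["sentence", "word", "subword", "phone", "frame"]  # assumed ordering


def staged_kl_weights(step, steps_per_stage=1000, levels=LEVELS):
    """Return a dict mapping each level to its KL weight in [0, 1] at `step`."""
    weights = {}
    for i, level in enumerate(levels):
        start = i * steps_per_stage  # stage i begins when stage i-1 has ended
        progress = (step - start) / steps_per_stage
        weights[level] = min(1.0, max(0.0, progress))  # clamp to [0, 1]
    return weights


def weighted_kl(kl_terms, weights):
    """Sum per-level KL values, each scaled by its current weight."""
    return sum(weights[name] * kl for name, kl in kl_terms.items())


if __name__ == "__main__":
    for step in (0, 500, 1500, 5000):
        print(step, staged_kl_weights(step))
```

In a training loop under these assumptions, `weighted_kl` would be added to the reconstruction loss at each step, so that no level's KL term is fully enforced before its stage has completed.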