NaturalSpeech: End-to-End Text-to-Speech Synthesis With Human-Level Quality

Text-to-speech (TTS) has made rapid progress in both academia and industry in recent years. Some questions naturally arise that whether a TTS system can achieve human-level quality, how to define/judge that quality, and how to achieve it. In this paper, we answer these questions by first defining th...

Ausführliche Beschreibung

Gespeichert in:
Bibliographische Detailangaben
Veröffentlicht in:IEEE transactions on pattern analysis and machine intelligence 2024-06, Vol.46 (6), p.4234-4245
Hauptverfasser: Tan, Xu, Chen, Jiawei, Liu, Haohe, Cong, Jian, Zhang, Chen, Liu, Yanqing, Wang, Xi, Leng, Yichong, Yi, Yuanhao, He, Lei, Zhao, Sheng, Qin, Tao, Soong, Frank, Liu, Tie-Yan
Format: Artikel
Sprache:eng
Schlagworte:
Online-Zugang:Volltext bestellen
Tags: Tag hinzufügen
Keine Tags, Fügen Sie den ersten Tag hinzu!
Beschreibung
Zusammenfassung:Text-to-speech (TTS) has made rapid progress in both academia and industry in recent years. Some questions naturally arise that whether a TTS system can achieve human-level quality, how to define/judge that quality, and how to achieve it. In this paper, we answer these questions by first defining the human-level quality based on the statistical significance of subjective measure and introducing appropriate guidelines to judge it, and then developing a TTS system called NaturalSpeech that achieves human-level quality on benchmark datasets. Specifically, we leverage a variational auto-encoder (VAE) for end-to-end text-to-waveform generation, with several key modules to enhance the capacity of the prior from text and reduce the complexity of the posterior from speech, including phoneme pre-training, differentiable duration modeling, bidirectional prior/posterior modeling, and a memory mechanism in VAE. Experimental evaluations on the popular LJSpeech dataset show that our proposed NaturalSpeech achieves -0.01 -0.01 CMOS (comparative mean opinion score) to human recordings at the sentence level, with Wilcoxon signed rank test at p-level p \gg 0.05 p≫0.05 , which demonstrates no statistically significant difference from human recordings for the first time.
ISSN:0162-8828
1939-3539
1939-3539
2160-9292
DOI:10.1109/TPAMI.2024.3356232