TTS-by-TTS: TTS-driven Data Augmentation for Fast and High-Quality Speech Synthesis
Main authors: | , , , |
---|---|
Format: | Article |
Language: | eng |
Subjects: | |
Summary: | In this paper, we propose a text-to-speech (TTS)-driven data augmentation
method for improving the quality of a non-autoregressive (non-AR) TTS system.
Recently proposed non-AR models, such as FastSpeech 2, have successfully
achieved fast speech synthesis. However, their quality is not
satisfactory, especially when the amount of training data is insufficient. To
address this problem, we propose an effective data augmentation method using a
well-designed autoregressive (AR) TTS system. In this method, large-scale synthetic
corpora consisting of text-waveform pairs with phoneme durations are generated by
the AR TTS system and then used to train the target non-AR model. Perceptual listening
test results showed that the proposed method significantly improved the quality
of the non-AR TTS system. In particular, we augmented a five-hour training
database into a 179-hour synthetic one. Using these databases, our TTS system,
consisting of a FastSpeech 2 acoustic model with a Parallel WaveGAN vocoder,
achieved a mean opinion score of 3.74, which is 40% higher than that achieved
by the conventional method. |
DOI: | 10.48550/arxiv.2010.13421 |
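
The figures quoted in the summary can be put in proportion with a little arithmetic. The sketch below is only illustrative: the function names are hypothetical, and the baseline score it derives is an estimate implied by the "40% higher" statement, which is presumably rounded, not a value reported in this record.

```python
# Figures quoted in the summary of this record.
RECORDED_HOURS = 5.0     # size of the original training database
SYNTHETIC_HOURS = 179.0  # size of the synthetic database
PROPOSED_MOS = 3.74      # mean opinion score of the proposed system
RELATIVE_GAIN = 0.40     # "40% higher"; assumed to be a rounded figure


def augmentation_ratio(recorded_hours: float, synthetic_hours: float) -> float:
    """Return how many hours of synthetic speech exist per recorded hour."""
    return synthetic_hours / recorded_hours


def implied_baseline_mos(proposed_mos: float, relative_gain: float) -> float:
    """Estimate the baseline score implied by a relative improvement.

    This is an inference from the summary, not a reported result.
    """
    return proposed_mos / (1.0 + relative_gain)


if __name__ == "__main__":
    ratio = augmentation_ratio(RECORDED_HOURS, SYNTHETIC_HOURS)
    baseline = implied_baseline_mos(PROPOSED_MOS, RELATIVE_GAIN)
    print(f"Synthetic-to-recorded ratio: {ratio:.1f}x")
    print(f"Implied baseline MOS (estimate): {baseline:.2f}")
```

Under these assumptions the synthetic corpus is roughly 36 times larger than the recorded one, and the conventional method would have scored somewhere near 2.7 on the same scale.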