FeatherTTS: Robust and Efficient attention based Neural TTS
Main authors: | , , , , , , , |
---|---|
Format: | Article |
Language: | eng |
Subjects: | |
Online access: | Order full text |
Summary: | Attention-based neural TTS is an elegant speech synthesis pipeline and has shown
a powerful ability to generate natural speech. However, it is still not robust
enough to meet the stability requirements of industrial products. It also
suffers from slow inference owing to the autoregressive generation
process. In this work, we propose FeatherTTS, a robust and efficient
attention-based neural TTS system. First, we propose a novel Gaussian
attention that exploits the interpretability of Gaussian attention and the strict
monotonic property of TTS. With this method, we replace the commonly used stop
token prediction architecture with attentive stop prediction. Second, we
apply block sparsity to the autoregressive decoder to speed up speech
synthesis. The experimental results show that the proposed FeatherTTS not only
nearly eliminates word skipping and repeating on particularly hard
texts while preserving the naturalness of generated speech, but also speeds up acoustic
feature generation by 3.5 times over Tacotron. Overall, the proposed FeatherTTS
can be 35x faster than real time on a single CPU. |
---|---|
DOI: | 10.48550/arxiv.2011.00935 |
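
The summary describes a monotonic Gaussian attention whose position can also signal when to stop decoding. The following is a minimal sketch of that general idea, not the authors' implementation: the softplus parameterization, the stop margin, and all function names (`gaussian_attention_step`, `should_stop`, `run_decoder`) are assumptions made for illustration, since the record gives no formulas.

```python
import math


def softplus(x):
    # Numerically stable log(1 + exp(x)); the result is always positive.
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def gaussian_attention_step(prev_mean, delta_raw, sigma_raw, num_inputs):
    """One decoder step of an illustrative monotonic Gaussian attention.

    The mean advances by a positive amount each step, so the attention
    position can only move forward over the input sequence.
    """
    mean = prev_mean + softplus(delta_raw)  # assumed parameterization
    sigma = softplus(sigma_raw) + 1e-3      # keep the width strictly positive
    scores = [
        math.exp(-((j - mean) ** 2) / (2.0 * sigma ** 2))
        for j in range(num_inputs)
    ]
    total = sum(scores)
    # Normalize so the weights sum to 1 (guarding against underflow).
    weights = [s / total for s in scores] if total > 0.0 else scores
    return mean, weights


def should_stop(mean, num_inputs, margin=1.0):
    # Hypothetical stop rule: halt once the attention mean has moved past
    # the last input position by `margin`. The paper's criterion may differ.
    return mean >= (num_inputs - 1) + margin


def run_decoder(delta_raws, num_inputs, sigma_raw=0.0):
    # Drive the attention with precomputed raw step sizes; in a real model
    # these would be predicted by the decoder network at each step.
    mean = 0.0
    means = []
    for delta_raw in delta_raws:
        mean, _ = gaussian_attention_step(mean, delta_raw, sigma_raw, num_inputs)
        means.append(mean)
        if should_stop(mean, num_inputs):
            break
    return means
```

Because the stop decision in this sketch is read directly from the attention mean, no separate stop-token classifier is needed, which is the kind of replacement the summary refers to.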