Improving robustness of spontaneous speech synthesis with linguistic speech regularization and pseudo-filled-pause insertion
Main authors: | , , , |
---|---|
Format: | Article |
Language: | eng |
Subjects: | |
Online access: | Order full text |
Summary: | We present a training method with linguistic speech regularization that
improves the robustness of spontaneous speech synthesis methods with filled
pause (FP) insertion. Spontaneous speech synthesis is aimed at producing speech
with human-like disfluencies, such as FPs. Because modeling the complex data
distribution of spontaneous speech with a rich FP vocabulary is challenging,
the quality of FP-inserted synthetic speech is often limited. To address this
issue, we present a method for synthesizing spontaneous speech that improves
robustness to diverse FP insertions. Regularization is used to stabilize the
synthesis of the linguistic speech (i.e., non-FP) elements. To further improve
robustness to diverse FP insertions, the method uses pseudo-FPs sampled from an FP
word prediction model in addition to ground-truth FPs. Our experiments demonstrated
that the proposed method improves the naturalness of synthetic speech with
ground-truth and predicted FPs by 0.24 and 0.26, respectively. |
---|---|
DOI: | 10.48550/arxiv.2210.09815 |
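
The pseudo-FP idea in the summary can be illustrated with a small sketch. This is not the authors' implementation: the FP vocabulary, the function name `insert_pseudo_fps`, and the input format (one probability list per word, as an FP word prediction model might output) are all assumptions made for illustration. The sketch samples an FP for each word position, inserts it before the word, and returns a mask marking which tokens are FPs, so that a regularization term could in principle be restricted to the linguistic (non-FP) tokens.

```python
import random

# Hypothetical FP vocabulary (an assumption); index 0 means "no FP here".
FP_VOCAB = ["<none>", "uh", "um", "er"]


def insert_pseudo_fps(words, fp_probs, rng=None):
    """Sample a filled pause for each word position and insert it before the word.

    words:    list of word strings (the linguistic, non-FP content).
    fp_probs: one list of weights per word, aligned with FP_VOCAB, standing in
              for the output of an FP word prediction model (assumed format).
    Returns (tokens, is_fp), where is_fp[i] is True if tokens[i] is an FP.
    """
    if len(words) != len(fp_probs):
        raise ValueError("words and fp_probs must have the same length")
    rng = rng or random.Random()

    tokens, is_fp = [], []
    for word, probs in zip(words, fp_probs):
        # Draw one FP label from the predicted distribution for this position.
        fp = rng.choices(FP_VOCAB, weights=probs, k=1)[0]
        if fp != "<none>":
            tokens.append(fp)
            is_fp.append(True)
        tokens.append(word)
        is_fp.append(False)
    return tokens, is_fp


if __name__ == "__main__":
    words = ["i", "think", "so"]
    # One-hot weights make the sampling deterministic for this demo.
    probs = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0]]
    print(insert_pseudo_fps(words, probs, random.Random(0)))
```

Resampling with non-degenerate weights yields a different FP pattern for the same sentence on each call, which is the sense in which pseudo-FPs expose a synthesis model to more diverse insertions than the ground-truth FPs alone.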