Personalized Spontaneous Speech Synthesis Using a Small-Sized Unsegmented Semispontaneous Speech

Bibliographic details
Published in:	IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2017-05, Vol.25 (5), p.1048-1060
Main authors:	Huang, Yi-Chin; Wu, Chung-Hsien; Chen, Yan-You; Shie, Ming-Ge; Wang, Jhing-Fa
Format:	Article
Language:	English
Subjects:	Adaptation models; Algorithms; Corpus linguistics; Data models; Fluency; Hidden Markov models; Linguistics; Personalized speech synthesis; Prosody; Segmentation; Similarity; Smoothing; Smoothing methods; Speech; Speech disorders; Speech recognition; Speech segmentation; Speech synthesis; Spontaneous speech; Spontaneous speech synthesis; Synthesis; Voice simulation

Description
Abstract:	A systematic approach is proposed for synthesizing personalized spontaneous speech from a small, unsegmented speech corpus of the target speaker. First, an automatic segmentation algorithm segments and labels a collected semispontaneous speech corpus of the target speaker. A pretrained average voice model is then adapted to the target speaker using the segmented data. A postfilter based on the modulation spectrum is applied to further improve the speaker similarity of the synthesized speech and to alleviate its over-smoothing. To generate spontaneous speech, a smoothing method applied at the prosodic word level is proposed to improve fluency. In an objective evaluation of spontaneous speech segmentation, the proposed method achieves higher segmentation accuracy than Viterbi-based forced alignment. Subjective listening tests also show that the proposed method improves the spontaneity and speaker similarity of the synthesized speech compared with speaker adaptation based on maximum likelihood linear regression.
ISSN:	2329-9290 2329-9304
DOI:	10.1109/TASLP.2017.2679603
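
The abstract mentions a modulation-spectrum postfilter but does not give its formulation. The following is a minimal sketch, assuming one commonly used form of such a filter: the log-power modulation spectrum of a generated parameter trajectory is shifted and scaled toward statistics estimated from natural speech, while the phase is kept. The function name `ms_postfilter`, the emphasis coefficient `k`, and the statistic arrays are illustrative assumptions, not the authors' implementation.

```python
import numpy as np


def ms_postfilter(traj, mu_gen, sd_gen, mu_nat, sd_nat, k=0.85):
    """Postfilter one 1-D parameter trajectory in the modulation-spectrum domain.

    traj            : generated trajectory of a single parameter dimension
    mu_gen, sd_gen  : per-bin mean / std of the log-power modulation spectrum
                      of generated speech (assumed to be precomputed)
    mu_nat, sd_nat  : the same statistics for natural speech (assumed)
    k               : emphasis coefficient in [0, 1]; 0 leaves traj unchanged
    """
    traj = np.asarray(traj, dtype=float)
    n = len(traj)
    n_bins = len(mu_gen)
    n_fft = 2 * (n_bins - 1)  # FFT length implied by the number of rfft bins
    if n > n_fft:
        raise ValueError("trajectory is longer than the FFT length of the statistics")

    # Modulation spectrum: Fourier transform of the trajectory along time.
    spec = np.fft.rfft(traj, n=n_fft)  # zero-pads traj up to n_fft
    log_power = np.log(np.abs(spec) ** 2 + 1e-12)
    phase = np.angle(spec)

    # Normalize with generated-speech statistics, denormalize with
    # natural-speech statistics, then interpolate with the original by k.
    converted = (sd_nat / sd_gen) * (log_power - mu_gen) + mu_nat
    filtered = (1.0 - k) * log_power + k * converted

    # Rebuild the complex spectrum with the original phase and invert it.
    magnitude = np.exp(filtered / 2.0)
    out = np.fft.irfft(magnitude * np.exp(1j * phase), n=n_fft)
    return out[:n]
```

In this sketch the statistics would be estimated per parameter dimension from utterances padded to the same FFT length; how the paper estimates them from a small adaptation corpus is not stated in the abstract.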