Cross-Dialect Adaptation Framework for Constructing Prosodic Models for Chinese Dialect Text-to-Speech Systems

Bibliographic Details
Published in: IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2018-01, Vol. 26 (1), pp. 108-121
Main author: Chiang, Chen-Yu
Format: Article
Language: English
Description
Abstract: This paper presents an efficient cross-dialect adaptation framework for constructing prosodic models for Chinese dialect text-to-speech systems. In this framework, dialect prosodic models are adapted from an existing Mandarin speaking rate-dependent hierarchical prosodic model. The rationale of the framework is based on the cross-dialectal similarities between Mandarin and other Chinese dialects in terms of syntactic and prosodic structures. Two main problems are addressed in this study. One problem pertains to the use of cross-dialectal similarities in the design and adaptation of the dialect speaking rate-dependent hierarchical prosodic model. The other problem pertains to the data sparseness caused by the insufficiency of an adaptation corpus covering essential linguistic contexts and prosodic events as well as a wide speaking rate range. This problem is solved by employing the structural maximum a posteriori method that hierarchically organizes the dialect speaking rate-dependent hierarchical prosodic model parameters into decision trees to facilitate parameter estimations. The effectiveness of the proposed approach was evaluated by experiments on two Chinese dialects: Min and Hakka. Objective and subjective evaluations demonstrated that the prosodic features generated by the dialect speaking rate-dependent hierarchical prosodic models were quite natural at various speaking rates ranging from 3.3 to 6.7 syllables per second. These results confirm that the proposed cross-dialect adaptation framework is effective and promising.
ISSN: 2329-9290, 2329-9304
DOI: 10.1109/TASLP.2017.2762432
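
The abstract names the structural maximum a posteriori method, with parameters organized into decision trees, as the remedy for sparse adaptation data. This record gives no formulas, so the following is only a rough sketch of the general idea and not the paper's model. It assumes scalar Gaussian-mean parameters, a hypothetical prior weight `tau`, and a hypothetical `Node` tree with made-up context names. Each node's estimate interpolates between its parent's estimate (the root starts from the source-model value, here standing in for Mandarin) and the adaptation samples that reach that node, so a leaf with no dialect data simply inherits from its ancestors.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Node:
    """Hypothetical decision-tree node; `samples` are adaptation data seen in its own context."""
    name: str
    samples: List[float] = field(default_factory=list)
    children: List["Node"] = field(default_factory=list)


def pooled(node: Node) -> List[float]:
    """Collect the samples stored at this node and at all of its descendants."""
    out = list(node.samples)
    for child in node.children:
        out.extend(pooled(child))
    return out


def map_mean(prior_mean: float, data: List[float], tau: float) -> float:
    """MAP estimate of a Gaussian mean under a conjugate prior with weight tau (assumed)."""
    return (tau * prior_mean + sum(data)) / (tau + len(data))


def structural_map(node: Node, prior_mean: float, tau: float,
                   out: Optional[Dict[str, float]] = None) -> Dict[str, float]:
    """Top-down pass: each node's estimate becomes the prior mean for its children."""
    if out is None:
        out = {}
    estimate = map_mean(prior_mean, pooled(node), tau)
    out[node.name] = estimate
    for child in node.children:
        structural_map(child, estimate, tau, out)
    return out


# Toy tree: one context with four adaptation samples, one with none (both names are made up).
root = Node("root", children=[
    Node("ctx_a", samples=[1.0, 1.0, 1.0, 1.0]),
    Node("ctx_b"),
])
estimates = structural_map(root, prior_mean=0.0, tau=4.0)
print(estimates)  # {'root': 0.5, 'ctx_a': 0.75, 'ctx_b': 0.5}
```

In this toy run the data-rich context moves furthest from the source value, while the empty context falls back to the root estimate. A larger `tau` would keep every node closer to the source model.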