Statistical Parametric Speech Synthesis Based on Speaker and Language Factorization

Bibliographic details
Published in:	IEEE Transactions on Audio, Speech, and Language Processing, 2012-08, Vol.20 (6), p.1713-1724
Main authors:	Zen, H., Braunschweiler, N., Buchholz, S., Gales, M. J. F., Knill, K., Krstulovic, S., Latorre, J.
Format:	Article
Language:	English
Subjects:	Adaptation models; Applied sciences; Clusters; Decision trees; Exact sciences and technology; Factorization; Hidden Markov models; Hidden Markov models (HMMs); Information, signal and communications theory; Interpolation; Mathematical models; Signal processing; speaker and language factorization; Speech; Speech processing; Speech recognition; Speech synthesis; statistical parametric speech synthesis; Studies; Telecommunications and information theory; Transforms; Vectors

Description
Abstract:	An increasingly common scenario in building speech synthesis and recognition systems is training on inhomogeneous data. This paper proposes a new framework for estimating hidden Markov models on data containing both multiple speakers and multiple languages. The proposed framework, speaker and language factorization, attempts to factorize speaker-/language-specific characteristics in the data and then model them using separate transforms. Language-specific factors in the data are represented by transforms based on cluster mean interpolation with cluster-dependent decision trees. Acoustic variations caused by speaker characteristics are handled by transforms based on constrained maximum-likelihood linear regression. Experimental results on statistical parametric speech synthesis show that the proposed framework enables data from multiple speakers in different languages to be used to: train a synthesis system; synthesize speech in a language using speaker characteristics estimated in a different language; and adapt to a new language.
ISSN:	1558-7916, 2329-9290, 1558-7924, 2329-9304
DOI:	10.1109/TASL.2012.2187195
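
Illustrative note: the abstract names two kinds of transform, cluster mean interpolation for language and constrained maximum-likelihood linear regression (CMLLR) for speaker. The sketch below shows how such a pair can combine when scoring a single observation against one Gaussian component. It is a minimal sketch under stated assumptions, not the paper's implementation: all function and parameter names are hypothetical, covariances are diagonal, and the speaker transform is restricted to a diagonal scale plus bias so that its log-determinant is a simple sum.

```python
import math


def interpolate_mean(cluster_means, language_weights):
    """Language side: mean = sum_p weight_p * cluster_mean_p (one weight per cluster)."""
    dim = len(cluster_means[0])
    return [
        sum(w * mean[d] for w, mean in zip(language_weights, cluster_means))
        for d in range(dim)
    ]


def speaker_transform(scale, bias, observation):
    """Speaker side: feature-space affine transform, here with a diagonal scale (assumption)."""
    return [a * o + b for a, o, b in zip(scale, observation, bias)]


def log_likelihood(observation, cluster_means, language_weights, variances, scale, bias):
    """Log-likelihood of one observation under the factorized (language, speaker) model."""
    mean = interpolate_mean(cluster_means, language_weights)
    transformed = speaker_transform(scale, bias, observation)
    # Jacobian term of the feature-space transform: log|det A| for diagonal A.
    log_det = sum(math.log(abs(a)) for a in scale)
    # Diagonal-covariance Gaussian log-density of the transformed observation.
    log_gauss = -0.5 * sum(
        math.log(2.0 * math.pi * v) + (x - m) ** 2 / v
        for x, m, v in zip(transformed, mean, variances)
    )
    return log_det + log_gauss


if __name__ == "__main__":
    # Hypothetical toy values: two clusters, two-dimensional features.
    score = log_likelihood(
        observation=[1.0, 4.0],
        cluster_means=[[1.0, 2.0], [3.0, 4.0]],
        language_weights=[0.25, 0.75],
        variances=[1.0, 1.0],
        scale=[2.0, 1.0],
        bias=[0.5, -0.5],
    )
    print(score)
```

In this toy setup, changing `language_weights` moves the model mean while leaving the speaker parameters untouched, and changing `scale` and `bias` alters the speaker side without touching the language weights. That separation is the property that lets speaker characteristics estimated in one language be reused in another.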