Frequency Warping for Speaker Adaptation in HMM-based Speech Synthesis

Bibliographic details
Published in:	Journal of Information Science and Engineering 2014-07, Vol.30 (4), p.1149-1166
Main authors:	高伟勋(Wei-Xun Gao), 曹奇英(Qi-Ying Cao)
Format:	Article
Language:	English
Subjects:	Acoustic signal processing; Acoustics; Adaptation; Applied sciences; Artificial intelligence; Biological and medical sciences; Computer science, control theory, systems; Ear and associated structures. Auditory pathways and centers. Hearing. Vocal organ. Phonation. Sound production. Echolocation; Exact sciences and technology; Fundamental and applied biological sciences. Psychology; Fundamental areas of phenomenology (including applications); Information, signal and communications theory; Mathematical models; Physics; Regression; Signal processing; Similarity; Spectra; Speech and sound recognition and synthesis. Linguistics; Speech processing; Speech recognition; Telecommunications and information theory; Vertebrates: nervous system and sense organs; Warpage; Warping
Online access:	Full text

Description
Abstract:	Speaker adaptation in speech synthesis transforms a source utterance into a target utterance that differs from the source in voice characteristics. In this paper, we apply vocal tract length normalization, which is generally used in speech recognition to remove individual speaker characteristics, to speaker adaptation in speech synthesis. We propose a frequency warping approach based on a time-varying bilinear function to reduce the weighted spectral distance between the source speaker and the target speaker. The warped spectra of the source speaker are then converted to line spectrum pairs to train hidden Markov models (HMMs). The HMMs are further adapted with the target speaker's data by algorithms based on maximum likelihood linear regression. The experimental results show that our frequency warping approach brings the warped spectra of the source speaker closer to those of the target speaker, and that the resulting adapted HMMs outperform HMMs trained on unwarped spectra in terms of synthesized speech naturalness and speaker similarity.
ISSN:	1016-2364
DOI:	10.6688/JISE.2014.30.4.13
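
Illustrative note: the abstract names a time-varying bilinear warping function but does not give its form. The following is a minimal sketch of the commonly used first-order all-pass (bilinear) frequency warp, assuming one warping factor per frame. The function names, the parameter name alpha, and the uniform frequency grid on [0, pi] are assumptions made for illustration only and are not taken from the article; the article's weighted spectral distance and its procedure for choosing the warping factor are not reproduced here.

```python
import numpy as np


def bilinear_warp(omega, alpha):
    """Map normalized angular frequencies (radians, 0..pi) through a
    first-order all-pass (bilinear) warp with factor alpha, |alpha| < 1."""
    omega = np.asarray(omega, dtype=float)
    return omega + 2.0 * np.arctan2(alpha * np.sin(omega),
                                    1.0 - alpha * np.cos(omega))


def warp_spectrum(magnitude, alpha):
    """Resample one magnitude spectrum, sampled uniformly on [0, pi],
    so that each component moves to its warped frequency."""
    magnitude = np.asarray(magnitude, dtype=float)
    omega = np.linspace(0.0, np.pi, magnitude.size)
    warped_axis = bilinear_warp(omega, alpha)  # strictly increasing for |alpha| < 1
    return np.interp(omega, warped_axis, magnitude)


def warp_frames(spectra, alphas):
    """Apply a (hypothetical) per-frame warping factor to each row of a
    frames-by-bins spectrogram, i.e. a time-varying warp."""
    return np.stack([warp_spectrum(frame, a)
                     for frame, a in zip(spectra, alphas)])
```

With alpha = 0 the warp is the identity. A positive alpha shifts spectral content toward higher frequencies and a negative alpha toward lower ones, while 0 and pi stay fixed.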