Improving Deep Neural Network Based Speech Synthesis through Contextual Feature Parametrization and Multi-Task Learning

Bibliographic details
Published in:	Journal of signal processing systems 2018-07, Vol.90 (7), p.1025-1037
Main authors:	Wen, Zhengqi; Li, Kehuang; Huang, Zhen; Lee, Chin-Hui; Tao, Jianhua
Format:	Article
Language:	eng
Subjects:	Circuits and Systems; Computer Imaging; Electrical Engineering; Engineering; Image Processing and Computer Vision; Learning; Neural networks; Parameterization; Parameters; Pattern Recognition; Pattern Recognition and Graphics; Recurrent neural networks; Signal, Image and Speech Processing; Speech; Speech recognition; Vision; Voice simulation

Description
Summary:	We propose three techniques to improve speech synthesis based on deep neural networks (DNNs). First, at the DNN input, we use a real-valued contextual feature vector to represent phoneme identity, part of speech, and pause information instead of the conventional binary vector. Second, at the DNN output layer, parameters for the pitch-scaled spectrum and aperiodicity measures are estimated for constructing the excitation signal used in our baseline synthesis vocoder. Third, a bidirectional recurrent neural network architecture with long short-term memory (BLSTM) units is adopted and trained with multi-task learning for DNN-based speech synthesis. Experimental results demonstrate that the quality of synthesized speech improves when the new input vector and output parameters are adopted. The proposed BLSTM architecture also helps the network learn the mapping from the input contextual features to the speech parameters, which further improves speech quality.
ISSN:	1939-8018 1939-8115
DOI:	10.1007/s11265-017-1293-z
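
Illustration of the first technique in the summary: a minimal sketch contrasting the conventional binary (one-hot) encoding of phoneme identity with a real-valued encoding. The summary does not say how the real-valued vectors are derived, so everything here is an assumption made for illustration: the toy phoneme inventory `PHONEMES`, the helper names, the vector size, and above all the randomly filled lookup table, which merely stands in for whatever real-valued features the authors use.

```python
import random

# Hypothetical toy phoneme inventory; the article's actual inventory is not given here.
PHONEMES = ["a", "i", "u", "k", "s", "sil"]


def one_hot(phoneme):
    """Conventional binary encoding: one dimension per phoneme in the inventory."""
    return [1.0 if p == phoneme else 0.0 for p in PHONEMES]


def make_placeholder_table(dim, seed=0):
    """Build a real-valued lookup table filled with random values.

    This is only a placeholder: it is NOT the article's parametrization,
    it just shows the shape of a dense, real-valued representation.
    """
    rng = random.Random(seed)
    return {p: [rng.uniform(-1.0, 1.0) for _ in range(dim)] for p in PHONEMES}


def real_valued(phoneme, table):
    """Real-valued encoding: look up a dense vector of fixed size."""
    return list(table[phoneme])


table = make_placeholder_table(dim=3)
binary_vec = one_hot("k")
dense_vec = real_valued("k", table)

# The binary vector grows with the inventory; the dense vector has a fixed size.
print(len(binary_vec), binary_vec)
print(len(dense_vec), dense_vec)
```

In this sketch the binary vector has one entry per phoneme and exactly one non-zero value, while the real-valued vector has a fixed length chosen independently of the inventory size. Part-of-speech and pause information could be encoded the same way and concatenated, though that too is an assumption rather than something the record states.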