Lip movement synthesis from speech based on Hidden Markov Models
Saved in:
Published in: | Speech communication 1998-10, Vol.26 (1), p.105-115 |
---|---|
Main authors: | , , |
Format: | Article |
Language: | English |
Subjects: | |
Online access: | Full text |
Abstract: | Speech intelligibility can be improved by adding lip images to the speech signal. Lip movement synthesis therefore plays an important role in realizing a natural, human-like face for computer agents. This paper proposes a novel lip movement synthesis method from speech input based on Hidden Markov Models (HMMs). The difficulties of lip movement synthesis are caused by coarticulation effects from preceding and succeeding phonemes. The proposed method gives a simple solution that generates context-dependent lip parameters by looking ahead along the HMM state sequence obtained with context-independent HMMs. In objective evaluation experiments, the proposed method is evaluated by the time-averaged error and the time-averaged differential error between synthesized lip parameters and the original ones. The results show that the time-averaged error and the time-averaged differential error of the HMM-based method with context-independent lip parameters are 8.7% and 32% smaller than those obtained with a Vector Quantization (VQ) based method. Moreover, the time-averaged error and time-averaged differential error of the proposed HMM-based method with context-dependent lip parameters are further reduced by 10.5% and 11% compared to the HMM-based method with context-independent lip parameters. The proposed HMM-based method with context-dependent lip parameters yields the largest error reductions for the phonemes /h/, /g/ and /k/. In subjective evaluation experiments, although differences in audio-visual intelligibility between the synthesized lip parameters and the original ones are insignificant, an acceptability test evaluating naturalness reflects the results of the objective evaluation. Mean opinion scores of acceptability for the VQ-based method and the proposed HMM-based method are 3.25 and 3.74, respectively. |
---|---|
ISSN: | 0167-6393 1872-7182 |
DOI: | 10.1016/S0167-6393(98)00054-5 |
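The two objective metrics named in the abstract can be sketched as follows. This is a minimal illustration, not the paper's exact formulation: it assumes lip parameters are real-valued vectors per frame, uses plain Euclidean distance, and the paper's normalization may differ.

```python
import math

def time_averaged_errors(synth, orig):
    """Sketch of the abstract's objective metrics (assumed formulation):
    - time-averaged error: mean Euclidean distance per frame between
      synthesized and original lip-parameter trajectories;
    - time-averaged differential error: the same mean distance computed
      on frame-to-frame deltas, capturing movement-smoothness errors.
    `synth` and `orig` are equal-length lists of lip-parameter vectors."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    # Mean per-frame distance between the two trajectories.
    err = sum(dist(s, o) for s, o in zip(synth, orig)) / len(synth)

    # Frame-to-frame deltas of each trajectory.
    d_synth = [[b - a for a, b in zip(s0, s1)] for s0, s1 in zip(synth, synth[1:])]
    d_orig = [[b - a for a, b in zip(o0, o1)] for o0, o1 in zip(orig, orig[1:])]

    # Mean distance between the delta trajectories.
    diff_err = sum(dist(s, o) for s, o in zip(d_synth, d_orig)) / len(d_synth)
    return err, diff_err
```

With a toy one-dimensional trajectory, `time_averaged_errors([[0.0], [1.0], [2.0]], [[0.0], [0.0], [0.0]])` returns `(1.0, 1.0)`: the frames differ by 0, 1, 2 (mean 1.0), and the deltas differ by 1 at each step.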