On the use of neural networks in articulatory speech synthesis

Bibliographic Details
Published in: The Journal of the Acoustical Society of America, February 1993, Vol. 93 (2), pp. 1109-1121
Authors: Rahim, M. G.; Goodyear, C. C.; Kleijn, W. B.; Schroeter, J.; Sondhi, M. M.
Format: Article
Language: English
Online access: Full text
Abstract: A long-standing problem in the analysis and synthesis of speech by articulatory description is the estimation of the vocal tract shape parameters from natural input speech. Methods to relate spectral parameters to articulatory positions are feasible if a sufficiently large amount of data is available. This, however, results in a high computational load and large memory requirements. Further, one needs to accommodate ambiguities in this mapping due to the nonuniqueness problem (i.e., several vocal tract shapes can result in identical spectral envelopes). This paper describes the use of artificial neural networks for acoustic-to-articulatory parameter mapping. Experimental results show that a single feed-forward neural net is unable to perform this mapping sufficiently well when trained on a large data set. An alternative procedure is proposed, based on an assembly of neural networks. Each network is assigned to a specific region in the articulatory space, and performs a mapping from cepstral values into tract areas. The training of this assembly is executed in two stages: in the first stage, a codebook of suitably normalized articulatory parameters is used, and in the second stage, real speech data are used to further improve the mapping. During synthesis, neural networks are selected by dynamic programming using a criterion that ensures smoothly varying vocal tract shapes while maintaining a good spectral match. The method is able to accommodate nonuniqueness in acoustic-to-articulatory mapping and can be bootstrapped efficiently from natural speech. Results on the performance of this procedure compared to other mapping procedures, including codebook look-up and a single multilayered network, are presented.
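The assembly-plus-dynamic-programming scheme summarized above can be sketched as follows. This is a toy illustration, not the authors' implementation: the network class, its sizes, the spectral-error function, and the smoothness weight are all placeholders, and the Viterbi-style selection merely mirrors the stated criterion (good spectral match plus smoothly varying tract shapes).

```python
import numpy as np

rng = np.random.default_rng(0)


class TinyNet:
    """One-hidden-layer feed-forward net mapping cepstra to vocal tract areas.
    A hypothetical stand-in for one trained region-specific network."""

    def __init__(self, n_in, n_hidden, n_out):
        self.W1 = rng.normal(0.0, 0.1, (n_hidden, n_in))
        self.b1 = np.zeros(n_hidden)
        self.W2 = rng.normal(0.0, 0.1, (n_out, n_hidden))
        self.b2 = np.zeros(n_out)

    def forward(self, cepstrum):
        hidden = np.tanh(self.W1 @ cepstrum + self.b1)
        return self.W2 @ hidden + self.b2  # predicted area function


def select_networks(cepstra, nets, spectral_error, smooth_weight=1.0):
    """Viterbi-style dynamic programming over frames: choose one network per
    frame so that spectral mismatch plus a penalty on frame-to-frame change
    in the predicted area function is minimized."""
    T, K = len(cepstra), len(nets)
    # Candidate area functions for every (frame, network) pair: (T, K, n_areas)
    areas = np.array([[net.forward(c) for net in nets] for c in cepstra])
    # Local spectral-mismatch cost (placeholder metric supplied by the caller)
    cost = np.array([[spectral_error(areas[t, k], cepstra[t])
                      for k in range(K)] for t in range(T)])
    back = np.zeros((T, K), dtype=int)
    acc = cost[0].copy()
    for t in range(1, T):
        new_acc = np.empty(K)
        for k in range(K):
            # Transition penalty from each previous candidate to candidate k
            step = acc + smooth_weight * np.linalg.norm(
                areas[t, k] - areas[t - 1], axis=1)
            back[t, k] = np.argmin(step)
            new_acc[k] = step[back[t, k]] + cost[t, k]
        acc = new_acc
    # Backtrace the cheapest smooth path of per-frame network choices
    path = [int(np.argmin(acc))]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    path.reverse()
    return path, np.array([areas[t, k] for t, k in enumerate(path)])
```

In this sketch each `TinyNet` plays the role of one region-specific network of the assembly; training it (the paper's two-stage codebook-then-speech procedure) is omitted, and only the synthesis-time selection step is shown.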
ISSN: 0001-4966 (print), 1520-8524 (online)
DOI: 10.1121/1.405559