Stochastic phonographic transduction for English

Bibliographic Details
Published in: Computer Speech & Language, 1996-04, Vol. 10 (2), p. 133-153
Main Authors: Luk, R.W.P., Damper, R.I.
Format: Article
Language: English
Abstract: This paper introduces and reviews stochastic phonographic transduction (SPT), a trainable (“data-driven”) technique for letter-to-phoneme conversion based on formal language theory, as well as describing and detailing one particularly simple realization of SPT. The spellings and pronunciations of English words are modelled as the productions of a stochastic grammar, inferred from example data in the form of a pronouncing dictionary. The terminal symbols of the grammar are letter–phoneme correspondences, and the rewrite (production) rules of the grammar specify how these are combined to form acceptable English word spellings and their pronunciations. Given the spelling of a word as input, a pronunciation can then be produced as output by parsing the input string according to the letter-part of the terminals and selecting the “best” sequence of corresponding phoneme-parts according to some well-motivated criteria. Although the formalism is in principle very general, restrictive assumptions must be made if practical, trainable systems are to be realized. We have assumed at this stage that the grammar is regular. Further, word generation is modelled as a Markov process in which terminals (correspondences) are simply concatenated. The SPT learning task then amounts to the inference of a set of correspondences and estimation from the training data of their associated transition probabilities. Transduction to produce a pronunciation for a word given its spelling is achieved by Viterbi decoding, using a maximum likelihood criterion. Results are presented for letter–phoneme alignment and transduction for the dictionary training data, unseen dictionary words, unseen proper nouns and novel (pseudo-)words. Two different ways of inferring correspondences are described and compared. It is found that the provision of quite limited information about the alternating vowel/consonant structure of words aids the inference process significantly. Best transduction performance obtained on unseen dictionary words is 93.7% phonemes correct, conservatively scored. Automatically inferred correspondences also consistently out-perform a published set of manually derived correspondences when used for SPT. Although the comparison is difficult to make, we believe that current results for letter-to-phoneme conversion are at least as good as the best reported so far for a data-driven approach, while being comparable in performance to knowledge-based approaches.
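
As a concrete illustration of the restricted formulation described in the abstract (correspondences concatenated under a first-order Markov model, decoded with the Viterbi algorithm under a maximum-likelihood criterion), the sketch below shows one way such a transducer could be implemented. It is a minimal toy, not the authors' system: the correspondence inventory, the phoneme symbols and the transition probabilities are invented for illustration, whereas in SPT they are inferred and estimated from an aligned pronouncing dictionary.

```python
import math
from collections import defaultdict

# Toy inventory of letter-phoneme correspondences (the grammar's terminals).
# Each entry pairs a letter-part with a phoneme-part; an empty phoneme-part
# models a silent letter. In SPT this inventory is inferred automatically from
# a pronouncing dictionary; these entries and symbols are invented for illustration.
CORRS = [
    ("ph", "f"), ("o", "oU"), ("o", "Q"), ("n", "n"),
    ("e", "E"), ("e", ""), ("t", "t"), ("i", "I"), ("c", "k"),
]

START = ("<s>", "")  # dummy correspondence marking the start of a word

# Hypothetical transition probabilities P(c_i | c_{i-1}) for the first-order
# Markov model over correspondences, with a small floor for unseen transitions.
# In the actual system these would be estimated from the aligned training data.
trans = defaultdict(lambda: 1e-4)
trans.update({
    (START, ("ph", "f")): 0.4,
    (("ph", "f"), ("o", "oU")): 0.5,
    (("o", "oU"), ("n", "n")): 0.6,
    (("n", "n"), ("e", "E")): 0.4,
    (("n", "n"), ("e", "")): 0.1,
    (("e", "E"), ("t", "t")): 0.5,
    (("t", "t"), ("i", "I")): 0.4,
    (("i", "I"), ("c", "k")): 0.5,
})


def viterbi_transduce(spelling):
    """Return the maximum-likelihood phoneme string for `spelling`.

    Lattice states are (position in the spelling, last correspondence); arcs
    are correspondences whose letter-part matches the spelling at that
    position, weighted by log P(c_i | c_{i-1}).
    """
    n = len(spelling)
    # best[state] = (log-probability, predecessor state, correspondence used)
    best = {(0, START): (0.0, None, None)}
    for pos in range(n):
        for (state_pos, last), (logp, _, _) in list(best.items()):
            if state_pos != pos:
                continue
            for corr in CORRS:
                letters, _phonemes = corr
                if spelling.startswith(letters, pos):
                    nxt = (pos + len(letters), corr)
                    score = logp + math.log(trans[(last, corr)])
                    if nxt not in best or score > best[nxt][0]:
                        best[nxt] = (score, (pos, last), corr)
    # Keep only parses that consumed the whole spelling.
    finals = [s for s in best if s[0] == n and s[1] != START]
    if not finals:
        return None  # spelling cannot be segmented with this inventory
    state = max(finals, key=lambda s: best[s][0])
    # Follow the back-pointers to recover the phoneme-parts in order.
    phonemes = []
    while best[state][2] is not None:
        _, prev, corr = best[state]
        phonemes.append(corr[1])
        state = prev
    return "".join(reversed(phonemes))


if __name__ == "__main__":
    print(viterbi_transduce("phonetic"))  # -> "foUnEtIk" with this toy inventory
```

Segmenting the spelling into letter-parts and choosing the best sequence of phoneme-parts happen jointly in the dynamic program, which mirrors the parse-and-select description in the abstract.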
ISSN: 0885-2308, 1095-8363
DOI: 10.1006/csla.1996.0009