Morpholexical and Discriminative Language Models for Turkish Automatic Speech Recognition
This paper introduces two complementary language modeling approaches for morphologically rich languages aiming to alleviate out-of-vocabulary (OOV) word problem and to exploit morphology as a knowledge source. The first model, morpholexical language model, is a generative n -gram model, where modeli...
Gespeichert in:
Veröffentlicht in: | IEEE transactions on audio, speech, and language processing speech, and language processing, 2012-10, Vol.20 (8), p.2341-2351 |
---|---|
Hauptverfasser: | , , |
Format: | Artikel |
Sprache: | eng |
Schlagworte: | |
Online-Zugang: | Volltext bestellen |
Tags: |
Tag hinzufügen
Keine Tags, Fügen Sie den ersten Tag hinzu!
|
Zusammenfassung: | This paper introduces two complementary language modeling approaches for morphologically rich languages aiming to alleviate out-of-vocabulary (OOV) word problem and to exploit morphology as a knowledge source. The first model, morpholexical language model, is a generative n -gram model, where modeling units are lexical-grammatical morphemes instead of commonly used words or statistical sub-words. This paper also proposes a novel approach for integrating the morphology into an automatic speech recognition (ASR) system in the finite-state transducer framework as a knowledge source. We accomplish that by building a morpholexical search network obtained by the composition of lexical transducer of a computational lexicon with a morpholexical language model. The second model is a linear reranking model trained discriminatively with a variant of the perceptron algorithm using morpholexical features. This variant of the perceptron algorithm, WER-sensitive perceptron, is shown to perform better for reranking n -best candidates obtained with the generative model. We apply the proposed models in Turkish broadcast news transcription task and give experimental results. The morpholexical model leads to an elegant morphology-integrated search network with unlimited vocabulary. Thus, it is highly effective in alleviating OOV problem and improves the word error rate (WER) over word and statistical sub-word models by 1.8% and 0.4% absolute, respectively. The discriminatively trained morpholexical model further improves the WER of the system by 0.8% absolute. |
---|---|
ISSN: | 1558-7916 1558-7924 |
DOI: | 10.1109/TASL.2012.2201477 |