Generative and Discriminative Methods Using Morphological Information for Sentence Segmentation of Turkish

Detailed description

Bibliographic details
Published in:	IEEE transactions on audio, speech, and language processing, 2009-07, Vol.17 (5), p.895-903
Main authors:	Guz, U., Favre, B., Hakkani-Tur, D., Tur, G.
Format:	Article
Language:	English
Subjects:	Automatic speech recognition; Boosting; Classification; Computation and Language; Computer Science; Data mining; Feature extraction; Handles; Hidden Markov models; Hybrid power systems; Mathematical models; Morphology; Natural language processing; Natural languages; Prosodic and lexical information; Segmentation; sentence segmentation; Sentences; Speech; Tasks; Turkish morphology; Vocabulary

Description
Summary:	This paper presents novel methods for generative, discriminative, and hybrid sequence classification for segmentation of Turkish word sequences into sentences. In the literature, this task is generally solved using statistical models that take advantage of lexical information, among other sources. However, Turkish has a productive morphology that generates a very large vocabulary, making the task much harder. In this paper, we introduce a new set of morphological features, extracted from words and their morphological analyses. We also extend the established method of hidden event language modeling (HELM) to factored hidden event language modeling (fHELM) to handle morphological information. To capture non-lexical information, we extract a set of prosodic features, which are mainly motivated by our previous work on other languages. We then employ discriminative classification techniques, boosting and conditional random fields (CRFs), combined with fHELM, for the task of Turkish sentence segmentation.
ISSN:	1558-7916 2329-9290 1558-7924 2329-9304
DOI:	10.1109/TASL.2009.2016393