A Stochastic Arabic Diacritizer Based on a Hybrid of Factorized and Unfactorized Textual Features

Bibliographic details
Published in:	IEEE transactions on audio, speech, and language processing, 2011-01, Vol.19 (1), p.166-175
Main authors:	Rashwan, Mohsen A A, Al-Badrashiny, Mohamed A S A A, Attia, Mohamed, Abdou, Sherif M, Rafea, Ahmed
Format:	Article
Language:	eng
Subjects:	A* search; Applied sciences; Arabic; case-ending; Computer science; corpus-based linguistics; coverage; Derivatives; diacritics; diacritization; disambiguation; Exact sciences and technology; factorized features; human language technologies (HLT); hybrid; Hybrid systems; Information, signal and communications theory; language factorization; language modeling; language models; language processing; Large-scale systems; Lattice vibration; Lattices; Miscellaneous; morphological analysis; morphology; n-grams; Natural language processing; natural language processing (NLP); Performance evaluation; phonetic transcription; phonological analysis; Signal processing; Speech; Speech synthesis; statistical; statistical language model (SLM); stochastic; Stochastic processes; Stochastic systems; Studies; Switches; Syntax; System testing; Telecommunications and information theory; Texts; Training; unfactorized features; Vocabulary; vowelization

Description
Abstract:	This paper introduces a large-scale dual-mode stochastic system to automatically diacritize raw Arabic text. The first of these modes determines the most likely diacritics by choosing the sequence of full-form Arabic word diacritizations with maximum marginal probability via A* lattice search and long-horizon n-gram probability estimation. When full-form words are out-of-vocabulary (OOV), the system switches to the second mode, which factorizes each Arabic word into all its possible morphological constituents, then uses the same techniques as the first mode to get the most likely sequence of morphemes, and hence the most likely diacritization. While the second mode achieves far better coverage of the highly derivative and inflective Arabic language, the first mode is faster to learn, i.e., yields better disambiguation results for the same size of training corpora, especially for inferring syntactical (case-ending) diacritics. The presented hybrid system, which benefits from the advantages of both modes, has experimentally been found superior to the best performing reported systems of Habash and Rambow, and of Zitouni, using the same training and test corpus for the sake of fair comparison. The word error rates of (morphological diacritization, overall diacritization including the case endings) for the three systems are, respectively, (3.1%, 12.5%), (5.5%, 14.9%), and (7.9%, 18%). The hybrid architecture of language factorizing and unfactorizing components may be inspiring to other NLP/HLT problems in analogous situations.
ISSN:	1558-7916 2329-9290 1558-7924 2329-9304
DOI:	10.1109/TASL.2010.2045240
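
Illustration:	The dual-mode idea in the abstract (look up full-form diacritizations first, fall back to a factorized mode for OOV words, then pick the best-scoring path through the candidate lattice) can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the authors' implementation: the lexicon, the bigram probabilities, the Latin transliterations, and the names `FULL_FORM_LEXICON`, `BIGRAM`, `factorized_candidates`, and `best_path` are all hypothetical. The search shown is uniform-cost best-first search (equivalent to A* with a zero heuristic) over bigram costs, whereas the paper reports A* search with long-horizon n-grams, and the fallback here is only a placeholder for real morphological factorization.

```python
import heapq
import math

# Hypothetical toy lexicon: undiacritized word -> candidate full-form diacritizations.
FULL_FORM_LEXICON = {
    "ktb": ["kataba", "kutub"],
    "alwld": ["alwaladu", "alwalada"],
}

# Hypothetical bigram probabilities over diacritized forms ("<s>" marks sentence start).
BIGRAM = {
    ("<s>", "kataba"): 0.6,
    ("<s>", "kutub"): 0.4,
    ("kataba", "alwaladu"): 0.7,
    ("kataba", "alwalada"): 0.3,
    ("kutub", "alwaladu"): 0.2,
    ("kutub", "alwalada"): 0.8,
}
FLOOR = 1e-6  # assumed probability for unseen bigrams


def factorized_candidates(word):
    """Placeholder for the second mode; a real system would analyze morphemes."""
    return [word]


def candidates(word):
    """Return (candidate forms, mode used) for one undiacritized word."""
    forms = FULL_FORM_LEXICON.get(word)
    if forms:
        return forms, "full-form"
    return factorized_candidates(word), "factorized"  # OOV fallback


def cost(prev, cur):
    """Non-negative edge cost: negative log of the bigram probability."""
    return -math.log(BIGRAM.get((prev, cur), FLOOR))


def best_path(words):
    """Uniform-cost search for the cheapest path through the candidate lattice."""
    lattice = [candidates(w)[0] for w in words]
    heap = [(0.0, 0, ("<s>",))]  # (cost so far, position, path so far)
    settled = set()  # (position, last form) states already expanded
    while heap:
        g, i, path = heapq.heappop(heap)
        if i == len(lattice):
            return list(path[1:]), g  # first complete path popped is the cheapest
        state = (i, path[-1])
        if state in settled:
            continue
        settled.add(state)
        for form in lattice[i]:
            heapq.heappush(heap, (g + cost(path[-1], form), i + 1, path + (form,)))
    return [], math.inf


if __name__ == "__main__":
    print(best_path(["ktb", "alwld"]))
```

Because every edge cost is a negative log probability and therefore non-negative, the first complete path taken off the priority queue is the most probable one under the toy bigram model. Extending the state from the last form to the last several forms would be one way to approximate a longer n-gram horizon.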