High Order Conditional Random Field Based Part of Speech Tagger and Ambiguity Resolver for Malayalam - a Highly Agglutinative Language

Bibliographic details
Published in: International journal of advanced research in computer science, 2011-09, Vol. 2 (5)
Main authors: S, Bindu M; Idicula, Sumam Mary
Format: Article
Language: English
Description
Abstract: Part-of-speech (POS) tagging, also called grammatical tagging, assigns a lexical class marker to every word in a document. It is an essential preprocessing step in many NLP systems. Tagged corpora play a significant role in machine translation, information retrieval, and data mining. POS tagging in Malayalam is difficult because it is an agglutinative language and 80-85% of the words in Malayalam text documents are compound words. Decomposing these words into their constituents is necessary for determining their POS tags. Sometimes a single word admits more than one morphological analysis and hence more than one POS tag. Correctly resolving this ambiguity for each occurrence of the word is crucial in many NLP applications. Tag sets currently available for other languages give importance only to the morphological and syntactic properties of the language, while the tag set designed by us considers the semantic features of the language. To test the system, documents from well-known Malayalam newspapers and magazines were selected. Up to 2352 sentences were tested, including simple, complex, and compound sentences. A word-level tagging accuracy of 95% and a sentence-level accuracy of 91% were obtained.
ISSN:0976-5697
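
Note on the reported metrics: the abstract gives both a word-level and a sentence-level accuracy without defining them. The sketch below shows one common way such figures are computed, assuming a sentence counts as correct only when every word in it is tagged correctly. The function name `tagging_accuracy` and the tag labels in the example are hypothetical illustrations, not the authors' tag set or evaluation code.

```python
from typing import List, Tuple


def tagging_accuracy(
    gold: List[List[str]], pred: List[List[str]]
) -> Tuple[float, float]:
    """Return (word_level, sentence_level) accuracy for tagged sentences.

    Assumption: a sentence is correct only if all of its tags match.
    """
    if not gold or len(gold) != len(pred):
        raise ValueError("gold and pred must be non-empty and equally long")

    total_words = 0
    correct_words = 0
    correct_sentences = 0
    for gold_tags, pred_tags in zip(gold, pred):
        if len(gold_tags) != len(pred_tags):
            raise ValueError("each sentence needs one predicted tag per word")
        matches = sum(g == p for g, p in zip(gold_tags, pred_tags))
        total_words += len(gold_tags)
        correct_words += matches
        if matches == len(gold_tags):
            correct_sentences += 1

    if total_words == 0:
        raise ValueError("no words to evaluate")
    return correct_words / total_words, correct_sentences / len(gold)


# Illustrative data only: two sentences, one tagging error in the second.
gold = [["NOUN", "VERB"], ["PRON", "NOUN", "VERB"]]
pred = [["NOUN", "VERB"], ["PRON", "ADJ", "VERB"]]
word_acc, sent_acc = tagging_accuracy(gold, pred)
print(f"word-level: {word_acc:.2f}, sentence-level: {sent_acc:.2f}")
```

Under this definition a single wrong tag makes its whole sentence count as wrong, so sentence-level accuracy can never exceed word-level accuracy, which is consistent with the 91% and 95% figures reported above.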