A comparative study on Thai word segmentation approaches

Bibliographic details
Main authors: Haruechaiyasak, C., Kongyoung, S., Dailey, M.
Format: Conference proceeding
Language: English
Description
Summary: In this paper, we analyze and compare various approaches to Thai word segmentation. Word segmentation approaches can be classified into two distinct types: dictionary-based (DCB) and machine-learning-based (MLB). The DCB approach relies on a set of terms for parsing and segmenting input texts, whereas the MLB approach relies on a model trained from a corpus using machine learning techniques. We compare two algorithms from the DCB approach, longest matching and maximal matching, and four algorithms from the MLB approach: Naive Bayes (NB), decision tree, support vector machine (SVM), and conditional random field (CRF). In the experimental results, the DCB approach yielded better performance than the NB, decision tree, and SVM algorithms from the MLB approach. However, the best performance was obtained from the CRF algorithm, with precision and recall of 95.79% and 94.98%, respectively.
DOI: 10.1109/ECTICON.2008.4600388
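
Illustration: the summary names two dictionary-based algorithms, longest matching and maximal matching, without describing them. The sketch below shows one common reading of each and is not taken from the paper. Longest matching is written here as a greedy left-to-right scan. Maximal matching is taken to mean choosing the segmentation with the fewest tokens. The function names, the toy Latin-letter dictionary, and the one-character fallback for unknown text are all assumptions made for the example.

```python
def longest_matching(text, dictionary):
    """Greedy scan: at each position take the longest dictionary word."""
    max_len = max(map(len, dictionary), default=1)
    words, i = [], 0
    while i < len(text):
        # Try the longest candidate first, then shrink.
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in dictionary:
                break
        else:
            j = i + 1  # assumption: unknown text becomes a one-character token
        words.append(text[i:j])
        i = j
    return words


def maximal_matching(text, dictionary):
    """Dynamic programming: pick the segmentation with the fewest tokens."""
    max_len = max(map(len, dictionary), default=1)
    n = len(text)
    # best[i] holds (token count, tokens) for the best split of text[:i].
    best = [(0, [])] + [None] * n
    for i in range(1, n + 1):
        for j in range(max(0, i - max_len), i):
            piece = text[j:i]
            # Same one-character fallback as above for unknown text.
            if piece in dictionary or i - j == 1:
                cand = (best[j][0] + 1, best[j][1] + [piece])
                if best[i] is None or cand[0] < best[i][0]:
                    best[i] = cand
    return best[n][1]


# Toy dictionary (hypothetical, not Thai data) chosen so the two differ.
toy_dictionary = {"abc", "ab", "cde", "d", "e"}
greedy = longest_matching("abcde", toy_dictionary)   # ['abc', 'd', 'e']
fewest = maximal_matching("abcde", toy_dictionary)   # ['ab', 'cde']
```

On this toy input the greedy scan commits to "abc" and is then forced into two short tokens, while the fewest-token search prefers "ab" followed by "cde". Whether the paper's implementations handle ties or unknown words this way is not stated in the record.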