Collection-based compound noun segmentation for Korean information retrieval

Bibliographic details
Published in:	Information retrieval (Boston) 2006-11, Vol.9 (5), p.613-631
Main authors:	KANG, In-Su; NA, Seung-Hoon; LEE, Jong-Hyeok
Format:	Article
Language:	English
Subjects:	Algorithms; Candidates; Codes; Decomposition; Dictionaries; Exact sciences and technology; Information and communication sciences; Information retrieval; Information science. Documentation; Language; Library and information science. General aspects; Methods; Natural language processing; Sciences and techniques of general use; Segmentation; Studies; Theories, definitions and sources in information science; Words (language)

Description
Abstract:	Compound noun segmentation is a key first step in language processing for Korean. Thus far, most approaches have required some form of human supervision, such as pre-existing dictionaries, segmented compound nouns, or heuristic rules. As a result, they suffer from the unknown word problem, which unsupervised approaches can overcome. However, previous unsupervised methods normally do not consider all possible segmentation candidates, rely on character-based segmentation clues such as bi-grams or all-length n-grams, or both, so they are prone to settling on a locally optimal solution. To overcome this problem, this paper proposes an unsupervised segmentation algorithm that searches all possible segmentation candidates for the most likely segmentation, using a word-based segmentation context. The word-based segmentation clues come from a dictionary that is automatically generated from a corpus. Experiments using three test collections show that the segmentation algorithm can be successfully applied to Korean information retrieval, improving on a dictionary-based longest-matching algorithm.
ISSN:	1386-4564, 1573-7659
DOI:	10.1007/s10791-006-9007-3
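
Illustration:	The abstract does not state the scoring function or how the dictionary is derived from the corpus, so the following Python sketch is only an illustration of the general idea, not the authors' algorithm. It assumes a dictionary of raw token counts, unigram log-probabilities as the word-based score, and a fixed per-character penalty for unknown pieces; the names `build_dictionary`, `segment`, and `unknown_logp` are hypothetical. Dynamic programming considers every split point, so all segmentation candidates are covered without listing them one by one.

```python
import math
from collections import Counter


def build_dictionary(corpus_tokens):
    """Count tokens seen in the corpus; the counts act as the word dictionary.

    Assumption: plain token counts stand in for whatever dictionary
    generation procedure the paper actually uses.
    """
    return Counter(corpus_tokens)


def segment(compound, counts, unknown_logp=-20.0):
    """Return the highest-scoring split of `compound` over all candidate splits.

    best[i] holds (score, segments) for the prefix compound[:i]. Every split
    point j < i is tried, so no candidate segmentation is skipped.
    """
    total = sum(counts.values())
    best = [(0.0, [])] + [(-math.inf, [])] * len(compound)
    for i in range(1, len(compound) + 1):
        for j in range(i):
            piece = compound[j:i]
            if piece in counts:
                # Assumed score: unigram log-probability of a dictionary word.
                logp = math.log(counts[piece] / total)
            else:
                # Hypothetical penalty: an unknown piece costs more the longer it is.
                logp = unknown_logp * len(piece)
            score = best[j][0] + logp
            if score > best[i][0]:
                best[i] = (score, best[j][1] + [piece])
    return best[len(compound)][1]


# Toy corpus, invented for illustration only.
counts = build_dictionary(["정보", "검색", "정보", "한국어", "검색", "시스템"])
print(segment("한국어정보검색", counts))  # ['한국어', '정보', '검색']
```

With this scoring, a piece missing from the dictionary is still returned as its own segment (for example, `segment("정보화", counts)` yields `['정보', '화']`), which is one simple way to keep unknown words from blocking a segmentation.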