Unknown Chinese word extraction based on variety of overlapping strings


Bibliographic details
Published in: Information processing & management 2013-03, Vol.49 (2), p.497-512
Main authors: Ye, Yunming, Wu, Qingyao, Li, Yan, Chow, K.P., Hui, Lucas C.K., Yiu, S.M.
Format: Article
Language: eng
Online access: Full text
Description
Summary:
► We use unsupervised methods to extract meaningful words from Chinese sentences.
► We develop a new goodness measure to compute a goodness score for word extraction.
► We compare the score of a word with the scores of the strings overlapping the word.
► A word is likely to be meaningful if its score is larger than those of its overlapping strings.
► We show that the new measure is more effective than the other goodness measures.

Some languages, such as Chinese, have no word delimiters. To extract words from a sentence in these languages, we usually rely on a dictionary of known words. For unknown words, some approaches rely on a domain-specific dictionary or a tailor-made training data set. However, this information may not be available. Another direction is to use unsupervised methods. These methods rely on a goodness measure that uses statistics of the given text to estimate how likely a candidate string is to be a meaningful word. The most challenging issue is to identify low-frequency meaningful words.

In this paper, we first show by an empirical study on Chinese texts that none of the classical goodness measures can effectively separate low-frequency meaningful words from meaningless ones. To solve this problem, we propose a new goodness measure, the overlap variety method. The key idea behind the new measure is not to consider the absolute number of occurrences of the candidate (i.e., a string of Chinese characters) but to compare the goodness measure (we use the accessor variety) of the candidate with those of the strings overlapping the candidate. The candidate is likely to be meaningful if its accessor variety is larger than the accessor varieties of the overlapping strings. We implement an extraction system for unknown Chinese words, UNExtract, based on this overlap variety method.

We evaluate our approach using the CIPS-SIGHAN-2010 bakeoff corpora and show that the proposed measure is more effective than five other state-of-the-art goodness measures (accessor variety, branch entropy, description length gain, frequency substring reduction, pointwise mutual information), especially for low-frequency words and bi-gram words.
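
The comparison described in the summary can be illustrated with a minimal Python sketch. It is not the UNExtract implementation: the function names, the boundary markers `^` and `$`, the `max_len` limit, the definition of an overlapping string as one that crosses exactly one boundary of the candidate, and the final "fraction of rivals beaten" score are all assumptions made here for illustration. Accessor variety is taken in its usual form, the smaller of the number of distinct left neighbours and the number of distinct right neighbours of a string.

```python
from collections import defaultdict


def accessor_variety(sentences, max_len=4):
    """Map each substring (up to max_len chars) to
    min(#distinct left neighbours, #distinct right neighbours)."""
    left, right = defaultdict(set), defaultdict(set)
    for s in sentences:
        padded = "^" + s + "$"  # assumed sentence-boundary markers
        for i in range(1, len(padded) - 1):
            for n in range(1, max_len + 1):
                j = i + n
                if j > len(padded) - 1:
                    break
                sub = padded[i:j]
                left[sub].add(padded[i - 1])
                right[sub].add(padded[j])
    return {sub: min(len(left[sub]), len(right[sub])) for sub in left}


def overlapping_strings(sentence, start, end, max_len=4):
    """Substrings (length 2..max_len) that cross exactly one boundary
    of the candidate sentence[start:end] (assumed definition)."""
    out = set()
    for i in range(len(sentence)):
        for j in range(i + 2, min(i + max_len, len(sentence)) + 1):
            crosses_left = i < start < j < end
            crosses_right = start < i < end < j
            if crosses_left or crosses_right:
                out.add(sentence[i:j])
    return out


def overlap_score(candidate, sentences, av, max_len=4):
    """Fraction of overlapping strings whose accessor variety is
    lower than the candidate's (hypothetical scoring rule)."""
    rivals = set()
    for s in sentences:
        pos = s.find(candidate)
        while pos != -1:
            rivals |= overlapping_strings(s, pos, pos + len(candidate), max_len)
            pos = s.find(candidate, pos + 1)
    if not rivals:
        return 0.0
    own = av.get(candidate, 0)
    return sum(av.get(r, 0) < own for r in rivals) / len(rivals)


# Toy corpus: "北京" appears three times in different contexts.
corpus = ["我喜欢北京", "他去北京了", "北京很大"]
av = accessor_variety(corpus)
print(av["北京"], overlap_score("北京", corpus, av))
print(av["京很"], overlap_score("京很", corpus, av))
```

In this toy corpus the word "北京" occurs only three times, yet it scores higher than every string that straddles its boundaries, while the non-word "京很" loses to the overlapping "北京". This mirrors the idea that relative comparison, rather than absolute frequency, separates low-frequency words from meaningless strings.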
ISSN: 0306-4573, 1873-5371
DOI: 10.1016/j.ipm.2012.09.004