Multi-Distribution Characteristics Based Chinese Entity Synonym Extraction from The Web

Entity synonyms play an important role in natural language processing applications, such as query expansion and question answering. There are three main distribution characteristics in web texts:1) appearing in parallel structures; 2) occurring with specific patterns in sentences; and 3) distributed...

Ausführliche Beschreibung

Gespeichert in:
Bibliographische Detailangaben
Veröffentlicht in:International journal of intelligent information technologies 2019-07, Vol.15 (3), p.42-63
Hauptverfasser: Luo, Xiangfeng, Ma, Xiuxia, Huang, Subin, Guo, Yike
Format: Artikel
Sprache:eng
Schlagworte:
Online-Zugang:Volltext
Tags: Tag hinzufügen
Keine Tags, Fügen Sie den ersten Tag hinzu!
Beschreibung
Zusammenfassung:Entity synonyms play an important role in natural language processing applications, such as query expansion and question answering. There are three main distribution characteristics in web texts:1) appearing in parallel structures; 2) occurring with specific patterns in sentences; and 3) distributed in similar contexts. The first and second characteristics rely on reliable prior knowledge and are susceptive to data sparseness, bringing high accuracy and low recall to synonym extraction. The third one may lead to high recall but low accuracy, since it identifies a somewhat loose semantic similarity. Existing methods, such as context-based and pattern-based methods, only consider one characteristic for synonym extraction and rarely take their complementarity into account. For increasing recall, this article proposes a novel extraction framework that can combine the three characteristics for extracting synonyms from the web, where an Entity Synonym Network (ESN) is built to incorporate synonymous knowledge. To improve accuracy, the article treats synonym detection as a ranking problem and uses the Spreading Activation model as a ranking means to detect the hard noise in ESN. Experimental results show the proposed method achieves better accuracy and recall than the state-of-the-art methods.
ISSN:1548-3657
1548-3665
DOI:10.4018/IJIIT.2019070103