A node resistance-based probability model for resolving duplicate named entities

Bibliographic details
Published in:	Scientometrics 2020-09, Vol.124 (3), p.1721-1743
Main authors:	Kang, Namyong; Kim, Jeong-Jae; On, Byung-Won; Lee, Ingyu
Format:	Article
Language:	English
Subjects:	Computer Science; Computer Science, Interdisciplinary Applications; Datasets; Information Science & Library Science; Information Storage and Retrieval; Library Science; Nodes; Probability; Production methods; Reproduction (copying); Science & Technology; Technology
Online access:	Full text

Description
Summary:	Duplicate entities tend to seriously degrade data quality. Despite remarkable recent progress, existing methods still produce a large number of false positives (i.e., an entity judged to be a duplicate when it is not), which are likely to impair accuracy. To address this challenge, we propose a novel node resistance-based probability model in which we view a given data set as a graph of entities that are linked to each other via relationships, and then compute the probability value between two entities to see how similar the two entities are. In particular, each node in the graph has its own resistance value equal to 1 - confidence (normalized to the range 0–1), and an amount equal to resistance · probability value is filtered out at each node while the probability value is computed. To evaluate the proposed model, we performed extensive experiments with different data sets including ACM (https://dl.acm.org), DBLP (https://dblp.uni-trier.de), and IMDB (https://imdb.com). Our experimental results show that the proposed probability model outperforms the existing probability model, improving average F1 scores by up to 14% and never worsening them.
ISSN:	0138-9130 1588-2861
DOI:	10.1007/s11192-020-03585-4
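
Illustrative sketch: the summary states only that each node has a resistance of 1 - confidence and that resistance · probability is filtered out at each node. It does not give the propagation rule, so the code below is a minimal sketch under assumptions and not the authors' published algorithm. The function name `resistance_probability`, the equal split of probability among neighbors, the hop limit `max_hops`, the default confidence of 1.0 for unlisted nodes, and the toy graph are all hypothetical choices made for illustration.

```python
from collections import defaultdict


def resistance_probability(graph, confidence, source, target, max_hops=3):
    """Probability mass reaching `target` from `source` within `max_hops` steps.

    Assumptions (not taken from the paper): mass starts at 1.0 on `source`,
    each node splits what it passes on equally among its neighbors, and a
    node without a listed confidence is treated as fully confident.
    """
    frontier = {source: 1.0}
    reached = 0.0
    for _ in range(max_hops):
        next_frontier = defaultdict(float)
        for node, mass in frontier.items():
            neighbors = graph.get(node, [])
            if not neighbors:
                continue
            # Resistance is 1 - confidence; resistance * mass is filtered out here.
            resistance = 1.0 - confidence.get(node, 1.0)
            passed = mass - resistance * mass
            share = passed / len(neighbors)
            for neighbor in neighbors:
                if neighbor == target:
                    reached += share  # mass is absorbed when it hits the target
                else:
                    next_frontier[neighbor] += share
        frontier = next_frontier
    return reached


# Hypothetical toy graph: two author nodes ("a", "b") linked through papers.
graph = {"a": ["p1", "p2"], "p1": ["a", "b"], "p2": ["a"], "b": ["p1"]}

no_resistance = resistance_probability(graph, {}, "a", "b", max_hops=2)
with_resistance = resistance_probability(graph, {"p1": 0.6}, "a", "b", max_hops=2)
print(no_resistance, with_resistance)
```

In this toy setup, lowering the confidence of the intermediate node `p1` reduces the probability that flows from `a` to `b`, which is the filtering effect the summary describes. How the resulting value is thresholded to decide whether two entities are duplicates is not covered here.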