Mining document collections to facilitate accurate approximate entity matching

Many entity extraction techniques leverage large reference entity tables to identify entities in documents. Often, an entity is referenced in document collections differently from that in the reference entity tables. Therefore, we study the problem of determining whether or not a substring "app...

Ausführliche Beschreibung

Gespeichert in:

Bibliographische Detailangaben
Veröffentlicht in:	Proceedings of the VLDB Endowment 2009-08, Vol.2 (1), p.395-406
Hauptverfasser:	Chaudhuri, Surajit, Ganti, Venkatesh, Xin, Dong
Format:	Artikel
Sprache:	eng
Online-Zugang:	Volltext
Tags:	Tag hinzufügen Keine Tags, Fügen Sie den ersten Tag hinzu!

Beschreibung
Zusammenfassung:	Many entity extraction techniques leverage large reference entity tables to identify entities in documents. Often, an entity is referenced in document collections differently from that in the reference entity tables. Therefore, we study the problem of determining whether or not a substring "approximately" matches with a reference entity. Similarity measures which exploit the correlation between candidate substrings and reference entities across a large number of documents are known to be more robust than traditional stand alone string-based similarity functions. However, such an approach has significant efficiency challenges. In this paper, we adopt a new architecture and propose new techniques to address these efficiency challenges. We mine document collections and expand a given reference entity table with variations of each of its entities. Thus, the problem of approximately matching an input string against reference entities reduces to that of exact match against the expanded reference table, which can be implemented efficiently. In an extensive experimental evaluation, we demonstrate the accuracy and scalability of our techniques.
ISSN:	2150-8097 2150-8097
DOI:	10.14778/1687627.1687673