Duplicate Record Detection: A Survey

Bibliographic Details
Published in:	IEEE Transactions on Knowledge and Data Engineering, 2007-01, Vol. 19 (1), p. 1-16
Main authors:	Elmagarmid, A.K.; Ipeirotis, P.G.; Verykios, V.S.
Format:	Article
Language:	English
Subjects:	Algorithms; Applied sciences; Approximation; Cleaning; Computer errors; Computer science; control theory; systems; Computer Society; Cost function; Couplings; data cleaning; data deduplication; data integration; Data processing. List processing. Character string processing; database hardening; Detection algorithms; Duplicate detection; entity matching; entity resolution; Errors; Exact sciences and technology; fuzzy duplicate detection; identity uncertainty; instance identification; Memory organisation. Data processing; Mirrors; name matching; record linkage; Relational databases; Representations; Reproduction; Scalability; Similarity; Software; Tasks; Uncertainty

Description
Abstract:	Often, in the real world, entities have two or more representations in databases. Duplicate records do not share a common key and/or they contain errors that make duplicate matching a difficult task. Errors are introduced as the result of transcription errors, incomplete information, lack of standard formats, or any combination of these factors. In this paper, we present a thorough analysis of the literature on duplicate record detection. We cover similarity metrics that are commonly used to detect similar field entries, and we present an extensive set of duplicate detection algorithms that can detect approximately duplicate records in a database. We also cover multiple techniques for improving the efficiency and scalability of approximate duplicate detection algorithms. We conclude with coverage of existing tools and with a brief discussion of the big open problems in the area.
ISSN:	1041-4347; 1558-2191
DOI:	10.1109/TKDE.2007.250581