Framework for evaluating clustering algorithms in duplicate detection
The presence of duplicate records is a major data quality concern in large databases. To detect duplicates, entity resolution also known as duplication detection or record linkage is used as a part of the data cleaning process to identify records that potentially refer to the same real-world entity....
Gespeichert in:
Veröffentlicht in: | Proceedings of the VLDB Endowment 2009-08, Vol.2 (1), p.1282-1293 |
---|---|
Hauptverfasser: | , , , |
Format: | Artikel |
Sprache: | eng |
Online-Zugang: | Volltext |
Tags: |
Tag hinzufügen
Keine Tags, Fügen Sie den ersten Tag hinzu!
|
Zusammenfassung: | The presence of duplicate records is a major data quality concern in large databases. To detect duplicates,
entity resolution
also known as
duplication detection
or
record linkage
is used as a part of the data cleaning process to identify records that potentially refer to the same real-world entity. We present the Stringer system that provides an evaluation framework for understanding what barriers remain towards the goal of truly scalable and general purpose duplication detection algorithms. In this paper, we use Stringer to evaluate the quality of the clusters (groups of potential duplicates) obtained from several unconstrained clustering algorithms used in concert with approximate join techniques. Our work is motivated by the recent significant advancements that have made approximate join algorithms highly scalable. Our extensive evaluation reveals that some clustering algorithms that have never been considered for duplicate detection, perform extremely well in terms of both accuracy and scalability. |
---|---|
ISSN: | 2150-8097 2150-8097 |
DOI: | 10.14778/1687627.1687771 |