Active learning of record matching packages


Bibliographic details
Main authors: ARASU ARVIND, GÖTZ MICHAELA, KAUSHIK SHRIRAGHAV
Format: Patent
Language: English
Description
Summary: An active learning record matching system and method for producing a record matching package that is used to identify pairs of duplicate records. Embodiments of the system and method allow a precision threshold to be specified and then generate a learned record matching package having precision greater than this threshold and a recall close to the best possible recall. Embodiments of the system and method use a blocking technique to restrict the space of record matching packages considered and to scale to large inputs. The learning method considers several record matching packages, estimates the precision and recall of each package, and identifies the package with maximum recall among those whose precision is greater than or equal to the given precision threshold. A human domain expert labels a sample of record pairs in the output of the package as matches or non-matches, and this labeling is used to estimate the precision of the package.
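
The selection step described in the summary can be illustrated with a minimal Python sketch. It is not the patented method: the names `estimate_precision`, `select_package`, and `label_fn` are hypothetical, each candidate package is represented simply by the list of record pairs it outputs, the human expert is modeled as a callback, and blocking is omitted. Because the total number of true duplicates is the same for every package, the sketch ranks packages by estimated precision times output size as a stand-in for recall.

```python
import random
from typing import Callable, Dict, Optional, Sequence, Tuple

Pair = Tuple[str, str]  # a pair of record identifiers (hypothetical representation)


def estimate_precision(
    output_pairs: Sequence[Pair],
    label_fn: Callable[[Pair], bool],
    sample_size: int,
    rng: random.Random,
) -> float:
    """Estimate precision from a labeled random sample of a package's output."""
    sample = rng.sample(list(output_pairs), min(sample_size, len(output_pairs)))
    matches = sum(1 for pair in sample if label_fn(pair))
    return matches / len(sample)


def select_package(
    candidates: Dict[str, Sequence[Pair]],
    precision_threshold: float,
    label_fn: Callable[[Pair], bool],
    sample_size: int = 50,
    seed: int = 0,
) -> Optional[str]:
    """Return the candidate with the highest recall proxy among those whose
    estimated precision meets the threshold, or None if none qualifies."""
    rng = random.Random(seed)
    best_name: Optional[str] = None
    best_score = -1.0
    for name, output_pairs in candidates.items():
        if not output_pairs:
            continue  # precision is undefined for an empty output
        precision = estimate_precision(output_pairs, label_fn, sample_size, rng)
        # Estimated number of true matches found; proportional to recall.
        score = precision * len(output_pairs)
        if precision >= precision_threshold and score > best_score:
            best_name, best_score = name, score
    return best_name


if __name__ == "__main__":
    # Toy data: the set below plays the role of the expert's judgments.
    true_matches = {("a1", "a2"), ("b1", "b2"), ("c1", "c2"), ("d1", "d2")}
    candidates = {
        "strict": [("a1", "a2")],
        "medium": [("a1", "a2"), ("b1", "b2"), ("c1", "c2")],
        "loose": [("a1", "a2"), ("b1", "b2"), ("c1", "c2"), ("d1", "d2"),
                  ("a1", "b2"), ("b1", "c2"), ("c1", "d2")],
    }
    print(select_package(candidates, 0.9, lambda p: p in true_matches))
```

In this toy run the sample size exceeds every output size, so each output is labeled in full; with real data only the sampled pairs would be sent to the expert, and the precision figure would carry sampling error that a practical system would need to account for.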