Scalable binning for big data deduplication

Bibliographic details
Main authors: Ilyas, Ihab F; Beskales, George
Format: Patent
Language: English
Description
Summary: A computer system is presented that efficiently generates all pairs of records that have a certain similarity. Similarity is defined in terms of the textual similarity of the record attributes and/or the absolute difference of numeric record attributes. The software assigns each record to a number of bins and then compares only pairs of records that belong to the same bin. This is more efficient than comparing all pairs of records, because far fewer record pairs are compared with each other.
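
The binning idea in the summary can be illustrated with a minimal Python sketch. The record does not describe how the patent actually forms its bins, so everything below is an assumption made for illustration: the `bin_keys` rule (one bin per lowercase name token, plus two adjacent numeric buckets), the field names `name` and `age`, and the bucket `width` are all hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def bin_keys(record, width=5):
    """Hypothetical binning rule, not the patent's actual scheme.

    Text attribute: one bin per lowercase token of the name.
    Numeric attribute: the value's bucket and the next one, so that any two
    ages differing by at most `width` share at least one bucket.
    """
    keys = {("tok", token) for token in record["name"].lower().split()}
    bucket = record["age"] // width
    keys |= {("age", bucket), ("age", bucket + 1)}
    return keys


def candidate_pairs(records):
    """Return index pairs (i, j), i < j, of records that share at least one bin."""
    bins = defaultdict(list)
    for idx, rec in enumerate(records):
        for key in bin_keys(rec):
            bins[key].append(idx)

    pairs = set()
    for members in bins.values():
        # Only records inside the same bin are paired; the set removes
        # pairs that co-occur in more than one bin.
        pairs.update(combinations(members, 2))
    return pairs


records = [
    {"name": "John Smith", "age": 30},
    {"name": "Jon Smith", "age": 31},
    {"name": "Alice Wong", "age": 72},
]
print(candidate_pairs(records))  # {(0, 1)}
```

Here only one of the three possible pairs is produced, and only that pair would then be passed to a detailed similarity check. Sharing a bin marks a pair as a candidate rather than as a confirmed duplicate, so a final comparison step is still needed.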