EFFICIENT DUPLICATE DETECTION FOR MACHINE LEARNING DATA SETS

Bibliographic details
Main authors: Ingerman, Aleksandr Mikhaylovich; Dirac, Leo Parker
Format: Patent
Language: English
Description
Summary: At a machine learning service, a determination is made that an analysis is to be performed to detect whether at least a portion of the contents of one or more observation records of a first data set is duplicated in a second set of observation records. A duplication metric is obtained, indicative of a non-zero probability that one or more observation records of the second set are duplicates of respective observation records of the first set. In response to determining that the duplication metric meets a threshold criterion, one or more responsive actions are initiated, such as the transmission of a notification to a client of the service.
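
The flow described in the summary (obtain a duplication metric for two sets of observation records, test it against a threshold, then initiate a responsive action) can be illustrated with a minimal sketch. The record does not say how the metric is computed, so everything here is an assumption made for illustration: the metric is taken to be the fraction of second-set records whose normalized fields exactly match a first-set record, the threshold criterion is taken to be "metric >= threshold", and the names `record_fingerprint`, `duplication_metric`, `check_for_duplicates`, and the `notify` callback are hypothetical.

```python
import hashlib
from typing import Callable, Iterable, Sequence

Record = Sequence[str]


def record_fingerprint(record: Record) -> str:
    """Hash a record after a simple, assumed normalization (trim + lowercase)."""
    normalized = "\x1f".join(field.strip().lower() for field in record)
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def duplication_metric(first_set: Iterable[Record],
                       second_set: Iterable[Record]) -> float:
    """Fraction of second-set records that also appear in the first set.

    This exact-match fraction is an assumed stand-in for the duplication
    metric; the source does not define the metric.
    """
    seen = {record_fingerprint(r) for r in first_set}
    second = list(second_set)
    if not second:
        return 0.0
    duplicates = sum(1 for r in second if record_fingerprint(r) in seen)
    return duplicates / len(second)


def check_for_duplicates(first_set: Iterable[Record],
                         second_set: Iterable[Record],
                         threshold: float,
                         notify: Callable[[str], None]) -> float:
    """Compute the metric and call `notify` if it meets the threshold."""
    metric = duplication_metric(first_set, second_set)
    if metric >= threshold:  # assumed form of the threshold criterion
        notify(f"Duplication metric {metric:.2f} meets threshold {threshold:.2f}")
    return metric


if __name__ == "__main__":
    # Hypothetical training and evaluation records.
    training = [("Alice", "34", "yes"), ("Bob", "51", "no"), ("Carol", "29", "yes")]
    evaluation = [("alice ", "34", "YES"), ("Dave", "40", "no")]
    check_for_duplicates(training, evaluation, threshold=0.1, notify=print)
```

Hashing each record keeps memory use proportional to the number of distinct first-set records rather than their total size. A metric based on approximate matching or sampling would fit the same threshold-and-notify structure.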