Method and computer program product for training a pairwise classifier for use in entity resolution in large data sets
Main Author: | |
---|---|
Format: | Patent |
Language: | English |
Subjects: | |
Summary: | A collection of clusters is selected for training in an active learning workflow when using clusters to train supervised entity resolution in data sets. A collection of records is provided wherein each record in the collection has a cluster membership. A collection of record pairs is also provided, each record pair containing two distinct records from the collection of records, and each record pair having a similarity score. A collection of clusters with uncertainty is generated from the collection of records and the collection of record pairs. A subset of the collection of clusters with uncertainty is then selected using weighted sampling, wherein a function of the cluster uncertainty is used as the weight in the weighted sampling. The subset of the collection of clusters with uncertainty is the collection of clusters for training in an active learning workflow when using clusters to train supervised entity resolution in data sets. |
---|
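The selection step described in the summary can be illustrated with a minimal Python sketch. The record does not say how cluster uncertainty is computed or which function of it serves as the sampling weight, so everything specific here is an assumption made for illustration: the names `cluster_uncertainty` and `sample_clusters` are hypothetical, pair uncertainty is taken as `1 - |2s - 1|` (highest when the similarity score `s` is near 0.5), a cluster's uncertainty is the mean over the pairs that touch it, and the default weight function is the identity. Sampling without replacement uses the Efraimidis-Spirakis key method, which is one common technique and not necessarily the one claimed in the patent.

```python
import random
from collections import defaultdict


def cluster_uncertainty(cluster_of, scored_pairs):
    """Return {cluster_id: uncertainty} from record pairs with similarity scores.

    ASSUMPTION: a pair with score s has uncertainty 1 - |2s - 1|, and a
    cluster's uncertainty is the mean over all pairs touching the cluster.
    Clusters that appear in no pair get no entry.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for rec_a, rec_b, score in scored_pairs:
        pair_uncertainty = 1.0 - abs(2.0 * score - 1.0)
        # A set avoids double counting when both records share a cluster.
        for cluster in {cluster_of[rec_a], cluster_of[rec_b]}:
            totals[cluster] += pair_uncertainty
            counts[cluster] += 1
    return {c: totals[c] / counts[c] for c in totals}


def sample_clusters(uncertainty, k, weight_fn=lambda u: u, seed=None):
    """Weighted sampling of up to k clusters without replacement.

    weight_fn is the "function of the cluster uncertainty" used as the
    weight (identity here, an ASSUMPTION). Each cluster draws a key
    r ** (1 / weight) with r uniform in [0, 1); the k largest keys win.
    Clusters with non-positive weight are never selected.
    """
    rng = random.Random(seed)
    keyed = []
    for cluster, u in uncertainty.items():
        weight = weight_fn(u)
        if weight > 0:
            keyed.append((rng.random() ** (1.0 / weight), cluster))
    keyed.sort(key=lambda item: item[0], reverse=True)
    return [cluster for _, cluster in keyed[:k]]


# Toy data (hypothetical): six records in three clusters, four scored pairs.
cluster_of = {"r1": "A", "r2": "A", "r3": "B", "r4": "B", "r5": "C", "r6": "C"}
scored_pairs = [
    ("r1", "r2", 0.98),  # confident match inside cluster A
    ("r3", "r4", 0.55),  # borderline pair inside cluster B
    ("r2", "r3", 0.45),  # borderline pair across clusters A and B
    ("r5", "r6", 0.02),  # confident non-match inside cluster C
]
uncertainty = cluster_uncertainty(cluster_of, scored_pairs)
training_clusters = sample_clusters(uncertainty, k=2, seed=0)
```

In this toy data cluster B touches only borderline pairs, so it carries the largest weight and is the most likely to be drawn for review. Passing a different `weight_fn`, for example `lambda u: u ** 2`, concentrates the sample further on the most uncertain clusters.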