Instance Selection via Voronoi Neighbors for Binary Classification Tasks


Bibliographic Details
Published in: IEEE transactions on knowledge and data engineering 2024-08, Vol.36 (8), p.3921-3933
Main authors: Fu, Ying; Liu, Kaibo; Zhu, Wenbin
Format: Article
Language: English
Description
Abstract: Large datasets available in many applications have enabled the training of binary classifiers to match or even outperform humans. However, the large volume of data introduces computational burden during the training and calibration of model parameters. Since the optimal decision surface (ODS) of a classification task is often determined by a few nearby instances, a novel PDOC-V method is proposed to identify them. A Bayesian probability model is adopted to describe the ODS. An instance is close to the ODS if its probability of belonging to the positive and negative classes is similar. The probabilities of an instance are estimated by partitioning the input space into cells containing a single instance via the Voronoi diagram and inspecting its Voronoi neighbors. A randomized ray shooting algorithm is adopted to accelerate our algorithm. In many natural datasets, the spatial distribution of instances is often uneven. For such datasets, our method is more robust than existing distance-based instance selection methods. Comprehensive experiments suggest that common classifiers trained on instances selected by PDOC-V can accurately recover the ODS. Moreover, for many natural datasets, common classifiers trained on 10% to 20% of instances can achieve more than 98% of the full set performance.
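The idea in the abstract of finding Voronoi neighbors by shooting random rays and keeping instances whose class probabilities are similar can be illustrated with a small sketch. This is not the authors' PDOC-V implementation: the function names, the plain neighbor vote used as a probability estimate, and the `tol` threshold are all assumptions made for illustration. The only geometric fact it relies on is that a ray p + t·d (with unit direction d) crosses the bisector between p and q at t = |q − p|² / (2 d·(q − p)) whenever d·(q − p) > 0, so the q with the smallest t shares a cell boundary with p.

```python
import math
import random


def random_unit_vector(dim, rng):
    """Draw a direction uniformly from the unit sphere (normalized Gaussians)."""
    while True:
        v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        norm = math.sqrt(sum(x * x for x in v))
        if norm > 0.0:
            return [x / norm for x in v]


def voronoi_neighbors_by_rays(points, i, n_rays, rng):
    """Approximate the Voronoi neighbors of points[i] with random rays.

    Each ray leaves the cell of points[i] at the first bisector it crosses;
    the point defining that bisector is a Voronoi neighbor. The result is a
    subset of the true neighbors that grows with n_rays.
    """
    p = points[i]
    found = set()
    for _ in range(n_rays):
        d = random_unit_vector(len(p), rng)
        best_t, best_j = math.inf, None
        for j, q in enumerate(points):
            if j == i:
                continue
            diff = [qk - pk for qk, pk in zip(q, p)]
            proj = sum(dk * vk for dk, vk in zip(d, diff))
            if proj <= 0.0:
                continue  # the ray never reaches this bisector
            t = sum(vk * vk for vk in diff) / (2.0 * proj)
            if t < best_t:
                best_t, best_j = t, j
        if best_j is not None:
            found.add(best_j)
    return found


def select_boundary_instances(points, labels, n_rays=64, tol=0.25, seed=0):
    """Keep instances whose estimated positive-class probability is near 0.5.

    Assumption for this sketch: the probability is the share of label 1 among
    the instance and its detected Voronoi neighbors (labels are 0 or 1).
    """
    rng = random.Random(seed)
    selected = []
    for i in range(len(points)):
        nbrs = voronoi_neighbors_by_rays(points, i, n_rays, rng)
        votes = [labels[j] for j in nbrs] + [labels[i]]
        p_pos = sum(votes) / len(votes)
        if abs(p_pos - 0.5) <= tol:
            selected.append(i)
    return selected
```

For example, calling `select_boundary_instances` on six evenly spaced one-dimensional points whose first three labels are 0 and last three are 1 would be expected to keep only the two instances on either side of the class change. Each ray costs one pass over the data, so the number of rays trades neighbor coverage against running time.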
ISSN: 1041-4347, 1558-2191
DOI: 10.1109/TKDE.2023.3328952