Handling shift and irregularities in data through sequential ellipsoidal partitioning
Published in: | Data-Centric Engineering (Online) 2024-11, Vol.5 |
---|---|
Main authors: | , |
Format: | Article |
Language: | eng |
Subjects: | |
Online access: | Full text |
Summary: | Data irregularities, namely small disjuncts, class skew, imbalance, and outliers, significantly affect the performance of classifiers. Another challenge arises when new unlabelled data have different characteristics than the training data; this change is termed a data shift. In this paper, we focus on identifying small disjuncts and dataset shift using a supervised classifier, the sequential ellipsoidal partitioning classifier (SEP-C). This method iteratively partitions the dataset into minimum-volume ellipsoids that contain points of the same label, based on the idea of Reduced Convex Hulls. By allowing an ellipsoid that contains points of one label to contain a few points of the other, small disjuncts may be identified. Similarly, if new points can be accommodated only by expanding one or more of the ellipsoids, then shifts in data can be identified. Small disjuncts are distribution-based irregularities that may be considered rare but more error-prone than large disjuncts. Eliminating small disjuncts by removal or pruning is seen to affect the learning of the classifier adversely. Dataset shifts have previously been identified using Bayesian methods, confidence scores, and thresholds; these require prior knowledge of the distributions or heuristics. SEP-C is agnostic to the underlying data distributions and uses a single hyperparameter. Because it generates ellipsoidal partitions, well-known statistical tests can be performed to detect shifts in data. It is also applicable as a supervised classifier when the datasets are highly skewed and imbalanced. We demonstrate the performance of SEP-C with UCI, MNIST handwritten digit image, and synthetically generated datasets. |
---|---|
ISSN: | 2632-6736 |
DOI: | 10.1017/dce.2024.41 |
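
Illustrative sketch (not part of the catalog record or the paper): the summary describes enclosing same-label points in minimum-volume ellipsoids and flagging a shift when new points fit only if an ellipsoid is expanded. The Python below is a minimal sketch of that geometric idea for a single ellipsoid, assuming NumPy and using Khachiyan's algorithm for the minimum-volume enclosing ellipsoid. It is not SEP-C itself: it omits the iterative partitioning, the Reduced Convex Hulls, and the statistical tests, and the names `min_volume_ellipsoid`, `outside_fraction`, `tol`, and `slack` are hypothetical.

```python
import numpy as np


def min_volume_ellipsoid(points, tol=0.01, max_iter=10000):
    """Approximate minimum-volume enclosing ellipsoid (Khachiyan's algorithm).

    Returns (center, shape) such that (x - center)^T shape (x - center) <= 1
    holds, up to a small tolerance, for every input point. The points must
    affinely span the space (e.g. at least 3 non-collinear points in 2-D).
    """
    P = np.asarray(points, dtype=float)
    n, d = P.shape
    Q = np.vstack([P.T, np.ones(n)])      # lifted points, shape (d + 1, n)
    u = np.full(n, 1.0 / n)               # weights on the points, sum to 1
    for _ in range(max_iter):
        X = (Q * u) @ Q.T                 # weighted moment matrix
        M = np.sum(Q * (np.linalg.inv(X) @ Q), axis=0)  # q_i^T X^-1 q_i
        j = int(np.argmax(M))
        if M[j] <= (1.0 + tol) * (d + 1):
            break                         # every point is (almost) enclosed
        step = (M[j] - d - 1.0) / ((d + 1.0) * (M[j] - 1.0))
        u = (1.0 - step) * u              # shift weight toward the point
        u[j] += step                      # that is furthest outside
    center = P.T @ u
    cov = (P.T * u) @ P - np.outer(center, center)
    shape = np.linalg.inv(cov) / d
    return center, shape


def outside_fraction(points, center, shape, slack=0.05):
    """Fraction of points lying outside the ellipsoid.

    `slack` absorbs the approximation tolerance of the fit; it is an
    assumption of this sketch, not a parameter from the paper.
    """
    D = np.asarray(points, dtype=float) - center
    values = np.sum((D @ shape) * D, axis=1)
    return float(np.mean(values > 1.0 + slack))


# Usage: fit on synthetic "training" data, then compare unshifted and
# shifted batches of new points.
rng = np.random.default_rng(0)
train = rng.normal(size=(60, 2))
center, shape = min_volume_ellipsoid(train)
print("training points outside:", outside_fraction(train, center, shape))
print("shifted points outside:", outside_fraction(train + 100.0, center, shape))
```

A large outside fraction for new data means the ellipsoid would have to grow to accommodate it, which is the signal the summary associates with a shift. Any threshold on that fraction would be a heuristic of this sketch rather than the test used by the authors.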