A Bilevel Optimization Framework for Imbalanced Data Classification
Main Authors: | , , |
---|---|
Format: | Article |
Language: | English |
Subjects: | |
Online Access: | Order full text |
Summary: | Data rebalancing techniques, including oversampling and undersampling, are a
common approach to addressing the challenges of imbalanced data. To tackle
unresolved problems related to both oversampling and undersampling, we propose
a new undersampling approach that: (i) avoids the pitfalls of noise and overlap
caused by synthetic data and (ii) avoids the pitfall of under-fitting caused by
random undersampling. Instead of undersampling majority data randomly, our
method undersamples datapoints based on their ability to improve model loss.
Using improved model loss as a proxy measurement for classification
performance, our technique assesses a datapoint's impact on loss and rejects
those unable to improve it. In so doing, our approach rejects majority
datapoints redundant to datapoints already accepted and, thereby, finds an
optimal subset of majority training data for classification. The accept/reject
component of our algorithm is motivated by a bilevel optimization problem
uniquely formulated to identify the optimal training set we seek. Experimental
results show that our proposed technique achieves F1 scores up to 10% higher
than state-of-the-art methods. |
---|---|
DOI: | 10.48550/arxiv.2410.11171 |
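
The accept/reject idea described in the summary can be illustrated with a minimal sketch. This is not the authors' algorithm or their bilevel formulation, which this record does not reproduce. It is a simple greedy loop under stated assumptions: a plain NumPy logistic regression stands in for the model, loss is measured on a held-out validation set, and the names `fit_logreg`, `log_loss`, and `loss_guided_undersample` are hypothetical.

```python
import numpy as np


def fit_logreg(X, y, steps=300, lr=0.1):
    """Fit a plain logistic regression with batch gradient descent (assumed model)."""
    Xb = np.hstack([X, np.ones((len(X), 1))])  # append a bias column
    w = np.zeros(Xb.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-Xb @ w))
        w -= lr * Xb.T @ (p - y) / len(y)
    return w


def log_loss(w, X, y, eps=1e-12):
    """Mean binary cross-entropy of weights w on (X, y)."""
    Xb = np.hstack([X, np.ones((len(X), 1))])
    p = np.clip(1.0 / (1.0 + np.exp(-Xb @ w)), eps, 1.0 - eps)
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))


def loss_guided_undersample(X_min, X_maj, X_val, y_val, seed=0):
    """Greedy accept/reject over majority points (label 0); minority is label 1.

    A majority point is accepted only if adding it to the training set lowers
    the validation loss (an assumption; the summary only says "model loss").
    """
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X_maj))
    accepted = [order[0]]  # start with one majority point so both classes exist

    def val_loss(idx):
        X = np.vstack([X_min, X_maj[idx]])
        y = np.concatenate([np.ones(len(X_min)), np.zeros(len(idx))])
        return log_loss(fit_logreg(X, y), X_val, y_val)

    best = val_loss(accepted)
    for i in order[1:]:
        trial = val_loss(accepted + [i])
        if trial < best:  # accept: the point improves the loss
            accepted.append(i)
            best = trial
        # otherwise reject: the point is redundant or harmful
    return np.array(accepted), best
```

Calling `loss_guided_undersample(X_min, X_maj, X_val, y_val)` returns the indices of the accepted majority points and the final validation loss. The loop refits the model once per candidate, so it is meant only to make the accept/reject step concrete, not to be efficient.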