A Hybrid Sampling Approach for Imbalanced Binary and Multi-Class Data Using Clustering Analysis
Published in: | IEEE Access 2022, Vol. 10, p. 118639-118653 |
---|---|
Main authors: | , , , , |
Format: | Article |
Language: | eng |
Keywords: | |
Online access: | Full text |
Abstract: | Unequal data distribution among classes usually causes a class imbalance problem. Because of class imbalance, classification models become biased toward the majority class and misclassify the minority class. The class imbalance issue becomes more complex when it occurs in multi-class data. The most common method of handling class imbalance is data resampling, which involves either over-sampling minority class instances or under-sampling majority class instances. Under-sampling risks losing crucial information, whereas over-sampling can cause overfitting. Therefore, we propose a novel Cluster-based Hybrid Sampling for Imbalance Data (CBHSID) approach to address these issues. CBHSID calculates the mean of the data observations based on the number of classes and uses this mean as a threshold value to segregate majority and minority classes. CBHSID applies affinity propagation cluster analysis to each class to create sub-clusters and calculates the distance of each data item in a sub-cluster from the centroid mean. During under-sampling, CBHSID removes data observations that are far from the center of the sub-cluster. During over-sampling, it generates synthetic samples using data observations near the center of the sub-cluster. We compared CBHSID with a few state-of-the-art data balancing methods on 12 binary and 4 multi-class benchmark datasets. Based on Geometric-Mean (G-Mean), Recall, and F1-score, our method outperformed the other compared methods on 14 of the 16 datasets. Results also revealed that CBHSID is suitable for addressing class imbalance issues in both binary and multi-class classification. So far, we have only validated CBHSID on stationary data streams; it can be further tested on non-stationary data streams in online learning environments. |
---|---|
ISSN: | 2169-3536 |
DOI: | 10.1109/ACCESS.2022.3218463 |
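
The sampling steps described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation, and it rests on several assumptions: the threshold is read as the mean class size (total observations divided by the number of classes), sub-clusters are taken as already computed (the abstract uses affinity propagation, which is omitted here), and synthetic points are generated by interpolating between a near-centroid observation and the centroid, since the abstract does not state the generation rule. The names `split_classes`, `undersample`, `oversample`, and `n_seed` are hypothetical.

```python
from collections import Counter
from math import dist
from statistics import fmean


def split_classes(labels):
    """Split classes into majority and minority around the mean class size.

    Assumption: the threshold is total observations / number of classes.
    """
    counts = Counter(labels)
    threshold = len(labels) / len(counts)
    majority = sorted(c for c, n in counts.items() if n > threshold)
    minority = sorted(c for c, n in counts.items() if n < threshold)
    return threshold, majority, minority


def centroid(points):
    """Coordinate-wise mean of a sub-cluster."""
    return tuple(fmean(axis) for axis in zip(*points))


def undersample(cluster, keep):
    """Keep the `keep` observations closest to the sub-cluster centroid."""
    c = centroid(cluster)
    return sorted(cluster, key=lambda p: dist(p, c))[:keep]


def oversample(cluster, n_new, n_seed=2):
    """Generate `n_new` synthetic points from observations near the centroid.

    Assumption: each synthetic point lies on the segment between one of the
    `n_seed` nearest observations and the centroid.
    """
    c = centroid(cluster)
    seeds = sorted(cluster, key=lambda p: dist(p, c))[:n_seed]
    synthetic = []
    for i in range(n_new):
        seed = seeds[i % len(seeds)]
        t = (i + 1) / (n_new + 1)  # interpolation factor strictly between 0 and 1
        synthetic.append(tuple(s + t * (m - s) for s, m in zip(seed, c)))
    return synthetic


if __name__ == "__main__":
    labels = ["a"] * 6 + ["b"] * 2 + ["c"]
    print(split_classes(labels))

    sub_cluster = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (10.0, 10.0)]
    print(undersample(sub_cluster, keep=3))
    print(oversample(sub_cluster, n_new=3))
```

In this sketch a class whose size equals the threshold is left untouched, which is one possible design choice; the abstract does not say how such ties are handled.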