Binary classification on imbalanced data: A case study for birth events in Indonesia
Main authors: | , , , , , |
---|---|
Format: | Conference proceedings |
Language: | eng |
Subjects: | |
Online access: | Full text |
Summary: | Classification of imbalanced binary data remains an active research topic, especially for data-driven classification, where the target classes are often imbalanced. The study of class imbalance is therefore unavoidable. In this study, we classified birth events in the Indonesia Demographic and Health Survey (DHS) 2017 data. We implemented two machine learning classifiers, Logistic Regression (LR) and Support Vector Machine (SVM), to classify birth events for women in Indonesia. Several resampling techniques, including Undersampling, Oversampling, and Hybrid, were applied to rebalance the data distribution. The performance of each technique was evaluated using Accuracy, Sensitivity, F1-Score, Area Under Curve, and Geometric mean. A significant discrepancy in the evaluation scores was found between the methods when the LR and SVM classifiers were employed. Specifically, the evaluation scores were high for the balanced data obtained from Undersampling techniques, i.e., Nearmiss-1 for the LR classifier and NCL for the SVM classifier. The values of Accuracy, Sensitivity, F1-Score, Area Under Curve, and Geometric mean for Nearmiss-1 are 0.9859, 0.9720, 0.9858, 0.9860, and 0.9859, respectively. For NCL, the corresponding scores are 0.9829, 0.9767, 0.9882, 0.9884, and 0.9883, respectively. Overall, Undersampling techniques gave higher evaluation scores than Oversampling and Hybrid techniques for imbalanced classification of the Indonesia DHS 2017 data. |
---|---|
ISSN: | 0094-243X 1551-7616 |
DOI: | 10.1063/5.0118994 |
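
The summary names Accuracy, Sensitivity, F1-Score, and Geometric mean as evaluation metrics. As a minimal sketch of how these are commonly computed from a binary confusion matrix, the following may help; it is not the authors' code. The function name `binary_metrics` and the toy labels are hypothetical, the Geometric mean is assumed to be the usual square root of sensitivity times specificity, and Area Under Curve is omitted because it needs classifier scores rather than hard labels.

```python
import math


def binary_metrics(y_true, y_pred, positive=1):
    """Compute common imbalanced-classification metrics from hard labels.

    Hypothetical helper for illustration; not taken from the paper.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    total = tp + tn + fp + fn
    # Guard every ratio against an empty denominator.
    accuracy = (tp + tn) / total if total else 0.0
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0  # recall on the positive class
    specificity = tn / (tn + fp) if (tn + fp) else 0.0  # recall on the negative class
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    f1 = (
        2 * precision * sensitivity / (precision + sensitivity)
        if (precision + sensitivity)
        else 0.0
    )
    # Assumed definition: geometric mean of the two per-class recalls.
    g_mean = math.sqrt(sensitivity * specificity)

    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "f1": f1,
        "g_mean": g_mean,
    }


# Toy labels (made up): 4 positives, 6 negatives.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]
metrics = binary_metrics(y_true, y_pred)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Unlike Accuracy, the Geometric mean drops toward zero when either class is classified poorly, which is one reason it is often reported alongside Accuracy for imbalanced data.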