The use of resampling techniques to overcome imbalance of data on the classification algorithm

Bibliographic details
Main authors:	Aryanti, Riska; Arifin, Yoseph Tajul; Khairunas, Sayyid; Misriati, Titik; Dalis, Sopiyan; Baidawi, Taufik; Safitri, Rizky Ade; Marlina, Siti
Format:	Conference proceeding
Language:	English
Subjects:	Accuracy; Algorithms; Classification; Datasets; Machine learning; Resampling; Sampling methods; Sampling techniques; Support vector machines
Online access:	Full text

Description
Summary:	An imbalanced dataset can bias the results of accuracy testing. The problem occurs when the training data are insufficient and unevenly distributed across classes, which makes classification and prediction difficult because there is too little data to learn from. To overcome this, the data must be balanced, and one way to do so is random oversampling. The basic principle of this technique is to rebalance an imbalanced dataset with a concrete strategy. Applying a sampling technique to imbalanced data is reported to improve algorithm performance. The KNN, Naïve Bayes, SVM, J48 and Random Forest algorithms were tested with 10-fold cross-validation on the public early-stage diabetes risk prediction dataset from a hospital in Sylhet, Bangladesh; after the resampling stage, the measured results improved and reached a high level of accuracy. KNN scored highest with an accuracy of 99.4231%, followed by J48 at 99.2308%, Random Forest at 98.0769%, SVM at 96.5385% and Naïve Bayes at 90.3846%.
ISSN:	0094-243X, 1551-7616
DOI:	10.1063/5.0128424
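
The random oversampling step named in the summary can be illustrated with a minimal sketch. This is not the authors' implementation; the function name `random_oversample`, the fixed seed and the toy data are assumptions made purely for illustration. It duplicates randomly chosen minority-class rows until every class is as large as the largest one.

```python
import random
from collections import Counter


def random_oversample(rows, labels, seed=0):
    """Return copies of rows/labels in which every class has been padded,
    by sampling its own rows with replacement, up to the majority size."""
    rng = random.Random(seed)  # fixed seed (assumption) for repeatable output
    counts = Counter(labels)
    target = max(counts.values())  # size of the largest class

    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        pool = [r for r, y in zip(rows, labels) if y == cls]
        # Draw the missing number of rows with replacement from this class.
        extra = rng.choices(pool, k=target - n)
        out_rows.extend(extra)
        out_labels.extend([cls] * len(extra))
    return out_rows, out_labels


# Hypothetical toy data, not the Sylhet diabetes dataset.
rows = [[i] for i in range(10)]
labels = ["neg"] * 8 + ["pos"] * 2

bal_rows, bal_labels = random_oversample(rows, labels)
balanced_counts = Counter(bal_labels)
print(balanced_counts)
```

When resampling is combined with cross-validation, it is commonly applied to the training folds only, so that duplicated rows cannot appear in both the training and the test split.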