The effects of data balancing approaches: A case study

Detailed description

Bibliographic details
Published in:	Applied soft computing 2023-01, Vol.132, p.109853, Article 109853
Main authors:	Mooijman, Paul; Catal, Cagatay; Tekinerdogan, Bedir; Lommen, Arjen; Blokland, Marco
Format:	Article
Language:	English
Keywords:	Cattle; Classification; Feature selection; Hormone abuse detection; Imbalanced dataset; LC–MS; Missing data; Resampling; Supervised machine learning
Online access:	Full text

Description
Summary:	Imbalanced datasets adversely affect the performance of machine learning algorithms. To cope with this problem, several resampling methods have been developed recently. In this article, we present a case study approach for investigating the effects of data balancing approaches. The case study concerns the discrimination between growth hormone treated and non-treated animals using Liquid Chromatography-High Resolution Mass Spectrometry (LC-HRMS) data. Our LC-HRMS dataset contains 1241 bovine urine samples, of which only 65 specimens were from animal studies and guaranteed to contain growth-stimulating hormones, while the rest were reported to be untreated, making it a ∼5% imbalanced dataset. In this research, classification algorithms, combined with resampling strategies and dimensionality reduction methods, were investigated to find a prediction model that correctly identifies the samples of treated animals. Furthermore, to cope with the large number of missing data points in the dataset, a strategy of replacing them with random low values was applied. Our results showed that the replacement method was effective, and that the best performing models, when log transformation of the dataset was followed by Recursive Feature Elimination, were LogisticRegression combined with the oversampling algorithms SMOTE or ADASYN, GaussianProcessClassifier combined with the oversampling algorithm SMOTE, and LinearDiscriminantAnalysis.
• The model discriminating between growth-hormone treated and non-treated animals.
• New dataset using Liquid Chromatography-High Resolution Mass Spectrometry data.
• Many experiments with different algorithms including data balancing approaches.
ISSN:	1568-4946 1872-9681
DOI:	10.1016/j.asoc.2022.109853
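
The summary names three preprocessing ideas: replacing missing values with random low values, log transformation, and SMOTE-style oversampling of the minority class. The following is a minimal illustrative sketch of those ideas using only the Python standard library. It is not the authors' code: the record does not say how the low values were drawn, so the sampling range (`low_fraction`), the function names, and the toy data are all assumptions, and `smote_like` is a simplified stand-in for SMOTE rather than a library implementation.

```python
import math
import random


def fill_missing_low(rows, rng, low_fraction=0.5):
    """Replace None entries with a random value at or below the column minimum.

    Assumption: values are positive intensities, and replacements are drawn
    uniformly between low_fraction * col_min and col_min. The article's
    actual replacement rule is not given in this record.
    """
    n_cols = len(rows[0])
    col_min = [
        min(r[j] for r in rows if r[j] is not None) for j in range(n_cols)
    ]
    return [
        [
            v if v is not None
            else rng.uniform(low_fraction * col_min[j], col_min[j])
            for j, v in enumerate(r)
        ]
        for r in rows
    ]


def smote_like(minority, n_new, k, rng):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbours (the core SMOTE idea).
    Requires at least two minority samples.
    """
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> other
        synthetic.append([b + gap * (o - b) for b, o in zip(base, other)])
    return synthetic


# Toy minority-class samples (hypothetical values, not from the dataset).
rng = random.Random(0)
treated = [[5.0, None, 2.0], [4.0, 3.0, None], [6.0, 2.5, 1.5]]

filled = fill_missing_low(treated, rng)
logged = [[math.log(v) for v in row] for row in filled]
synthetic = smote_like(logged, n_new=4, k=2, rng=rng)
```

In practice, a maintained implementation such as the SMOTE and ADASYN classes in the imbalanced-learn package would normally be preferred over a hand-written interpolation, and oversampling should be applied to training folds only so that synthetic points do not leak into evaluation data.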