When is resampling beneficial for feature selection with imbalanced wide data?
Published in: Expert Systems with Applications, 2022-02, Vol. 188, p. 116015, Article 116015
Format: Article
Language: English
Online access: Full text
Abstract: This paper studies the effects that combinations of balancing and feature selection techniques have on wide data (many more attributes than instances) when different classifiers are used. For this, an extensive study is done using 14 datasets, 3 balancing strategies, and 7 feature selection algorithms. The evaluation is carried out using 5 classification algorithms, analyzing the results for different percentages of selected features, and establishing the statistical significance using Bayesian tests.
Some general conclusions of the study are that it is better to apply RUS before feature selection, while ROS and SMOTE give better results when applied afterwards. Additionally, specific results are obtained depending on the classifier used; for example, for Gaussian SVM the best performance is obtained when feature selection is done with SVM-RFE before balancing the data with RUS.
Highlights:
• Wide datasets usually suffer from unbalanced class distributions.
• Feature selection (FS) is commonly recommended for wide datasets.
• We aim to find the best combination and order to apply FS and resampling.
• 14 datasets, 5 classifiers, 7 FS, and 7 balancing strategies were tested.
• The best configuration was SVM-RFE used before RUS for the SVM-G classifier.
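As a concrete illustration of the configuration the abstract reports as best for the Gaussian SVM (SVM-RFE applied before RUS), the following is a minimal Python sketch using scikit-learn and imbalanced-learn. It is not the authors' code: the synthetic dataset, the number of selected features, and all hyperparameters are placeholder assumptions.

```python
# Sketch (assumed setup, not the paper's pipeline): feature selection first,
# then undersampling, then a Gaussian (RBF) SVM.
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from imblearn.under_sampling import RandomUnderSampler

# Wide, imbalanced toy data: many more attributes than instances.
X, y = make_classification(n_samples=100, n_features=500, n_informative=20,
                           weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# 1) SVM-RFE on the original (imbalanced) training set.
#    RFE needs an estimator exposing coef_, hence the linear SVM.
selector = RFE(SVC(kernel="linear"), n_features_to_select=50, step=0.1)
X_tr_sel = selector.fit_transform(X_tr, y_tr)
X_te_sel = selector.transform(X_te)

# 2) Random Under-Sampling (RUS) applied after feature selection,
#    only on the training split.
X_bal, y_bal = RandomUnderSampler(random_state=0).fit_resample(X_tr_sel, y_tr)

# 3) Gaussian (RBF) SVM trained on the reduced, balanced data.
clf = SVC(kernel="rbf").fit(X_bal, y_bal)
print("Test accuracy:", clf.score(X_te_sel, y_te))
```

For ROS or SMOTE, the abstract suggests the opposite ordering: resample the training data first (e.g. with imblearn.over_sampling.SMOTE) and run feature selection on the resampled set.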
ISSN: 0957-4174, 1873-6793
DOI: 10.1016/j.eswa.2021.116015