An Information Theory-Based Feature Selection Framework for Big Data Under Apache Spark

With the advent of extremely high dimensional datasets, dimensionality reduction techniques are becoming mandatory. Of the many techniques available, feature selection (FS) is of growing interest for its ability to identify both relevant features and frequently repeated instances in huge datasets. W...

Ausführliche Beschreibung

Gespeichert in:
Bibliographische Detailangaben
Veröffentlicht in:IEEE transactions on systems, man, and cybernetics. Systems man, and cybernetics. Systems, 2018-09, Vol.48 (9), p.1441-1453
Hauptverfasser: Ramirez-Gallego, Sergio, Mourino-Talin, Hector, Martinez-Rego, David, Bolon-Canedo, Veronica, Benitez, Jose Manuel, Alonso-Betanzos, Amparo, Herrera, Francisco
Format: Artikel
Sprache:eng
Schlagworte:
Online-Zugang:Volltext bestellen
Tags: Tag hinzufügen
Keine Tags, Fügen Sie den ersten Tag hinzu!
Beschreibung
Zusammenfassung:With the advent of extremely high dimensional datasets, dimensionality reduction techniques are becoming mandatory. Of the many techniques available, feature selection (FS) is of growing interest for its ability to identify both relevant features and frequently repeated instances in huge datasets. We aim to demonstrate that standard FS methods can be parallelized in big data platforms like Apache Spark so as to boost both performance and accuracy. We propose a distributed implementation of a generic FS framework that includes a broad group of well-known information theory-based methods. Experimental results for a broad set of real-world datasets show that our distributed framework is capable of rapidly dealing with ultrahigh-dimensional datasets as well as those with a huge number of samples, outperforming the sequential version in all the cases studied.
ISSN:2168-2216
2168-2232
DOI:10.1109/TSMC.2017.2670926