Big data classification using heterogeneous ensemble classifiers in Apache Spark based on MapReduce paradigm

Bibliographic details
Published in:	Expert systems with applications 2021-11, Vol.183, p.115369, Article 115369
Main authors:	Kadkhodaei, Hamidreza; Eftekhari Moghadam, Amir Masoud; Dehghan, Mehdi
Format:	Article
Language:	English
Subjects:	Algorithms; Apache Hadoop; Apache Spark; Big Data; Boosting; Classification; Classifiers; Data processing; Datasets; Ensemble classifier; MapReduce
Online access:	Full text

Description
Abstract:
• Distributed Heterogeneous Ensemble is designed for big data classification.
• Classifiers are pruned from the ensemble to increase the diversity.
• A Spark version of DHBoost is presented based on MapReduce programming paradigm.
• DHBoost outperforms the state-of-the-art ensemble classifiers in the Spark library.
In this era of big data, processing large-scale data efficiently and accurately has become a challenging problem. Ensemble classification is a type of supervised learning that uses multiple experts to generate the final output. It provides a way to classify data more accurately. Because they use multiple classifiers, ensembles are often more complicated than single classifiers, especially for big data problems. Apache Spark is a unified analytics engine for big data processing which provides a scalable framework to analyze the data. In this paper, we first extend our previous work and design a distributed heterogeneous ensemble classifier inspired by the boosting approach, which is capable of dealing with big datasets. Using heterogeneous classifiers makes it possible to have more diverse classifiers, and consequently, a more accurate classifier is obtained. Then, we present the Spark version of the proposed approach to speed up our heterogeneous ensemble classifier using the MapReduce paradigm. In order to evaluate our approach, we have applied it to seven big datasets. Extensive experimental results indicate the superiority of the proposed method over the existing ensemble algorithms implemented by Spark MLlib in terms of classification accuracy, performance, and scalability.
ISSN:	0957-4174 1873-6793
DOI:	10.1016/j.eswa.2021.115369
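
Illustrative sketch: the abstract's core idea, different kinds of learners trained on data partitions in a map step and merged into one ensemble in a reduce step, can be shown in plain Python. This is not DHBoost and not Spark code. The two toy learners, the function names (`train_nearest_mean`, `train_one_nn`, `map_partition`, `reduce_models`, `predict`), and the one-dimensional sample data are all assumptions made for illustration, and the boosting-style weighting and pruning mentioned in the abstract are left out in favor of an unweighted majority vote.

```python
from collections import Counter
from functools import reduce


def train_nearest_mean(rows):
    """Hypothetical learner type 1: pick the class whose mean is closest to x."""
    sums, counts = Counter(), Counter()
    for x, label in rows:
        sums[label] += x
        counts[label] += 1
    means = {label: sums[label] / counts[label] for label in counts}
    return lambda x: min(means, key=lambda label: abs(x - means[label]))


def train_one_nn(rows):
    """Hypothetical learner type 2: copy the label of the nearest stored point."""
    stored = list(rows)
    return lambda x: min(stored, key=lambda row: abs(x - row[0]))[1]


# Mixing learner types is what makes the ensemble heterogeneous.
LEARNERS = [train_nearest_mean, train_one_nn]


def map_partition(rows):
    """Map step: fit every learner type on one data partition."""
    return [train(rows) for train in LEARNERS]


def reduce_models(left, right):
    """Reduce step: merge the model lists produced by two partitions."""
    return left + right


def predict(ensemble, x):
    """Combine member predictions by unweighted majority vote (an assumption)."""
    votes = Counter(model(x) for model in ensemble)
    return votes.most_common(1)[0][0]


# Made-up (feature, label) rows, split into two partitions.
partitions = [
    [(0.1, 0), (0.3, 0), (0.9, 1), (1.1, 1)],
    [(0.0, 0), (0.2, 0), (0.8, 1), (1.2, 1)],
]
ensemble = reduce(reduce_models, map(map_partition, partitions))
predictions = [predict(ensemble, x) for x in (0.05, 1.0)]
```

In a Spark setting the same shape would map roughly onto `RDD.mapPartitions` for the training step and `RDD.reduce` for the merge, though how the paper's implementation is actually structured is not stated in this record.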