Big data classification using heterogeneous ensemble classifiers in Apache Spark based on MapReduce paradigm
Published in: | Expert Systems with Applications, 2021-11, Vol. 183, p. 115369, Article 115369 |
---|---|
Main authors: | , , |
Format: | Article |
Language: | English |
Summary: | • Distributed Heterogeneous Ensemble is designed for big data classification. • Classifiers are pruned from the ensemble to increase diversity. • A Spark version of DHBoost is presented, based on the MapReduce programming paradigm. • DHBoost outperforms the state-of-the-art ensemble classifiers in the Spark library.
In this era of big data, processing large-scale data efficiently and accurately has become a challenging problem. Ensemble classification is a type of supervised learning that combines multiple experts to generate the final output, which provides a way to classify data more accurately. Because they use multiple classifiers, ensembles are often more complicated than single classifiers, especially for big data problems. Apache Spark is a unified analytics engine for big data processing that provides a scalable framework for analyzing data. In this paper, we first extend our previous work and design a distributed heterogeneous ensemble classifier, inspired by the boosting approach, that is capable of dealing with big datasets. Using heterogeneous classifiers makes it possible to have more diverse classifiers and, consequently, to obtain a more accurate ensemble. Then, we present a Spark version of the proposed approach, which uses the MapReduce paradigm to speed up our heterogeneous ensemble classifier. To evaluate our approach, we applied it to seven big datasets. Extensive experimental results indicate the superiority of the proposed method over the existing ensemble algorithms implemented in Spark MLlib in terms of classification accuracy, performance, and scalability. |
---|---|
ISSN: | 0957-4174, 1873-6793 |
DOI: | 10.1016/j.eswa.2021.115369 |
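
The summary describes a boosting-inspired ensemble of heterogeneous classifiers evaluated with the MapReduce paradigm. The sketch below is only an illustration of that general idea and is not the paper's DHBoost algorithm: it omits pruning, uses plain Python `map` and `functools.reduce` as a stand-in for Spark, and all names (`threshold_rule`, `high_cutoff_rule`, `always_zero`, `PARTITIONS`) and the toy data are hypothetical. It assumes a binary problem and the common boosting-style weight log(acc / (1 - acc)).

```python
from collections import Counter
from functools import reduce
import math

# Hypothetical, deliberately tiny base classifiers on 1-D inputs.
# They differ in form to mimic a heterogeneous ensemble.
def threshold_rule(x):
    return 1 if x > 0.5 else 0

def high_cutoff_rule(x):
    return 1 if x > 0.8 else 0

def always_zero(x):
    return 0

CLASSIFIERS = [threshold_rule, high_cutoff_rule, always_zero]

# Toy labelled data (x, y), already split into two partitions.
PARTITIONS = [
    [(0.1, 0), (0.9, 1), (0.6, 1)],
    [(0.2, 0), (0.7, 1), (0.4, 0)],
]

def map_partition(partition):
    """Map step: per-classifier count of correct predictions on one partition."""
    correct = [sum(clf(x) == y for x, y in partition) for clf in CLASSIFIERS]
    return correct, len(partition)

def reduce_counts(a, b):
    """Reduce step: merge two (correct_counts, total) pairs element-wise."""
    return [p + q for p, q in zip(a[0], b[0])], a[1] + b[1]

def classifier_weights(partitions, eps=1e-6):
    """Turn global accuracies into boosting-style weights log(acc / (1 - acc))."""
    correct, total = reduce(reduce_counts, map(map_partition, partitions))
    weights = []
    for c in correct:
        # Clip accuracy away from 0 and 1 so the logarithm stays finite.
        acc = min(max(c / total, eps), 1 - eps)
        weights.append(math.log(acc / (1 - acc)))
    return weights

def predict(x, weights):
    """Weighted vote: each classifier adds its weight to the label it predicts."""
    votes = Counter()
    for clf, w in zip(CLASSIFIERS, weights):
        votes[clf(x)] += w
    return max(votes, key=votes.get)

weights = classifier_weights(PARTITIONS)
print(predict(0.6, weights), predict(0.3, weights))
```

Because the map step returns only small count vectors, the reduce step stays cheap no matter how large each partition is. In an actual Spark job the same two functions could plausibly be passed to an RDD's partition-wise map and reduce operations, though that wiring is not shown here.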