Adaptive load balancing in cluster computing environment
Published in: | The Journal of Supercomputing 2023-11, Vol.79 (17), p.20179-20207 |
---|---|
Main authors: | , , , |
Format: | Article |
Language: | English |
Abstract: | Even with high-performance computing servers and clusters available to process data, factors such as data skewness, class imbalance, and scalability in big data slow down processing. This study proposes a load-balancing framework for Apache Spark clusters that makes efficient use of cluster resources and improves overall processing performance. The proposed method first configures the Apache Spark cluster to fix the optimal number of CPU cores and amount of memory for each executor. The scheme explores the trade-off between workload balance and communication efficiency, and uses coarse-grained and fine-grained data placement strategies for dynamic task allocation. The coarse-grained strategy handles datasets with larger partitions but comparatively few executors by transforming resilient distributed datasets (RDDs) into smaller datasets. The fine-grained strategy handles datasets with a large number of executors relative to partitions: each executor is treated as a knapsack, and the resulting multidimensional knapsack problem is solved using particle swarm optimization to join the data partitions, which are then transformed into as many RDDs as there are executors. All experiments were carried out on a large-scale dataset of Amazon product reviews in JSON format. The fine-grained and coarse-grained data placement strategies were found to be 26.12% and 42% faster, respectively, in execution time than the default data placement approach. |
ISSN: | 0920-8542, 1573-0484 |
DOI: | 10.1007/s11227-023-05434-6 |
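
The abstract describes joining data partitions so that each executor receives a balanced share of the work. The record does not include the authors' particle swarm formulation, so the following is only a minimal sketch of the underlying balancing idea, with a simpler greedy heuristic (largest partition first, onto the least-loaded executor) swapped in for the knapsack solver. The function names `assign_partitions` and `executor_loads` and the partition sizes are hypothetical and chosen purely for illustration.

```python
import heapq
from typing import Dict, List


def assign_partitions(partition_sizes: List[int], num_executors: int) -> Dict[int, List[int]]:
    """Greedy balancing (NOT the paper's PSO method): take partitions from
    largest to smallest and give each one to the currently least-loaded executor."""
    if num_executors <= 0:
        raise ValueError("num_executors must be positive")

    # Min-heap of (current_load, executor_id); the least-loaded executor is on top.
    heap = [(0, executor) for executor in range(num_executors)]
    heapq.heapify(heap)

    assignment: Dict[int, List[int]] = {e: [] for e in range(num_executors)}

    # Partition indices ordered by size, largest first.
    order = sorted(range(len(partition_sizes)),
                   key=lambda i: partition_sizes[i],
                   reverse=True)

    for idx in order:
        load, executor = heapq.heappop(heap)
        assignment[executor].append(idx)
        heapq.heappush(heap, (load + partition_sizes[idx], executor))

    return assignment


def executor_loads(partition_sizes: List[int],
                   assignment: Dict[int, List[int]]) -> Dict[int, int]:
    """Total size assigned to each executor."""
    return {e: sum(partition_sizes[i] for i in idxs)
            for e, idxs in assignment.items()}


if __name__ == "__main__":
    # Hypothetical, skewed partition sizes (e.g. in MB).
    sizes = [90, 10, 40, 30, 60, 20, 50]
    plan = assign_partitions(sizes, num_executors=3)
    print("partition indices per executor:", plan)
    print("load per executor:", executor_loads(sizes, plan))
```

In a Spark job, a plan like this could in principle guide how partitions are merged before execution, but how the paper maps its solution onto RDD operations is not stated in this record.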