A Network Load Perception Based Task Scheduler for Parallel Distributed Data Processing Systems
Published in: | IEEE Transactions on Cloud Computing, 2023-04, Vol. 11 (2), pp. 1352-1364 |
---|---|
Main authors: | , , , , |
Format: | Article |
Language: | English |
Subjects: | |
Abstract: | In parallel distributed data processing frameworks such as Spark and Flink, task scheduling has a great impact on cluster performance. Although task scheduling has been proven to be an NP-complete problem, researchers have proposed many heuristic rules to obtain approximately optimal solutions. Most of them, however, ignore the fact that the resource requirements of a task change dynamically during its runtime. Over the entire lifetime of a task, CPU utilization is often lower during data transfer. On most distributed data processing platforms in particular, data transmission is time-consuming, which usually results in low overall CPU utilization. Similarly, network throughput during task computation is also low in some cases. In this article, we propose a heuristic task scheduling algorithm based on the perception of network load variation and, building on it, implement a dual-phase pipeline task scheduler (D2PTS) from the perspective of dynamic resource requirements. D2PTS aims to maximize cluster resource utilization and serves as a supplement to existing data-parallel frameworks. D2PTS divides the state of a task into two phases: network-intensive and network-free. To improve overall resource utilization, this article proposes different algorithms to estimate the execution time of the network-intensive and network-free phases, respectively. When an executing task is in the network-free phase, D2PTS can additionally schedule a new network-intensive task at the right time. Under this scheduling policy, the two tasks sharing the same CPU core are executed as a coarse-grained pipeline. This execution method starts tasks earlier and improves resource utilization. Finally, we have implemented a prototype of our model on Spark 2.4.3 and conducted a number of experiments to evaluate its performance. Experimental results show that D2PTS can not only minimize application makespan, but also improve resource utilization. |
---|---|
ISSN: | 2168-7161 2372-0018 |
DOI: | 10.1109/TCC.2021.3132627 |
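
The coarse-grained pipeline described in the abstract can be illustrated with a small Python sketch. It is not the authors' D2PTS implementation: it assumes a single CPU core, tasks whose network-intensive and network-free phase durations are known in advance, and a simplified rule that the next task starts its network phase as soon as the current task enters its network-free phase. The names `sequential_makespan` and `pipelined_makespan` are hypothetical and chosen only for this illustration.

```python
from typing import List, Tuple

# Hypothetical task model: (network_phase_duration, compute_phase_duration).
Task = Tuple[float, float]


def sequential_makespan(tasks: List[Task]) -> float:
    """One task at a time: the core idles during every network phase."""
    return sum(net + cpu for net, cpu in tasks)


def pipelined_makespan(tasks: List[Task]) -> float:
    """Two tasks share a core as a coarse-grained two-stage pipeline.

    Simplifying assumption (not taken from the paper): the next task's
    network phase begins exactly when the current task becomes network-free.
    """
    net_start = 0.0  # earliest time the next network phase may begin
    cpu_free = 0.0   # time at which the core finishes its current compute phase
    for net, cpu in tasks:
        net_end = net_start + net
        # Compute can start only after the data has arrived and the core is free.
        cpu_start = max(net_end, cpu_free)
        cpu_free = cpu_start + cpu
        # The task is now network-free, so the next task may start fetching.
        net_start = cpu_start
    return cpu_free


if __name__ == "__main__":
    demo = [(2, 3), (2, 3), (2, 3)]
    print("sequential:", sequential_makespan(demo))
    print("pipelined: ", pipelined_makespan(demo))
```

In this toy model the saving comes entirely from overlapping one task's data transfer with another task's computation. The scheduler in the article additionally estimates the duration of each phase at runtime in order to pick the launch time, which the sketch does not attempt.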