A real-time and reliable dynamic migration model for concurrent taskflow in a GPU cluster

Bibliographic details
Published in: Cluster Computing 2019-06, Vol. 22 (2), p. 585-599
Main authors: Fang, Yuling; Chen, Qingkui
Format: Article
Language: English
Description
Abstract: High-performance GPU clusters are widely used to process massive amounts of concurrent dataflow, and they place high demands on real-time performance, reliability, and flexibility. However, high computational intensity and resource utilization lead to excessively high system temperature and power consumption, and can even cause instantaneous failures. In this paper, we present a real-time and efficient dynamic taskflow migration approach (DTMA) for a computing cluster. First, we propose our basic theoretical models. Among them, the cluster communication model describes all the communication paths and calculates the communication overhead of the different migration modes. Second, based on the theoretical models and an analysis of multiple instances, we summarize our taskflow migration rules, which help to balance cluster resource utilization and improve the overall performance of the GPUs. Third, the DTMA adjusts the cluster task allocation using a performance-aware and power-consumption-aware migration approach. It does so to reduce the power consumption of a single node and to enhance system reliability by shifting the current GPU load to one or more other available GPUs. Moreover, the DTMA uses a circular queue to store resource information about available GPUs for better task scheduling. We evaluate the effect of the DTMA by analyzing power consumption, temperature, fan speed, and migration cost in different experiments. The experimental results demonstrate that the DTMA is able to improve the performance and reliability of our cluster computing system and to reduce instantaneous failures.
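The abstract mentions a circular queue that stores resource information about available GPUs, together with power-aware load shifting. The following is only a minimal sketch of that general idea, not the authors' implementation: the names `GpuInfo`, `GpuRing`, and `pick_target`, the recorded fields, and the threshold-based selection rule are all assumptions made for illustration, since the abstract does not specify them.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class GpuInfo:
    """Hypothetical resource snapshot for one GPU (fields are assumptions)."""
    gpu_id: int
    power_watts: float
    temperature_c: float


class GpuRing:
    """Fixed-capacity circular queue holding snapshots of available GPUs."""

    def __init__(self, capacity: int) -> None:
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self._slots: List[Optional[GpuInfo]] = [None] * capacity
        self._head = 0  # index of the oldest entry
        self._size = 0  # number of stored entries

    def __len__(self) -> int:
        return self._size

    def push(self, info: GpuInfo) -> None:
        """Append at the tail; overwrite the oldest entry when full."""
        cap = len(self._slots)
        tail = (self._head + self._size) % cap
        self._slots[tail] = info
        if self._size == cap:
            # The tail wrapped onto the head, so the oldest entry was replaced.
            self._head = (self._head + 1) % cap
        else:
            self._size += 1

    def pop(self) -> GpuInfo:
        """Remove and return the oldest entry."""
        if self._size == 0:
            raise IndexError("pop from empty GpuRing")
        info = self._slots[self._head]
        assert info is not None
        self._slots[self._head] = None
        self._head = (self._head + 1) % len(self._slots)
        self._size -= 1
        return info


def pick_target(ring: GpuRing, max_power: float,
                max_temp: float) -> Optional[GpuInfo]:
    """Rotate through the ring once and return the first GPU that is below
    both thresholds (an assumed selection rule), or None if none qualifies."""
    for _ in range(len(ring)):
        candidate = ring.pop()
        ring.push(candidate)  # move to the tail so it stays in rotation
        if (candidate.power_watts < max_power
                and candidate.temperature_c < max_temp):
            return candidate
    return None
```

Because each inspected GPU is moved to the tail, repeated calls to `pick_target` spread migrated load across the eligible GPUs in round-robin order rather than always choosing the same one.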
ISSN: 1386-7857, 1573-7543
DOI: 10.1007/s10586-018-2866-8