Deep Learning Workload Scheduling in GPU Datacenters: A Survey

Bibliographic Details
Published in: ACM Computing Surveys, 2024-06, Vol. 56, Issue 6, pp. 1-38, Article 146
Authors: Ye, Zhisheng; Gao, Wei; Hu, Qinghao; Sun, Peng; Wang, Xiaolin; Luo, Yingwei; Zhang, Tianwei; Wen, Yonggang
Format: Article
Language: English
Description
Abstract: Deep learning (DL) has demonstrated remarkable success in a wide variety of fields. Developing a DL model is a time-consuming and resource-intensive procedure, so dedicated GPU accelerators are commonly aggregated into GPU datacenters. An efficient scheduler design for a GPU datacenter is crucial for reducing operational cost and improving resource utilization. However, traditional approaches designed for big-data or high-performance computing workloads cannot enable DL workloads to fully utilize GPU resources. Recently, many schedulers have been proposed that are tailored to DL workloads in GPU datacenters. This article surveys existing research efforts for both training and inference workloads. We primarily present how existing schedulers facilitate the respective workloads in terms of scheduling objectives and resource utilization. Finally, we discuss several promising future research directions, including emerging DL workloads, advanced scheduling decision making, and underlying hardware resources. A more detailed summary of the surveyed papers, together with code links, can be found at our project website: https://github.com/S-Lab-System-Group/Awesome-DL-Scheduling-Papers
ISSN: 0360-0300
eISSN: 1557-7341
DOI: 10.1145/3638757