A GPU Scheduling Framework to Accelerate Hyper-Parameter Optimization in Deep Learning Clusters

Bibliographic Details
Published in: Electronics (Basel), 2021-02, Vol. 10 (3), p. 350
Main Authors: Son, Jaewon; Yoo, Yonghyuk; Kim, Khu-rai; Kim, Youngjae; Lee, Kwonyong; Park, Sungyong
Format: Article
Language: English
Abstract: This paper proposes Hermes, a container-based preemptive GPU scheduling framework for accelerating hyper-parameter optimization in deep learning (DL) clusters. Hermes accelerates hyper-parameter optimization by time-sharing GPUs between DL jobs and prioritizing jobs with more promising hyper-parameter combinations. Hermes's scheduling policy is grounded in the observation that good hyper-parameter combinations converge quickly in the early phases of training. By giving higher priority to fast-converging containers, Hermes's GPU preemption mechanism accelerates training, enabling users to find optimal hyper-parameters faster without losing a container's progress. We have implemented Hermes on Kubernetes and compared its performance against existing scheduling frameworks. Experiments show that Hermes reduces hyper-parameter optimization time by up to 4.04 times compared with previously proposed scheduling policies such as FIFO, round-robin (RR), and SLAQ, with minimal time-sharing overhead.
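
The abstract describes the policy only at a high level. Purely as an illustration of the idea, and not Hermes's actual implementation, the following minimal Python sketch ranks hyper-parameter trials by a simple early-phase convergence score (average loss decrease over a recent window) and serves the fastest-converging trial first. All names (TrialJob, convergence_rate, schedule) and the window size are hypothetical.

```python
import heapq
from dataclasses import dataclass, field
from typing import List

@dataclass(order=True)
class TrialJob:
    # sort_key is the only compared field; heapq is a min-heap,
    # so we negate the convergence score to pop the fastest trial first.
    sort_key: float = field(init=False)
    name: str = field(compare=False)
    loss_history: List[float] = field(compare=False, default_factory=list)

    def __post_init__(self):
        self.sort_key = -self.convergence_rate()

    def convergence_rate(self, window: int = 5) -> float:
        """Average per-step loss decrease over the last `window` steps.
        Fast-converging trials (promising hyper-parameters) score higher."""
        h = self.loss_history[-(window + 1):]
        if len(h) < 2:
            return 0.0
        return (h[0] - h[-1]) / (len(h) - 1)

def schedule(jobs: List[TrialJob]) -> List[str]:
    """Return trial names ordered so the fastest-converging trial runs first."""
    heap = list(jobs)
    heapq.heapify(heap)
    return [heapq.heappop(heap).name for _ in range(len(heap))]

# Example: three hyper-parameter trials with different convergence speeds.
jobs = [
    TrialJob("trial-a", loss_history=[2.0, 1.9, 1.85, 1.82, 1.80, 1.79]),
    TrialJob("trial-b", loss_history=[2.0, 1.6, 1.30, 1.10, 0.95, 0.85]),
    TrialJob("trial-c", loss_history=[2.0, 1.8, 1.65, 1.55, 1.48, 1.42]),
]
print(schedule(jobs))  # ['trial-b', 'trial-c', 'trial-a']
```

In the paper's setting, this priority would drive a preemption decision on shared GPUs (pausing a slow-converging container and resuming a fast-converging one) rather than a one-shot ordering as shown here.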
ISSN: 2079-9292
DOI: 10.3390/electronics10030350