TRANSPARENT PRE-EMPTION AND MIGRATION FOR PLANET-SCALE COMPUTER

Bibliographic Details
Inventors: RANGANATHAN, Bhalakumaaran Erode, SIVATHANU, Muthian, SHARMA, Pankaj, VISWANATHA, Srinidhi, KWATRA, Nipun, SHARMA, Vaibhav, SHUKLA, Dharma Kiritkumar, RAMJEE, Ramachandran, NEHME, Rimma Vladimirovna
Format: Patent
Language: English
Description
Abstract: The disclosure herein describes platform-level checkpointing for deep learning (DL) jobs. The checkpointing is performed by capturing two kinds of state data: (i) GPU state (device state), and (ii) CPU state (host state). The GPU state includes GPU data (e.g., model parameters, optimizer state, etc.) that is located in the GPU, and GPU context (e.g., the default stream in the GPU, various handles created by libraries such as DNN, Blas, etc.). Only a fraction of the GPU memory is copied because the checkpointing is done in a domain-aware manner: the "active" memory contains the useful data, such as model parameters. To capture this useful data, memory management is controlled so that the active parts of memory can be identified. Also, to restore the destination GPU to the same context/state, state-changing events are captured on the original GPU and replayed on the destination GPU.
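
The abstract describes two cooperating mechanisms: controlled memory management that identifies which GPU allocations are active, and a capture/replay log for state-changing context events. The Python sketch below is a minimal illustration of how such pieces could fit together, under the assumption that low-level device operations are supplied as callables; every name here (TrackedAllocator, EventLog, checkpoint, restore) is hypothetical and does not correspond to the patent's actual implementation or to any real GPU runtime API.

    class TrackedAllocator:
        """Wraps device memory allocation so that 'active' buffers
        (e.g., model parameters, optimizer state) can be identified and
        copied selectively at checkpoint time (domain-aware checkpointing)."""

        def __init__(self, device_malloc, device_free):
            self._malloc = device_malloc   # assumed cudaMalloc-like callable
            self._free = device_free       # assumed cudaFree-like callable
            self.active = {}               # handle -> size of live allocations

        def alloc(self, size):
            handle = self._malloc(size)
            self.active[handle] = size     # record allocation as active
            return handle

        def free(self, handle):
            self.active.pop(handle, None)  # freed memory is skipped at checkpoint
            self._free(handle)

    class EventLog:
        """Records state-changing context events (stream creation, library
        handle creation, etc.) on the original GPU so that they can be
        replayed in order on the destination GPU."""

        def __init__(self):
            self.events = []

        def record(self, name, *args, **kwargs):
            self.events.append((name, args, kwargs))

        def replay(self, api):
            # 'api' maps event names to the equivalent calls on the new device.
            for name, args, kwargs in self.events:
                api[name](*args, **kwargs)

    def checkpoint(allocator, copy_to_host):
        """Copy only the active fraction of GPU memory into host buffers."""
        return {h: copy_to_host(h, size) for h, size in allocator.active.items()}

    def restore(snapshot, allocator, copy_to_device, event_log, api):
        """Recreate the GPU context by replaying events, then reload active data."""
        event_log.replay(api)
        handle_map = {}
        for old_handle, host_buf in snapshot.items():
            new_handle = allocator.alloc(len(host_buf))
            copy_to_device(new_handle, host_buf)
            handle_map[old_handle] = new_handle
        return handle_map

In this sketch, only allocations still tracked as active are copied at checkpoint time, mirroring the point that merely a fraction of GPU memory holds useful data such as model parameters; replaying the recorded events before reloading that data is one way the destination GPU could be brought to an equivalent context/state.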