Check-QZP: A Lightweight Checkpoint Mechanism for Deep Learning Frameworks

Bibliographic details
Published in: Applied Sciences 2024-10, Vol. 14 (19), p. 8848
Main authors: Lee, Sangheon; Moon, Gyupin; Lee, Chanyong; Kim, Hyunwoo; An, Donghyeok; Kang, Donghyun
Format: Article
Language: English
Online access: Full text
Description
Summary: In deep learning (DL) frameworks, a checkpoint operation is widely used to store intermediate variable values (e.g., weights, biases, and gradients) on storage media. This operation reduces the time needed to resume a running machine learning (ML) model after a sudden power failure or random crash. However, the checkpoint operation can stall the training step of the running model and waste expensive hardware resources, because the GPU sits idle while the checkpoint is written. In addition, the completion time of the checkpoint operation is unpredictable in cloud server environments (e.g., AWS and Azure), because excessive I/O operations issued by other running applications interfere with the checkpoint operations in the storage stack. To address these two problems, we designed Check-QZP, which reduces the amount of data required for checkpoint operations and parallelizes execution on the CPU and GPU by exploiting the internal behavior of the training step. We implemented Check-QZP and compared it with the traditional approach in real-world multi-tenant scenarios. In this evaluation, Check-QZP outperformed the baseline in all cases in terms of overall checkpoint time and the amount of data generated by checkpoint operations, reducing them by up to 87.5% and 99.8%, respectively. Check-QZP also achieved higher training speeds than the baseline.
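The summary names two general ideas: writing less checkpoint data and overlapping checkpoint work with training. This record does not describe how Check-QZP implements either, so the following Python sketch is only a generic illustration under stated assumptions, not the authors' method. It assumes a toy model state held in plain Python lists, uses zlib compression as a stand-in for data reduction, and uses a background thread as a stand-in for CPU/GPU overlap. All function names (`write_checkpoint`, `checkpoint_async`, `load_checkpoint`, `train`, `demo`) are hypothetical.

```python
import os
import pickle
import tempfile
import threading
import zlib


def write_checkpoint(snapshot, path):
    """Serialize, compress, and atomically write a snapshot to `path`."""
    # Assumption: zlib stands in for whatever data reduction a real system uses.
    payload = zlib.compress(pickle.dumps(snapshot), 6)
    tmp_path = path + ".tmp"
    with open(tmp_path, "wb") as f:
        f.write(payload)
        f.flush()
        os.fsync(f.fileno())
    # Rename over the old file so a crash never leaves a partial checkpoint.
    os.replace(tmp_path, path)


def checkpoint_async(state, path):
    """Copy `state` in the caller's thread, then write it in the background."""
    # The copy is taken synchronously so later training steps cannot alter it.
    snapshot = {name: list(values) for name, values in state.items()}
    worker = threading.Thread(target=write_checkpoint, args=(snapshot, path))
    worker.start()
    return worker


def load_checkpoint(path):
    """Read back a checkpoint written by `write_checkpoint`."""
    with open(path, "rb") as f:
        return pickle.loads(zlib.decompress(f.read()))


def train(steps, checkpoint_every, path):
    """Toy training loop that keeps running while checkpoints are written."""
    state = {"weights": [0.0] * 1000, "biases": [0.0] * 10}
    pending = None
    for step in range(1, steps + 1):
        # Stand-in for a real training step: build new parameter values.
        state["weights"] = [w + 0.01 for w in state["weights"]]
        state["biases"] = [b - 0.01 for b in state["biases"]]
        if step % checkpoint_every == 0:
            if pending is not None:
                pending.join()  # allow at most one write in flight
            pending = checkpoint_async(state, path)
    if pending is not None:
        pending.join()  # make sure the last checkpoint is on disk
    return state


def demo():
    """Train briefly, then restore the last checkpoint from a temp directory."""
    with tempfile.TemporaryDirectory() as directory:
        path = os.path.join(directory, "model.ckpt")
        final_state = train(steps=6, checkpoint_every=3, path=path)
        restored = load_checkpoint(path)
        return final_state, restored
```

In this sketch only the in-memory copy blocks the training loop; serialization, compression, and the disk write happen in the worker thread. With a real DL framework the copy would typically be a device-to-host transfer, which this toy example does not model.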
ISSN:2076-3417
DOI:10.3390/app14198848