Towards Aggregated Asynchronous Checkpointing
Main authors: | , , , , |
---|---|
Format: | Article |
Language: | eng |
Subjects: | |
Online access: | Order full text |
Summary: | High-Performance Computing (HPC) applications need to checkpoint massive
amounts of data at scale. Multi-level asynchronous checkpoint runtimes like
VELOC (Very Low Overhead Checkpoint Strategy) are gaining popularity among
application scientists for their ability to leverage fast node-local storage
and flush independently to stable, external storage (e.g., parallel file
systems) in the background. Currently, VELOC adopts a one-file-per-process
flush strategy, which results in a large number of files being written to
external storage, thereby overwhelming metadata servers and making it difficult
to transfer and access checkpoints as a whole. This paper discusses the
viability and challenges of designing aggregation techniques for asynchronous
multi-level checkpointing. To this end, we implement and study two aggregation
strategies, examine their limitations, and propose a new aggregation strategy
designed specifically for asynchronous multi-level checkpointing. |
---|---|
DOI: | 10.48550/arxiv.2112.02289 |