Scaling Distributed Deep Learning Workloads beyond the Memory Capacity with KARMA
Format: | Article |
---|---|
Language: | English |
Abstract: | The dedicated memory of hardware accelerators can be insufficient to store
all weights and/or intermediate states of large deep learning models. Although
model parallelism is a viable approach to alleviate this memory pressure, it
requires significant modification of the source code and careful consideration
of the algorithms. An alternative solution is to use out-of-core methods
instead of, or in addition to, data parallelism. We propose a performance model
based on a concurrency analysis of out-of-core training behavior, and derive a
strategy that combines layer swapping and redundant recomputation. We achieve
an average 1.52x speedup across six different models over state-of-the-art
out-of-core methods. We also introduce the first method to solve the
challenging problem of out-of-core multi-node training, by carefully pipelining
gradient exchanges and performing the parameter updates on the host. Our
data-parallel out-of-core solution can outperform complex hybrid model
parallelism in training large models, e.g. Megatron-LM and Turing-NLG. |
---|---|
DOI: | 10.48550/arxiv.2008.11421 |
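
The combination the abstract names, layer swapping plus redundant recomputation, can be pictured with a short PyTorch sketch. This is a minimal sketch under our own assumptions, not KARMA's implementation: offloading autograd-saved activations via torch.autograd.graph.save_on_cpu stands in for the swapping, torch.utils.checkpoint supplies the recomputation, and the names OutOfCoreNet, block, and recompute_every are invented for the example.

```python
# Minimal sketch, assuming PyTorch: our illustration of the two ingredients
# named in the abstract, not KARMA's implementation. Offloading autograd-saved
# activations to host memory stands in for layer swapping, and
# torch.utils.checkpoint provides the redundant recomputation.
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint
from torch.autograd.graph import save_on_cpu

device = "cuda" if torch.cuda.is_available() else "cpu"

def block(dim):
    return nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

class OutOfCoreNet(nn.Module):
    def __init__(self, dim=256, depth=8, recompute_every=2):
        super().__init__()
        self.blocks = nn.ModuleList(block(dim) for _ in range(depth))
        self.recompute_every = recompute_every
        self.pin = torch.cuda.is_available()  # pin host buffers only if a GPU is present

    def forward(self, x):
        for i, blk in enumerate(self.blocks):
            if i % self.recompute_every == 0:
                # Redundant recompute: keep no activations for this block and
                # rerun its forward pass during backward instead.
                x = checkpoint(blk, x, use_reentrant=False)
            else:
                # Swap: tensors autograd would keep on the accelerator are
                # packed off to host memory and copied back during backward.
                with save_on_cpu(pin_memory=self.pin):
                    x = blk(x)
        return x

model = OutOfCoreNet().to(device)
loss = model(torch.randn(32, 256, device=device)).sum()
loss.backward()  # triggers the copy-backs and the recomputed forwards
```

Both paths trade extra host-device traffic or extra compute for a smaller device-memory footprint; as we read the abstract, the paper's performance model is what decides which trade is cheaper for a given layer.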
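
The multi-node scheme the abstract describes, pipelining gradient exchanges behind the ongoing backward pass and applying parameter updates on the host, might look roughly as follows. Again a hedged sketch rather than the paper's code: the single-rank gloo group, the TCP port, the plain SGD rule, and the host_params master copies are all assumptions made to keep it self-contained and runnable.

```python
# Sketch only, not the paper's code. A single-rank "gloo" group makes the
# example self-contained; in practice each node would be one rank and the
# backend would be chosen accordingly.
import torch
import torch.nn as nn
import torch.distributed as dist

dist.init_process_group(backend="gloo",
                        init_method="tcp://127.0.0.1:29501",  # arbitrary free port
                        rank=0, world_size=1)

device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 1)).to(device)

# Master weights live in host memory, so the update never touches device memory.
host_params = {n: p.detach().to("cpu").clone() for n, p in model.named_parameters()}
inflight = []  # (param name, async all-reduce handle, host-side gradient buffer)

def make_hook(name):
    def hook(grad):
        # The moment this layer's gradient is produced, stage it on the host
        # and start a non-blocking all-reduce; backward continues meanwhile.
        g_host = grad.detach().to("cpu")
        work = dist.all_reduce(g_host, async_op=True)
        inflight.append((name, work, g_host))
        return grad
    return hook

for name, p in model.named_parameters():
    p.register_hook(make_hook(name))

loss = model(torch.randn(8, 64, device=device)).sum()
loss.backward()  # hooks fire layer by layer, from the last layer backwards

lr = 0.1
params = dict(model.named_parameters())
for name, work, g_host in inflight:
    work.wait()                                        # exchange done for this layer
    host_params[name] -= lr * g_host / dist.get_world_size()  # update on the host
    params[name].data.copy_(host_params[name])         # push refreshed weights back

dist.destroy_process_group()
```

In a real run the host-side updates would themselves be overlapped with the backward of earlier layers rather than done in a final loop; the sketch keeps only the ordering and placement that the abstract describes.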