OSDP: Optimal Sharded Data Parallel for Distributed Deep Learning
| | |
|---|---|
| Main authors: | |
| Format: | Article |
| Language: | English |
| Subjects: | |
| Online access: | Order full text |
Abstract: Large-scale deep learning models deliver significant performance improvements on a variety of downstream tasks. Current data and model parallelism approaches use model replication and partitioning techniques to support the distributed training of ultra-large models. However, directly deploying these systems often leads to sub-optimal training efficiency due to complex model architectures and strict device memory constraints. In this paper, we propose Optimal Sharded Data Parallel (OSDP), an automated parallel training system that combines the advantages of both data and model parallelism. Given the model description and the device information, OSDP trades off memory consumption against hardware utilization and automatically generates the distributed computation graph that maximizes overall system throughput. In addition, OSDP introduces operator splitting to further reduce peak memory footprints during training with negligible overhead, which enables the training of larger models as well as higher throughput. Extensive experiments on multiple kinds of large-scale models demonstrate that the proposed strategy outperforms the state of the art in multiple regards.
DOI: 10.48550/arxiv.2305.09940
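To make the memory-versus-utilization trade-off described in the abstract concrete, below is a minimal, self-contained sketch of the kind of per-operator decision such a planner faces: each operator's state can either be replicated (pure data parallelism, fastest per step) or sharded across devices (lower per-device memory, extra communication), subject to a per-device memory budget. All names, cost numbers, and the greedy heuristic here are illustrative assumptions for this sketch only; they are not OSDP's actual cost model or search procedure.

```python
# Toy planner: decide per operator whether to replicate or shard its state,
# trading extra step time for reduced per-device memory. Purely illustrative;
# the real OSDP system works on the full computation graph with measured costs.
from dataclasses import dataclass


@dataclass
class Op:
    name: str
    state_mem_gb: float   # parameter + optimizer state if fully replicated
    replicated_ms: float  # estimated step time when replicated
    sharded_ms: float     # estimated step time when sharded (adds all-gather)


def plan(ops, mem_budget_gb, num_devices):
    """Greedily shard the operators with the smallest slowdown per GB of
    memory freed, until the per-device memory budget is met."""
    decisions = {op.name: "replicate" for op in ops}
    mem = sum(op.state_mem_gb for op in ops)
    # Cheapest-to-shard first: lowest extra time per GB saved.
    candidates = sorted(
        ops,
        key=lambda op: (op.sharded_ms - op.replicated_ms)
        / (op.state_mem_gb * (1 - 1 / num_devices) + 1e-9),
    )
    for op in candidates:
        if mem <= mem_budget_gb:
            break
        decisions[op.name] = "shard"
        mem -= op.state_mem_gb * (1 - 1 / num_devices)
    step_ms = sum(
        op.sharded_ms if decisions[op.name] == "shard" else op.replicated_ms
        for op in ops
    )
    return decisions, mem, step_ms


if __name__ == "__main__":
    # Hypothetical 4-operator model on 8 devices with a 10 GB per-device budget.
    ops = [
        Op("embed", 6.0, 10.0, 12.0),
        Op("attn",  4.0,  8.0, 11.0),
        Op("mlp",   8.0,  9.0, 10.5),
        Op("head",  2.0,  3.0,  4.0),
    ]
    decisions, mem, step_ms = plan(ops, mem_budget_gb=10.0, num_devices=8)
    print(decisions)  # which operators end up sharded
    print(f"peak state memory ~ {mem:.1f} GB, step ~ {step_ms:.1f} ms")
```

Running this toy script shards only the operators whose sharding costs the least extra time per gigabyte freed and leaves the rest replicated; the paper's contribution, as described in the abstract, is to make decisions of this kind automatically and optimally over the real distributed computation graph, including operator splitting to lower peak memory further.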