On Optimizing the Communication of Model Parallelism
Format: Article
Language: English
Abstract: We study a novel and important communication pattern in large-scale model-parallel deep learning (DL), which we call cross-mesh resharding. This pattern emerges when the two paradigms of model parallelism, intra-operator and inter-operator parallelism, are combined to support large models on large clusters. In cross-mesh resharding, a sharded tensor needs to be sent from a source device mesh to a destination device mesh, on which the tensor may be distributed with the same or different layouts. We formalize this as a many-to-many multicast communication problem, and show that existing approaches either are sub-optimal or do not generalize to different network topologies or tensor layouts, which result from different model architectures and parallelism strategies. We then propose two contributions to address cross-mesh resharding: an efficient broadcast-based communication system, and an "overlapping-friendly" pipeline schedule. On microbenchmarks, our overall system outperforms existing ones by up to 10x across various tensor and mesh layouts. On end-to-end training of two large models, GPT-3 and U-Transformer, we improve throughput by 10% and 50%, respectively.
DOI: 10.48550/arxiv.2211.05322
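To make the cross-mesh resharding pattern in the abstract concrete, the sketch below (not from the paper; all names such as `shard_ranges` and `resharding_plan` are illustrative assumptions) computes which source devices must send which element ranges to which destination devices when a 1-D tensor moves from one even sharding to another. The set of overlapping ranges is exactly the many-to-many multicast that the paper's broadcast-based system and pipeline schedule aim to execute efficiently; real tensors, meshes, and replicated layouts add more dimensions but follow the same overlap logic.

```python
# Hypothetical sketch (not the paper's implementation): derive the
# many-to-many communication plan for resharding a 1-D tensor of
# `length` elements from `num_src` even shards on a source mesh to
# `num_dst` even shards on a destination mesh.

def shard_ranges(length, num_shards):
    """Half-open index ranges [(start, end), ...] of an even 1-D sharding."""
    base, rem = divmod(length, num_shards)
    ranges, start = [], 0
    for i in range(num_shards):
        end = start + base + (1 if i < rem else 0)
        ranges.append((start, end))
        start = end
    return ranges

def resharding_plan(length, num_src, num_dst):
    """Map (src_device, dst_device) -> overlapping element range.

    Each entry is one transfer in the many-to-many multicast that
    cross-mesh resharding induces between the two meshes.
    """
    src = shard_ranges(length, num_src)
    dst = shard_ranges(length, num_dst)
    plan = {}
    for i, (s0, s1) in enumerate(src):
        for j, (d0, d1) in enumerate(dst):
            lo, hi = max(s0, d0), min(s1, d1)
            if lo < hi:  # shards overlap -> this slice must move across meshes
                plan[(i, j)] = (lo, hi)
    return plan

if __name__ == "__main__":
    # Example: a tensor of 12 elements, 4-way sharded on the source mesh
    # and 3-way sharded on the destination mesh. Some source devices feed
    # multiple destination devices and vice versa -- a many-to-many pattern.
    for (i, j), (lo, hi) in resharding_plan(12, 4, 3).items():
        print(f"src device {i} -> dst device {j}: elements [{lo}, {hi})")
```

In the setting described in the abstract, a destination mesh may also replicate the tensor, so a single source slice can have many receivers; grouping those transfers as broadcasts and overlapping them with pipeline-parallel compute is what the paper's communication system and schedule optimize.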