Rethinking Memory and Communication Cost for Efficient Large Language Model Training
Format: Article
Language: English
Online access: Order full text
Abstract: Recently, various distributed strategies for large language model training have been proposed. However, these methods offer only limited solutions to the trade-off between memory consumption and communication cost. In this paper, we rethink the impact of memory consumption and communication cost on the training speed of large language models and propose a memory-communication-balanced strategy set, the Partial Redundancy Optimizer (PaRO). Through fine-grained sharding, PaRO provides a comprehensive set of options that reduce the amount and frequency of inter-group communication at the cost of minor memory redundancy, thereby improving training efficiency in a variety of training scenarios. Additionally, we propose a Hierarchical Overlapping Ring (HO-Ring) communication topology to enhance communication efficiency between nodes or across switches in large language model training. Our experiments demonstrate that PaRO improves training throughput by 1.19x-2.50x over the state-of-the-art method and achieves near-linear scalability. The HO-Ring algorithm improves communication efficiency by 36.5% compared to the traditional Ring algorithm.
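
The abstract describes PaRO only at a high level. The sketch below is a minimal, illustrative toy model (not the authors' implementation) of the memory-communication trade-off behind group-wise partial-redundancy sharding: optimizer states are sharded within a group of workers rather than across all workers, so the corresponding collectives stay inside the group, at the cost of one full replica per group. The class name, fields, and cost model are assumptions introduced here for illustration.

```python
# Toy model of group-wise ("partial redundancy") sharding; NOT the PaRO code.
from dataclasses import dataclass


@dataclass
class ShardingPlan:
    world_size: int   # total number of workers in the cluster
    group_size: int   # workers per sharding group (divides world_size)
    state_bytes: int  # total optimizer-state footprint of the model, in bytes

    def per_worker_state_bytes(self) -> int:
        # States are sharded only within a group, so each group holds a full
        # replica and each worker stores 1/group_size of it.
        return self.state_bytes // self.group_size

    def replicas(self) -> int:
        # Number of full copies kept across the cluster (the memory redundancy).
        return self.world_size // self.group_size

    def ranks_per_state_collective(self) -> int:
        # Gather/scatter of sharded states involves only the group's ranks,
        # so this traffic never has to cross group boundaries.
        return self.group_size


if __name__ == "__main__":
    # Hypothetical 64-GPU job with 12 GB of optimizer states.
    fully_sharded = ShardingPlan(world_size=64, group_size=64, state_bytes=12 * 10**9)
    group_sharded = ShardingPlan(world_size=64, group_size=8, state_bytes=12 * 10**9)
    for plan in (fully_sharded, group_sharded):
        print(f"group_size={plan.group_size:2d}: "
              f"{plan.per_worker_state_bytes() / 1e6:.0f} MB/worker, "
              f"{plan.replicas()} replica(s), "
              f"{plan.ranks_per_state_collective()} ranks per state collective")
```

Smaller groups keep state collectives local and less frequent but store more replicas; the group size is the knob that trades memory for communication.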
DOI: 10.48550/arxiv.2310.06003
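
The abstract does not detail how HO-Ring is constructed. The sketch below, simulated in plain Python rather than a real communication library, illustrates only the generic hierarchical decomposition that node- and switch-aware ring topologies build on (intra-node reduce-scatter, inter-node exchange of each shard, intra-node all-gather); it is not the HO-Ring algorithm itself, and all names are hypothetical.

```python
# Minimal simulation of a two-level, hierarchical ring-style all-reduce.
# NOT the HO-Ring implementation from the paper; illustration only.
from typing import List

Vector = List[float]


def hierarchical_all_reduce(grads: List[List[Vector]]) -> List[List[Vector]]:
    """grads[node][local_rank] is that worker's gradient vector (all same length)."""
    num_nodes = len(grads)
    ranks_per_node = len(grads[0])
    dim = len(grads[0][0])
    shard = dim // ranks_per_node  # assume dim divisible by ranks_per_node

    # Step 1: intra-node reduce-scatter -> local rank r owns the node's sum of shard r.
    shards = [[
        [sum(grads[n][k][r * shard + i] for k in range(ranks_per_node)) for i in range(shard)]
        for r in range(ranks_per_node)]
        for n in range(num_nodes)]

    # Step 2: inter-node reduction of each shard; only this step crosses
    # node/switch boundaries, and it moves 1/ranks_per_node of the data per rank.
    for r in range(ranks_per_node):
        total = [sum(shards[n][r][i] for n in range(num_nodes)) for i in range(shard)]
        for n in range(num_nodes):
            shards[n][r] = list(total)

    # Step 3: intra-node all-gather -> every worker reassembles the full vector.
    return [[
        [x for r in range(ranks_per_node) for x in shards[n][r]]
        for _ in range(ranks_per_node)]
        for n in range(num_nodes)]


if __name__ == "__main__":
    grads = [[[1.0, 2.0, 3.0, 4.0], [1.0, 1.0, 1.0, 1.0]],   # node 0, 2 workers
             [[0.0, 0.0, 0.0, 0.0], [2.0, 2.0, 2.0, 2.0]]]   # node 1, 2 workers
    print(hierarchical_all_reduce(grads)[0][0])  # -> [4.0, 5.0, 6.0, 7.0]
```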