Rail-only: A Low-Cost High-Performance Network for Training LLMs with Trillion Parameters
Saved in:
Main author(s):
Format: Article
Language: eng
Subject headings:
Online access: Order full text
Abstract: This paper presents a low-cost network architecture for training large
language models (LLMs) at hyperscale. We study the optimal parallelization
strategy of LLMs and propose a novel datacenter network design tailored to
LLMs' unique communication pattern. We show that LLM training generates sparse
communication patterns in the network and, therefore, does not require
any-to-any full-bisection network to complete efficiently. As a result, our
design eliminates the spine layer in traditional GPU clusters. We name this
design a Rail-only network and demonstrate that it achieves the same training
performance while reducing the network cost by 38% to 77% and network power
consumption by 37% to 75% compared to a conventional GPU datacenter. Our
architecture also supports Mixture-of-Experts (MoE) models with all-to-all
communication through forwarding, with only 8.2% to 11.2% completion time
overhead for all-to-all traffic. We study the failure robustness of Rail-only
networks and provide insights into the performance impact of different network
and training parameters.
DOI: 10.48550/arxiv.2307.12169
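The abstract's core idea, that each GPU only needs a network path to GPUs on the same rail, plus a forwarding hop through its local high-bandwidth (HB) domain for everything else, can be pictured with a small sketch. The Python below is not from the paper: the cluster layout, the names `Gpu`, `rail_of`, and `route`, and the assumption that a GPU's position inside its HB domain doubles as its rail id are all illustrative.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Gpu:
    domain: int      # which high-bandwidth (HB) domain, e.g. an NVLink-connected group
    local_rank: int  # position inside the HB domain; treated here as the rail id


def rail_of(gpu: Gpu) -> int:
    """GPUs with the same local rank across HB domains share one rail network."""
    return gpu.local_rank


def directly_reachable(src: Gpu, dst: Gpu) -> bool:
    """With the spine layer removed, a direct path exists only inside one HB domain
    or along one rail."""
    return src.domain == dst.domain or rail_of(src) == rail_of(dst)


def route(src: Gpu, dst: Gpu) -> list[Gpu]:
    """Cross-rail, cross-domain traffic (e.g. MoE all-to-all) is forwarded:
    hop inside the local HB domain to the GPU sitting on the destination rail,
    then travel along that rail."""
    if directly_reachable(src, dst):
        return [src, dst]
    relay = Gpu(domain=src.domain, local_rank=dst.local_rank)
    return [src, relay, dst]


if __name__ == "__main__":
    a = Gpu(domain=0, local_rank=1)
    print(route(a, Gpu(domain=3, local_rank=1)))  # same rail: direct rail path
    print(route(a, Gpu(domain=3, local_rank=5)))  # different rail: one relay hop
```

The single relay hop in the last case is the forwarding step to which the abstract attributes the 8.2% to 11.2% all-to-all completion-time overhead.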