Enabling AI Supercomputers with Domain-specific Networks
Published in: IEEE Micro, November 2023, pp. 1-9
Main authors: ,
Format: Article
Language: English
Abstract: Systems designed for AI training and inference exhibit characteristics of both capacity and capability systems: they require tight coupling and strong scaling for model parallelism, as well as weak scaling for data parallelism in distributed systems. In addition, managing enormous, 100-billion-parameter language models and trillion-token data sets introduces formidable computational challenges for today's supercomputing infrastructure. Communication and computation are two intertwined aspects of parallel computing, including AI domain-specific supercomputers, and this paper explores the vital role of interconnection networks in large-scale systems. This work argues that domain-specific networks are a critical enabling technology for AI supercomputers. In particular, we advocate for flexible, low-latency interconnects capable of delivering high throughput across massive scales with tens of thousands of endpoints. Additionally, we stress the importance of reliability and resilience in handling long-duration training workloads and the demanding inference needs of domain-specific workloads.
ISSN: 0272-1732
DOI: 10.1109/MM.2023.3330079
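
The abstract's scale figures invite a quick back-of-the-envelope check. The Python sketch below estimates the per-endpoint traffic of a single data-parallel gradient synchronization for a 100-billion-parameter model, using the standard ring all-reduce cost of 2(n-1)/n times the payload. The parameter count comes from the abstract; the gradient precision (bf16), endpoint count (32,768), and link bandwidth (400 Gb/s) are illustrative assumptions, not figures from the paper.

```python
# Back-of-envelope estimate of the interconnect load imposed by
# data-parallel training of a 100-billion-parameter model.
# bf16 gradients, 32,768 endpoints, and 400 Gb/s links are assumptions
# chosen for illustration, not values taken from the paper.

PARAMS = 100e9          # 100B parameters, as cited in the abstract
BYTES_PER_GRAD = 2      # assumed bf16 gradient precision
ENDPOINTS = 32_768      # "tens of thousands of endpoints" (assumed value)
LINK_GBPS = 400         # assumed per-endpoint injection bandwidth (Gb/s)

def ring_allreduce_bytes(data_bytes: float, n: int) -> float:
    """Bytes each endpoint sends (and receives) in a ring all-reduce:
    2 * (n - 1) / n * data_bytes, approaching 2x the payload for large n."""
    return 2 * (n - 1) / n * data_bytes

grad_bytes = PARAMS * BYTES_PER_GRAD                   # ~200 GB of gradients
traffic = ring_allreduce_bytes(grad_bytes, ENDPOINTS)  # ~400 GB per endpoint
seconds = traffic * 8 / (LINK_GBPS * 1e9)              # ideal, zero-latency time

print(f"gradient payload:     {grad_bytes / 1e9:.0f} GB")
print(f"per-endpoint traffic: {traffic / 1e9:.0f} GB per synchronization")
print(f"lower-bound sync time at {LINK_GBPS} Gb/s: {seconds:.1f} s")
```

Under these assumptions, each synchronization moves roughly 400 GB per endpoint, costing several seconds of pure communication time even on an idealized 400 Gb/s link. This illustrates why communication and computation are so tightly intertwined at this scale, and why the paper advocates high-throughput, low-latency domain-specific interconnects.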