Enabling AI Supercomputers with Domain-specific Networks

Bibliographic Details
Published in: IEEE Micro, 2023-11, p. 1-9
Main authors: Abts, Dennis; Kim, John
Format: Article
Language: English
Abstract
Systems designed for AI training and inference exhibit characteristics of both capacity and capability systems: they require tight coupling and strong scaling for model parallelism as well as weak scaling for data parallelism in distributed systems. In addition, managing enormous, 100-billion-parameter language models and trillion-token data sets introduces formidable computational challenges for today's supercomputing infrastructure. Communication and computation are two intertwined aspects of parallel computing, including in AI domain-specific supercomputers, and this paper explores the vital role of interconnection networks in large-scale systems. This work argues that domain-specific networks are a critical enabling technology for AI supercomputers. In particular, we advocate for flexible, low-latency interconnects capable of delivering high throughput across massive scales with tens of thousands of endpoints. Additionally, we stress the importance of reliability and resilience in handling long-duration training workloads and the demanding inference needs of domain-specific workloads.
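
As background for the strong- and weak-scaling terms used in the abstract (these are the standard Amdahl and Gustafson definitions, not formulas taken from the article itself): with parallelizable fraction $p$ of the work and $N$ processors, strong scaling holds the total problem size fixed, while weak scaling grows it in proportion to $N$:

```latex
% Strong scaling (fixed total problem size): Amdahl's law
S_{\mathrm{strong}}(N) = \frac{1}{(1 - p) + p/N}

% Weak scaling (fixed per-processor problem size): Gustafson's law
S_{\mathrm{weak}}(N) = (1 - p) + pN
```

In these terms, model parallelism sits in the strong-scaling regime, where the serial and communication terms dominate as $N$ grows, whereas data parallelism sits in the weak-scaling regime, where speedup stays near-linear provided communication (such as gradient all-reduce) remains cheap.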
ISSN: 0272-1732
DOI: 10.1109/MM.2023.3330079