Isolated Scheduling for Distributed Training Tasks in GPU Clusters

Distributed machine learning (DML) technology makes it possible to train large neural networks in a reasonable amount of time. Meanwhile, as computing power grows much faster than network capacity, network communication has gradually become the bottleneck of DML. Current multi-tenant GPU cluster...

Bibliographic Details
Published in:	arXiv.org 2023-08
Main Authors:	Han, Xinchi; Jiang, Weihao; Cao, Peirui; Yang, Qinwei; Liu, Yunzhuo; Qi, Shuyao; Lin, Shengkai; Zhao, Shizhen
Format:	Article
Language:	English
Subjects:	Circuits; Clusters; Communication; Machine learning; Network topologies; Neural networks; Resource allocation; Resource scheduling; Switches; Task scheduling; Topology optimization; Training; User experience
Online Access:	Full text