Flexible Job Scheduling with Spatial-Temporal Compatibility for In-Network Aggregation

Bibliographic Details
Published in: IEEE Transactions on Computers, 2025, pp. 1-12
Main authors: Li, Yulong; Li, Wenxin; Du, Yuxuan; Yao, Yinan; Zhang, Song; Zhong, Linxuan; Li, Keqiu
Format: Article
Language: English
Description
Summary: In-Network Aggregation (INA) solutions represent the forefront in advancing All-Reduce, utilizing limited switch memory for efficient gradient aggregation. However, existing INA solutions primarily focus on enhancing aggregation efficiency and often overlook the efficient utilization of memory. Isolation solutions typically pre-allocate resources for each job, leading to memory wastage due to the uncontrolled use of resources. In contrast, sharing solutions encounter significant memory contention, resulting in performance degradation in a multi-tenant environment. In this paper, we propose DynaINA, a flexible job scheduler to support multi-tenant training. The core idea of DynaINA is to provide spatial and temporal compatibility between jobs. For spatial compatibility, DynaINA utilizes multiple dynamic memory pools to provide job isolation. For temporal compatibility, DynaINA employs contention-aware job scheduling to facilitate memory sharing. Furthermore, DynaINA prioritizes communication-intensive jobs, leveraging the benefits of INA to enhance overall performance in training clusters. Extensive experiments with popular vision and language models demonstrate that DynaINA reduces training time by up to 65.16% and improves switch memory utilization by up to 85.02% compared to state-of-the-art solutions in a 100 Gbps network.
ISSN: 0018-9340, 1557-9956
DOI: 10.1109/TC.2024.3523420