Flexible Job Scheduling with Spatial-Temporal Compatibility for In-Network Aggregation
In-Network Aggregation (INA) solutions represent the forefront in advancing All-Reduce, utilizing limited switch memory for efficient gradient aggregation. However, existing INA solutions primarily focus on enhancing aggregation efficiency, often overlooking the efficient utilization of memory. Isol...
Gespeichert in:
Veröffentlicht in: | IEEE transactions on computers 2025, p.1-12 |
---|---|
Hauptverfasser: | , , , , , , |
Format: | Artikel |
Sprache: | eng |
Schlagworte: | |
Online-Zugang: | Volltext bestellen |
Tags: |
Tag hinzufügen
Keine Tags, Fügen Sie den ersten Tag hinzu!
|
Zusammenfassung: | In-Network Aggregation (INA) solutions represent the forefront in advancing All-Reduce, utilizing limited switch memory for efficient gradient aggregation. However, existing INA solutions primarily focus on enhancing aggregation efficiency, often overlooking the efficient utilization of memory. Isolation solutions typically pre-allocate resources for each job, leading to memory wastage due to the uncontrolled use of resources. In contrast, the sharing solutions encounter significant memory contention, resulting in performance degradation within a multitenant environment. In this paper, we propose DynaINA, a flexible job scheduler to support multi-tenant training. The core idea of DynaINA is to provide spatial and temporal compatibility between jobs. For spatial compatibility, DynaINA utilizes multiple dynamic memory pools to provide job isolation. For temporal compatibility, DynaINA employs contention-aware job scheduling to facilitate memory sharing. Furthermore, DynaINA prioritizes communication-intensive jobs, leveraging the benefits of INA to enhance overall performance in training clusters. Extensive experiments with popular vision and language models demonstrate that DynaINA reduces training time by up to 65.16% and improves switch memory utilization by up to 85.02% compared to state-of-the-art solutions in a 100Gbps network. |
---|---|
ISSN: | 0018-9340 1557-9956 |
DOI: | 10.1109/TC.2024.3523420 |