COMPRESSION AS A SOLUTION FOR CONGESTION CONTROL ON AI WORKLOADS
Main authors: | , , |
---|---|
Format: | Patent |
Language: | eng |
Subjects: | |
Summary: | Methods and apparatus for employing selective compression to address congestion control for Artificial Intelligence (AI) workloads. Multiple interconnected compute nodes are used to perform an AI workload in a distributed environment, such as training an AI model. Periodically, such as following an epoch in which batches of training data are processed in parallel, the compute nodes exchange Tensor data (e.g., local model gradients) with one another, which may lead to network/fabric congestion. Compute nodes and/or switches in the distributed environment are configured to detect current or projected network/fabric congestion and to selectively apply variable rate compression to packets containing the Tensor data to alleviate or avoid the congestion. Compression may be selectively applied to Tensor data at source compute nodes by computing a network pause time and comparing that time to a compression compute time. Switches may selectively compress packets to be forwarded to destination compute nodes based on buffer/queue fill levels and/or other network telemetry data. |
---|
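
The two decisions described in the summary can be illustrated with a minimal Python sketch. It is not taken from the patent: the function names, the threshold values, and the mapping from queue fill level to compression level are all hypothetical, chosen only to show the shape of the comparison at a source node (pause time versus compression compute time) and of a fill-level rule at a switch.

```python
# Illustrative sketch only; names and thresholds are assumptions, not the patented method.

def should_compress_at_source(pause_time_s: float, compress_time_s: float) -> bool:
    """Compress Tensor data only if the projected network pause is longer
    than the time it would take to compress the data."""
    return pause_time_s > compress_time_s


# Hypothetical (fill_fraction, compression_level) pairs, highest threshold first.
DEFAULT_THRESHOLDS = ((0.9, 3), (0.7, 2), (0.5, 1))


def switch_compression_level(queue_fill: float, thresholds=DEFAULT_THRESHOLDS) -> int:
    """Map a buffer/queue fill fraction (0.0 to 1.0) to a variable compression level.
    Level 0 means the packet is forwarded uncompressed."""
    if not 0.0 <= queue_fill <= 1.0:
        raise ValueError("queue_fill must be between 0.0 and 1.0")
    for limit, level in thresholds:
        if queue_fill >= limit:
            return level
    return 0


if __name__ == "__main__":
    # Source node: a 10 ms projected pause outweighs a 4 ms compression cost.
    print(should_compress_at_source(0.010, 0.004))
    # Switch: a queue that is 75% full selects a mid-range level under these thresholds.
    print(switch_compression_level(0.75))
```

A real deployment would presumably derive the pause time and fill levels from network telemetry rather than from fixed inputs; the sketch leaves that part out.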