COMPRESSION AS A SOLUTION FOR CONGESTION CONTROL ON AI WORKLOADS

Methods and apparatus for employing selective compression for addressing congestion control for Artificial Intelligence (AI) workloads. Multiple interconnected compute nodes are used for performing an AI workload in a distributed environment, such as training an AI model. Periodically, such as follo...

Ausführliche Beschreibung

Gespeichert in:

Bibliographische Detailangaben
Hauptverfasser:	MILLER, Steven C, SAPIO, Amedeo, RAMACHANDRAN, Aswin
Format:	Patent
Sprache:	eng
Schlagworte:	CALCULATING COMPUTER SYSTEMS BASED ON SPECIFIC COMPUTATIONAL MODELS COMPUTING COUNTING PHYSICS
Online-Zugang:	Volltext bestellen
Tags:	Tag hinzufügen Keine Tags, Fügen Sie den ersten Tag hinzu!

Beschreibung
Zusammenfassung:	Methods and apparatus for employing selective compression for addressing congestion control for Artificial Intelligence (AI) workloads. Multiple interconnected compute nodes are used for performing an AI workload in a distributed environment, such as training an AI model. Periodically, such as following an epoch for processing batches of training data in parallel, the compute nodes exchange Tensor data (e.g., local model gradients) with one another, which may lead to network/fabric congestion. Compute nodes and/or switches in the distributed environment are configured to detect current or projected network/fabric congestion and to selectively apply variable rate compression to packets containing the Tensor data to alleviate/avoid the congestion. Tensor data may be selectively applied at source compute nodes by computing a network pause time and comparing that time to a compression compute time. Switches may selectively compress packets to be forwarded to destination compute nodes based on buffer/queue fill levels and/or other network telemetry data.