Network switch with integrated gradient aggregation for distributed machine learning

Distributed machine learning systems and other distributed computing systems are improved by embedding compute logic at the network switch level to perform collective actions, such as reduction operations, on gradients or other data processed by the nodes of the system. The switch is configured to r...

Ausführliche Beschreibung

Gespeichert in:
Bibliographische Detailangaben
Hauptverfasser: Matthews, William Brad, Agarwal, Puneet
Format: Patent
Sprache:eng
Schlagworte:
Online-Zugang:Volltext bestellen
Tags: Tag hinzufügen
Keine Tags, Fügen Sie den ersten Tag hinzu!
Beschreibung
Zusammenfassung:Distributed machine learning systems and other distributed computing systems are improved by embedding compute logic at the network switch level to perform collective actions, such as reduction operations, on gradients or other data processed by the nodes of the system. The switch is configured to recognize data units that carry data associated with a collective action that needs to be performed by the distributed system, referred to herein as "compute data," and process that data using a compute subsystem within the switch. The compute subsystem includes a compute engine that is configured to perform various operations on the compute data, such as "reduction" operations, and forward the results back to the compute nodes. The reduction operations may include, for instance, summation, averaging, bitwise operations, and so forth. In this manner, the network switch may take over some or all of the processing of the distributed system during the collective phase.