Topology aware multi-stage method for trunking communication
Main authors: | |
---|---|
Format: | Patent |
Language: | chi ; eng |
Subjects: | |
Summary: | In distributed training, a first computing node may divide an all-reduce operation into a plurality of sub-operations. The first computing node may perform a reduce-scatter sub-operation among a first set of processing units in the first computing node according to a first collective communication algorithm (the "trunking communication" of the title), perform an all-reduce sub-operation between the first set of processing units in the first computing node and a second set of processing units in a second computing node according to a second collective communication algorithm, and perform an all-gather sub-operation among the first set of processing units in the first computing node according to the first collective communication algorithm.
在分布式训练中,第一计算节点可以将全局归约运算划分为多个子运算。第一计算节点可以根据第一集群通信算法在该第一计算节点中的第一处理单元集合之间执行归约散布子运算,根据第二集群通信算法在该第一计算节点中的第一处理单元集合和第二计算节点中的第二处理单元集合之间执行全局归约子运算,并根据第一集群通信算法在该第一计算节点的第一处理单元集合之间执行全局聚集子运算。 |
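The three sub-operations named in the summary (reduce-scatter within a node, all-reduce across nodes, all-gather within a node) can be illustrated with a small single-process simulation. This is only a sketch under stated assumptions, not the patented method: the function names and the list-of-lists data layout are hypothetical, every node is assumed to hold the same number of processing units, and the vector length is assumed to be divisible by that number. It models only the data each unit ends up holding after each stage, not the communication algorithm used within or between nodes.

```python
from typing import List

Vector = List[float]
Node = List[Vector]  # hypothetical layout: one vector per processing unit


def reduce_scatter(node: Node) -> List[Vector]:
    """Stage 1 (intra-node): unit i ends up holding the sum of chunk i."""
    units = len(node)
    chunk = len(node[0]) // units  # assumes length divisible by unit count
    return [
        [sum(vec[j] for vec in node) for j in range(i * chunk, (i + 1) * chunk)]
        for i in range(units)
    ]


def inter_node_all_reduce(shards: List[List[Vector]]) -> List[List[Vector]]:
    """Stage 2 (inter-node): units with the same local index sum their shards."""
    units = len(shards[0])
    totals = [
        [sum(vals) for vals in zip(*(node_shards[i] for node_shards in shards))]
        for i in range(units)
    ]
    # Every node receives its own copy of the globally reduced shards.
    return [[list(totals[i]) for i in range(units)] for _ in shards]


def all_gather(node_shards: List[Vector]) -> Node:
    """Stage 3 (intra-node): every unit collects all reduced chunks."""
    full = [x for shard in node_shards for x in shard]
    return [list(full) for _ in node_shards]


def hierarchical_all_reduce(nodes: List[Node]) -> List[Node]:
    """Chain the three stages; each unit ends with the global sum."""
    shards = [reduce_scatter(node) for node in nodes]
    shards = inter_node_all_reduce(shards)
    return [all_gather(node_shards) for node_shards in shards]
```

In this sketch, only one chunk per unit crosses the node boundary in stage 2, which is the usual motivation for splitting an all-reduce this way when links inside a node are faster than links between nodes.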