Topology-aware multi-stage method for collective communication

Bibliographic details
Main authors: DUAN JIANJUN, WANG SHAOCHUANG, TANG LINGBO, FENG FEI, YANG JIAN, YAN LEI, YE JIANXI, PENG LIWEI, DONG JIANBO, SONG DONGYANG, RAN QIANYUAN
Format: Patent
Language: Chinese; English
Description
Summary: In distributed training, a first computing node may divide an all-reduce (global reduction) operation into a plurality of sub-operations. The first computing node may perform a reduce-scatter sub-operation among a first set of processing units in the first computing node according to a first collective communication algorithm, perform an all-reduce sub-operation between the first set of processing units in the first computing node and a second set of processing units in a second computing node according to a second collective communication algorithm, and perform an all-gather sub-operation among the first set of processing units in the first computing node according to the first collective communication algorithm.
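
The three sub-operations named in the summary can be illustrated with a small single-process simulation. This is a minimal sketch under stated assumptions, not the patented implementation: the cluster is modeled as nested Python lists, every function name is hypothetical, each stage computes its result directly rather than with any particular first or second communication algorithm (the record does not say which algorithms are used), and the pairing of processing units across nodes by local index is an assumption.

```python
from typing import List

Vector = List[float]


def split_chunks(vec: Vector, parts: int) -> List[Vector]:
    """Split vec into `parts` equal contiguous chunks."""
    if len(vec) % parts != 0:
        raise ValueError("vector length must be divisible by the number of chunks")
    size = len(vec) // parts
    return [vec[i * size:(i + 1) * size] for i in range(parts)]


def add(a: Vector, b: Vector) -> Vector:
    """Element-wise sum of two equally sized vectors."""
    return [x + y for x, y in zip(a, b)]


def reduce_scatter_in_node(node: List[Vector]) -> List[Vector]:
    """Stage 1 (intra-node): unit i ends up with the node-wide sum of chunk i."""
    units = len(node)
    chunks = [split_chunks(v, units) for v in node]  # chunks[unit][chunk_index]
    out = []
    for i in range(units):
        acc = chunks[0][i]
        for u in range(1, units):
            acc = add(acc, chunks[u][i])
        out.append(acc)
    return out


def all_reduce_across_nodes(shards: List[List[Vector]]) -> List[List[Vector]]:
    """Stage 2 (inter-node): units with the same local index sum their shards.

    Assumption: unit i on one node is paired with unit i on every other node.
    """
    nodes, units = len(shards), len(shards[0])
    totals = []
    for i in range(units):
        acc = shards[0][i]
        for n in range(1, nodes):
            acc = add(acc, shards[n][i])
        totals.append(acc)
    # Every node receives its own copy of each fully reduced shard.
    return [[list(t) for t in totals] for _ in range(nodes)]


def all_gather_in_node(node_shards: List[Vector]) -> List[Vector]:
    """Stage 3 (intra-node): every unit receives all shards, concatenated in order."""
    full = [x for shard in node_shards for x in shard]
    return [list(full) for _ in node_shards]


def hierarchical_all_reduce(cluster: List[List[Vector]]) -> List[List[Vector]]:
    """Run the three stages; cluster[node][unit] is that unit's local vector."""
    shards = [reduce_scatter_in_node(node) for node in cluster]
    shards = all_reduce_across_nodes(shards)
    return [all_gather_in_node(node_shards) for node_shards in shards]


# Two nodes with two processing units each; every unit holds a 4-element vector.
cluster = [
    [[1.0, 2.0, 3.0, 4.0], [10.0, 20.0, 30.0, 40.0]],
    [[100.0, 200.0, 300.0, 400.0], [1000.0, 2000.0, 3000.0, 4000.0]],
]
result = hierarchical_all_reduce(cluster)
print(result[0][0])  # every unit now holds the element-wise sum over all four units
```

In this sketch only the second stage crosses the node boundary, and it moves shards that are 1/k of the full vector for k units per node. That reduction in inter-node traffic is the usual motivation for splitting an all-reduce this way.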