Pre-check inspection of hardware accelerators in distributed systems

Bibliographic details
Main authors: GHAFFARKHAN ALIREZA, LIU JIANQIAO, NIZAI, ARASH, DANG HUNG-VU, ZHU JIAFAN, DONG XIANGYU, YANG KEXIN, ZHANG XIAO, KONG XIANGLING, ZU YAZHOU, ZHAO YONG, KORBASOV ALEKSANDR VADIMOVICH, TANG JIKAI, DU DAYOU
Format: Patent
Language: Chinese; English
Description
Summary: Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, are described for performing pre-checks for distributed computing systems. In one aspect, a method includes allocating a computing workload to a first subset of hardware accelerator machines, each hardware accelerator machine having one or more hardware accelerators. A pre-check is performed on the first subset to verify the functionality of each machine in the first subset prior to executing the computing workload. For each hardware accelerator machine of the first subset, a program code package is installed, including task actions based at least in part on the characteristics of the computing workload. A task action including a sequence of operations is performed on the hardware accelerator machine to determine whether the task action fails. Whenever the task action fails, the computing workload is reassigned to a second subset of the hardware accelerator machines different from the first subset.
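
The flow in the summary (allocate, pre-check, reassign on failure) can be illustrated with a minimal Python sketch. This is not the patented implementation. Every name here (`Machine`, `TaskAction`, `build_package`, `pre_check`, `schedule`), the `healthy` flag standing in for real hardware state, and the rule that derives task actions from an `accelerators_per_machine` workload field are assumptions made only for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Sequence


@dataclass
class Machine:
    """Hypothetical hardware accelerator machine."""
    name: str
    accelerators: int
    healthy: bool = True  # assumed stand-in for real hardware state


# An operation inspects a machine and returns True on success (assumption).
Operation = Callable[[Machine], bool]


@dataclass
class TaskAction:
    """A named sequence of operations, as in the summary."""
    name: str
    operations: Sequence[Operation]

    def run(self, machine: Machine) -> bool:
        # The action fails as soon as any operation in the sequence fails.
        return all(op(machine) for op in self.operations)


def build_package(workload: Dict[str, int]) -> List[TaskAction]:
    """Pick task actions from workload characteristics (illustrative rule)."""
    actions = [TaskAction("accelerator_responds", [lambda m: m.healthy])]
    need = workload.get("accelerators_per_machine", 0)
    if need > 0:
        actions.append(
            TaskAction("enough_accelerators", [lambda m: m.accelerators >= need])
        )
    return actions


def pre_check(subset: Sequence[Machine], package: Sequence[TaskAction]) -> Optional[str]:
    """Run every task action on every machine; return the first failure, if any."""
    for machine in subset:
        for action in package:
            if not action.run(machine):
                return f"{machine.name}:{action.name}"
    return None


def schedule(
    workload: Dict[str, int], subsets: Sequence[Sequence[Machine]]
) -> Optional[Sequence[Machine]]:
    """Try candidate subsets in order; move to the next one when a pre-check fails."""
    package = build_package(workload)
    for subset in subsets:
        if pre_check(subset, package) is None:
            return subset  # the workload would execute on this subset
    return None  # no candidate subset passed the pre-check
```

For example, with a first subset containing one machine marked unhealthy and a second subset of healthy machines, `schedule` returns the second subset, which mirrors the reassignment step described in the summary.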