Pre-check inspection of hardware accelerators in distributed systems
Main authors: | , , , , , , , , , , , , , |
---|---|
Format: | Patent |
Language: | chi ; eng |
Summary: | Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, are described for performing pre-checks for distributed computing systems. In one aspect, a method includes allocating a computing workload to a first subset of hardware accelerator machines, each hardware accelerator machine having one or more hardware accelerators. Before the computing workload is executed, a pre-check is performed on the first subset to verify the functionality of each machine in it. For each hardware accelerator machine of the first subset, a program code package is installed that includes task actions based at least in part on the characteristics of the computing workload. A task action, which comprises a sequence of operations, is performed on the hardware accelerator machine to determine whether the task action fails. Whenever the task action fails, the computing workload is reassigned to a second subset of the hardware accelerator machines different from the first subset. |
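
The flow in the summary (allocate, pre-check, reassign on failure) can be illustrated with a minimal Python sketch. It is not the patented implementation: every name here (`Machine`, `build_task_actions`, `precheck`, `allocate_with_precheck`, and the `min_accelerators_per_machine` workload key) is hypothetical, and installing a program code package is modeled simply as building a list of check functions from the workload's characteristics.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence


@dataclass(frozen=True)
class Machine:
    """Hypothetical stand-in for a hardware accelerator machine."""
    name: str
    healthy_accelerators: int
    total_accelerators: int


# A task action returns True when it succeeds on a machine (assumed convention).
TaskAction = Callable[[Machine], bool]


def all_accelerators_respond(machine: Machine) -> bool:
    """Hypothetical task action: every accelerator on the machine is healthy."""
    return machine.healthy_accelerators == machine.total_accelerators


def build_task_actions(workload: dict) -> List[TaskAction]:
    """Stand-in for the installed code package: choose checks from the workload."""
    actions: List[TaskAction] = [all_accelerators_respond]
    min_acc = workload.get("min_accelerators_per_machine")
    if min_acc is not None:
        # Workload-specific check derived from a workload characteristic.
        actions.append(lambda m: m.total_accelerators >= min_acc)
    return actions


def precheck(subset: Sequence[Machine], actions: Sequence[TaskAction]) -> bool:
    """Run every task action on every machine; any failure fails the subset."""
    return all(action(machine) for machine in subset for action in actions)


def allocate_with_precheck(
    workload: dict, candidate_subsets: Sequence[Sequence[Machine]]
) -> Optional[Sequence[Machine]]:
    """Return the first subset that passes the pre-check, or None if none does."""
    actions = build_task_actions(workload)
    for subset in candidate_subsets:
        if precheck(subset, actions):
            return subset  # the workload would be executed on this subset
        # Pre-check failed: fall through and try a different subset.
    return None


first = [Machine("m0", 3, 4), Machine("m1", 4, 4)]   # m0 has a faulty accelerator
second = [Machine("m2", 4, 4), Machine("m3", 4, 4)]
chosen = allocate_with_precheck({"min_accelerators_per_machine": 4}, [first, second])
print([m.name for m in chosen] if chosen else "no healthy subset")
```

In this sketch the first subset fails because one machine reports a faulty accelerator, so the workload moves to the second subset. A real system would presumably run the task actions on the machines themselves rather than inspect in-memory records.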