NPU fault-tolerant scheduling system of computer cluster

Bibliographic Details
Main Authors: CUI SHUYAO, TANG XIAOYU, QIU JIBING, TANG ZHAORONG
Format: Patent
Language: Chinese; English
Description
Summary: The invention discloses an NPU fault-tolerant scheduling system for a computer cluster. It relies on NPU devices that support hardware health queries, on NPU card groups within each node, and on a multi-node cluster topology, and it provides both node-level and system-level fault tolerance. By defining an affinity calculation and a workload state for each NPU card, it schedules tasks according to both hardware affinity and real-time load. The system also provides separate fault-tolerance mechanisms for inference tasks and training tasks, and it can reschedule in response to single-event upset (bit-flip) errors and downtime errors. Compared with traditional hardware redundancy, the system is said to significantly improve resource utilization, real-time performance, and adaptability, making it better suited to large-scale, complex computing environments.
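The summary does not give the affinity formula or the scoring rule, so the following is only a minimal sketch of how scheduling on hardware affinity plus real-time load might look. Every name here (`NpuCard`, `affinity`, `pick_card`), the three affinity levels, and the weights are assumptions made for illustration and are not taken from the patent.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class NpuCard:
    card_id: str
    node: str        # node hosting the card
    group: int       # NPU card group inside that node
    load: float      # 0.0 (idle) to 1.0 (saturated); assumed normalization
    healthy: bool    # assumed result of a hardware health query


def affinity(card: NpuCard, node: str, group: int) -> float:
    """Hypothetical affinity: same group > same node > different node."""
    if card.node == node and card.group == group:
        return 1.0
    if card.node == node:
        return 0.5
    return 0.1


def pick_card(cards: List[NpuCard], node: str, group: int,
              w_affinity: float = 0.6, w_load: float = 0.4) -> Optional[NpuCard]:
    """Return the healthy card with the best weighted score, or None.

    The score rewards high affinity and low load. The weights are
    illustrative placeholders, not values from the patent.
    """
    candidates = [c for c in cards if c.healthy]
    if not candidates:
        return None
    return max(
        candidates,
        key=lambda c: w_affinity * affinity(c, node, group)
        + w_load * (1.0 - c.load),
    )


if __name__ == "__main__":
    cluster = [
        NpuCard("a", "n1", 0, load=0.9, healthy=True),
        NpuCard("b", "n1", 1, load=0.1, healthy=True),
        NpuCard("c", "n2", 0, load=0.0, healthy=False),
    ]
    chosen = pick_card(cluster, node="n1", group=0)
    print(chosen.card_id if chosen else "no healthy card")
```

In this sketch an unhealthy card is simply excluded before scoring. The patent's distinct handling of inference and training tasks, and of single-event upsets versus downtime, is not modeled because the summary does not specify it.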