Distributed restart in a multiple processor system

Bibliographic details
Main authors: Foster, Mark; Chaiken, David
Format: Patent
Language: English
Description
Summary: Software or hardware on one node or processor in a system with multiple processors or nodes performs a cold or warm restart on one or more other processors. Fault tolerance mechanisms are provided in a computing architecture so that it can continue functioning when individual components, such as chips, printed circuit boards, network links, fans, or power supplies, fail. One aspect of the invention provides multiple processors having self-contained operating systems. Each processor preferably comprises any of: redundant network links; redundant power supplies; redundant links to input/output devices; and software fault detection, adaptation, and recovery algorithms. Once a processor in the system has failed, the system attempts to recover by restarting the failed processor. Because the preferred system is constructed as a set of self-contained processing units, it can be restarted at a number of granularities, e.g. a chip, a printed circuit board (node), a subset of processors, or an entire engine. Each of these restarts can be a cold, warm, or software restart, and can be invoked by hardware (e.g. a watchdog timer), by software (e.g. fault recovery algorithms), or by a human operator.
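
The restart options named in the summary (granularity, restart type, and invoking source) can be pictured as a small data model. The following Python sketch is illustrative only and is not taken from the patent: the names `Granularity`, `RestartKind`, `Trigger`, `RestartRequest`, and `plan_restart`, as well as the processor identifiers, are hypothetical. It assumes, in line with the summary, that the issuing processor restarts other processors and never itself.

```python
from dataclasses import dataclass
from enum import Enum


class Granularity(Enum):
    """Restart scopes listed in the summary."""
    CHIP = "chip"
    NODE = "printed circuit board (node)"
    SUBSET = "subset of processors"
    ENGINE = "entire engine"


class RestartKind(Enum):
    """Restart types listed in the summary."""
    COLD = "cold"
    WARM = "warm"
    SOFTWARE = "software"


class Trigger(Enum):
    """Sources that may invoke a restart, per the summary."""
    WATCHDOG = "hardware watchdog timer"
    RECOVERY = "software fault recovery algorithm"
    OPERATOR = "human operator"


@dataclass(frozen=True)
class RestartRequest:
    """Hypothetical record of one restart decision."""
    issuer: str
    targets: tuple
    granularity: Granularity
    kind: RestartKind
    trigger: Trigger


def plan_restart(issuer, failed, granularity, kind, trigger):
    """Build a request for `issuer` to restart the failed processors.

    Assumption of this sketch: the issuer is removed from the target
    list, because the summary describes restarting *other* processors.
    """
    targets = tuple(sorted(set(failed) - {issuer}))
    if not targets:
        raise ValueError("no remote processor to restart")
    return RestartRequest(issuer, targets, granularity, kind, trigger)


# Example: a watchdog on cpu0 requests a warm restart of one failed chip.
request = plan_restart(
    "cpu0", ["cpu3"], Granularity.CHIP, RestartKind.WARM, Trigger.WATCHDOG
)
```

Keeping the three choices as independent enumerations mirrors the summary's statement that any restart type can be combined with any granularity and any invoking source; a real system would likely restrict some combinations.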