Fault recovery on a massively parallel computer system to handle node failures without ending an executing job

A method and apparatus for fault recovery of on a parallel computer system from a soft failure without ending an executing job on a partition of nodes. In preferred embodiments a failed hardware recovery mechanism on a service node uses a heartbeat monitor to determine when a node failure occurs. Wh...

Ausführliche Beschreibung

Gespeichert in:
Bibliographische Detailangaben
Hauptverfasser: DARRINGTON DAVID L, MCCARTHY PATRICK JOSEPH, SIDELNIK ALBERT, PETERS AMANDA
Format: Patent
Sprache:eng
Schlagworte:
Online-Zugang:Volltext bestellen
Tags: Tag hinzufügen
Keine Tags, Fügen Sie den ersten Tag hinzu!