Checkpointing and rollback recovery for network of workstations

Network of workstations (NOW) now becomes one of the main trends of parallel computing. But for long-running scientific programs, it needs effective fault tolerance for its changing property. Checkpointing and rollback recovery is a solution to this problem. First the main problems upon rollback rec...

Ausführliche Beschreibung

Gespeichert in:
Bibliographische Detailangaben
Veröffentlicht in:Science China. Technological sciences 1999-04, Vol.42 (2), p.207-214
Hauptverfasser: Wang, Dongsheng, Zheng, Weimin, Wang, Dingxing, Shen, Meiming
Format: Artikel
Sprache:eng
Schlagworte:
Online-Zugang:Volltext
Tags: Tag hinzufügen
Keine Tags, Fügen Sie den ersten Tag hinzu!
Beschreibung
Zusammenfassung:Network of workstations (NOW) now becomes one of the main trends of parallel computing. But for long-running scientific programs, it needs effective fault tolerance for its changing property. Checkpointing and rollback recovery is a solution to this problem. First the main problems upon rollback recovery are discussed, the different checkpointing techniques for NOW are analyzed, and then the design and implementation of ChaRM (checkpoint-based rollback recovery and process migration) system are described. The comparison of three coordinated checkpointing systems is given.
ISSN:1006-9321
1674-7321
1862-281X
1869-1900
DOI:10.1007/BF02917117