Implementation and evaluation of a highly reliable server with QRM

Bibliographic details
Published in: Systems and Computers in Japan, 1998-04, Vol. 29 (4), pp. 11-21
Authors: Hirayama, Hideaki; Masubuchi, Yoshio; Hoshina, Satoshi; Shimada, Tomofumi; Kato, Nobuhiro; Nozaki, Masaharu
Format: Article
Language: English
Description
Abstract: To provide higher reliability for multiple open-server computers in a distributed processing environment without changing platforms or application programs, we have developed QRM (Quick Rollback Module), an add-on module that provides higher reliability based on checkpoints and a rollback mechanism. With QRM added, computers recover from most transient hardware failures of nearby processors, from some permanent processor failures in multiprocessor computers, and from most panics caused by operating system bugs. QRM consists of hardware implemented as an attachment board and software implemented as loadable modules. With QRM installed, computers recover from failures within a few seconds, at the cost of a slight overhead during normal processing. The overhead varies with the programs executed; on the TPC-C benchmark it is 10.9%. In this paper, we describe the implementation of QRM and the results of its performance evaluation. © 1998 Scripta Technica, Syst Comp Jpn, 29(4): 11–21, 1998
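The abstract says QRM builds on checkpoints and rollback. As a rough illustration of that general idea only, the following is a minimal sketch, assuming a purely in-memory application state: a snapshot is taken after each successful step, and on a detected transient fault the state is restored to the last snapshot and the step is retried. It is not QRM's implementation, which per the abstract operates on whole computers using an attachment board and loadable software modules; the names `CheckpointedState`, `TransientFault`, `run_with_rollback`, and the `max_retries` limit are hypothetical.

```python
import copy
from typing import Any, Callable, Dict


class CheckpointedState:
    """Hypothetical in-memory state with checkpoint/rollback (illustration only)."""

    def __init__(self, initial: Dict[str, Any]) -> None:
        self.state: Dict[str, Any] = copy.deepcopy(initial)
        self._checkpoint: Dict[str, Any] = copy.deepcopy(initial)

    def checkpoint(self) -> None:
        # Record a snapshot of the current state as the new recovery point.
        self._checkpoint = copy.deepcopy(self.state)

    def rollback(self) -> None:
        # Discard every change made since the last checkpoint.
        self.state = copy.deepcopy(self._checkpoint)


class TransientFault(Exception):
    """Hypothetical stand-in for a detected transient failure."""


def run_with_rollback(store: CheckpointedState,
                      step: Callable[[Dict[str, Any]], None],
                      max_retries: int = 3) -> int:
    """Run `step`; on TransientFault, roll back and retry (hypothetical policy).

    Returns the number of rollbacks performed before the step succeeded.
    """
    rollbacks = 0
    while True:
        try:
            step(store.state)
            store.checkpoint()  # the step completed, so commit it as a checkpoint
            return rollbacks
        except TransientFault:
            if rollbacks >= max_retries:
                raise  # give up: the fault does not look transient
            store.rollback()
            rollbacks += 1


# Usage: a step that partially updates state, then fails once.
store = CheckpointedState({"balance": 100})
attempts = {"n": 0}


def transfer(state: Dict[str, Any]) -> None:
    state["balance"] -= 30  # partial update that must not survive a fault
    attempts["n"] += 1
    if attempts["n"] == 1:
        raise TransientFault("simulated transient failure")


n = run_with_rollback(store, transfer)
print(store.state["balance"], n)  # prints "70 1"
```

Without the rollback, the failed first attempt would leave the balance at 70 and the retry would reduce it to 40; restoring the snapshot makes the retry start from a consistent state. The cost of taking snapshots during normal operation is the analogue of the overhead the abstract reports.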
ISSN: 0882-1666 (print), 1520-684X (online)
DOI: 10.1002/(SICI)1520-684X(199804)29:4<11::AID-SCJ2>3.0.CO;2-R