Fail-safe concurrency in the EcliPSe system


Bibliographic details
Published in:	Concurrency (Chichester, England), 1996-05, Vol.8 (4), p.283-312
Main authors:	Knop, Felipe; Rego, Vernon; Sunderam, Vaidy
Format:	Article
Language:	English
Online access:	Full text

Description
Summary:	Local or wide‐area heterogeneous workstation clusters are relatively cheap and highly effective, though inherently unstable operating environments for long‐running distributed computations. We found this to be the case in early experiments with a prototype of the EcliPSe system, a software toolkit for replicative applications on heterogeneous workstation clusters. Hardware or network failures in computations that executed for over a day were not uncommon. In this work, a variety of features for the incorporation of failure resilience in the EcliPSe system are described. Key characteristics of this fault‐tolerant system are ease of use, low state‐saving cost, system scalability and good performance. We present results of some experiments demonstrating low state‐saving overheads and small system‐recovery times, as a function of the amount of state saved.
ISSN:	1040-3108, 1096-9128
DOI:	10.1002/(SICI)1096-9128(199605)8:4<283::AID-CPE224>3.0.CO;2-#