Coping with silent and fail-stop errors at scale by combining replication and checkpointing

This paper provides a model and an analytical study of replication as a technique to cope with silent errors, as well as a mixture of both silent and fail-stop errors on large-scale platforms. Compared with fail-stop errors that are immediately detected when they occur, silent errors require a detec...

Bibliographic details
Published in: Journal of Parallel and Distributed Computing, 2018-12, Vol. 122 (C), p. 209-225
Main authors: Benoit, Anne; Cavelan, Aurélien; Cappello, Franck; Raghavan, Padma; Robert, Yves; Sun, Hongyang
Format: Article
Language: English
Online access: Full text