Hardware Fault Recovery for I/O Intensive Applications

With continued process scaling, the rate of hardware failures in commodity systems is increasing. Because these commodity systems are highly sensitive to cost, traditional solutions that employ heavy redundancy to handle such failures are no longer acceptable owing to their high associated costs. De...

Ausführliche Beschreibung

Gespeichert in:
Bibliographische Detailangaben
Veröffentlicht in:ACM transactions on architecture and code optimization 2014-10, Vol.11 (3), p.1-25
Hauptverfasser: Ramachandran, Pradeep, Hari, Siva Kumar Sastry, Li, Manlap, Adve, Sarita V.
Format: Artikel
Sprache:eng
Schlagworte:
Online-Zugang:Volltext
Tags: Tag hinzufügen
Keine Tags, Fügen Sie den ersten Tag hinzu!
Beschreibung
Zusammenfassung:With continued process scaling, the rate of hardware failures in commodity systems is increasing. Because these commodity systems are highly sensitive to cost, traditional solutions that employ heavy redundancy to handle such failures are no longer acceptable owing to their high associated costs. Detecting such faults by identifying anomalous software execution and recovering through checkpoint-and-replay is emerging as a viable low-cost alternative for future commodity systems. An important but commonly ignored aspect of such solutions is ensuring that external outputs to the system are fault-free. The outputs must be delayed until the detectors guarantee this, influencing fault-free performance. The overheads for resiliency must thus be evaluated while taking these delays into consideration; prior work has largely ignored this relationship. This article concerns recovery for I/O intensive applications from in-core faults. We present a strategy to buffer external outputs using dedicated hardware and show that checkpoint intervals previously considered as acceptable incur exorbitant overheads when hardware buffering is considered. We then present two techniques to reduce the checkpoint interval and demonstrate a practical solution that provides high resiliency while incurring low overheads.
ISSN:1544-3566
1544-3973
DOI:10.1145/2656342