Sampling + DMR: practical and low-overhead permanent fault detection

With technology scaling, manufacture-time and in-field permanent faults are becoming a fundamental problem. Multi-core architectures with spares can tolerate them by detecting and isolating faulty cores, but the required fault detection coverage becomes effectively 100% as the number of permanent fa...

Ausführliche Beschreibung

Gespeichert in:

Bibliographische Detailangaben
Hauptverfasser:	Nomura, Shuou, Sinclair, Matthew D., Ho, Chen-Han, Govindaraju, Venkatraman, de Kruijf, Marc, Sankaralingam, Karthikeyan
Format:	Tagungsbericht
Sprache:	eng
Schlagworte:	Analytical models Circuit faults Computer systems organization > Dependable and fault-tolerant systems and networks Dual-modular redundancy Fault detection Fault tolerance Mathematical model Permanent Fault Reliability Sampling Upper bound Vectors
Online-Zugang:	Volltext bestellen
Tags:	Tag hinzufügen Keine Tags, Fügen Sie den ersten Tag hinzu!

Beschreibung
Zusammenfassung:	With technology scaling, manufacture-time and in-field permanent faults are becoming a fundamental problem. Multi-core architectures with spares can tolerate them by detecting and isolating faulty cores, but the required fault detection coverage becomes effectively 100% as the number of permanent faults increases. Dual-modular redundancy(DMR) can provide 100% coverage without assuming device-level fault models, but its overhead is excessive. In this paper, we explore a simple and low-overhead mechanism we call Sampling-DMR: run in DMR mode for a small percentage (1% of the time for example) of each periodic execution window (5 million cycles for example). Although Sampling-DMR can leave some errors undetected, we argue the permanent fault coverage is 100% because it can detect all faults eventually. Sampling-DMR thus introduces a system paradigm of restricting all permanent faults' effects to small finite windows of error occurrence. We prove an ultimate upper bound exists on total missed errors and develop a probabilistic model to analyze the distribution of the number of undetected errors and detection latency. The model is validated using full gate-level fault injection experiments for an actual processor running full application software. Sampling-DMR outperforms conventional techniques in terms of fault coverage, sustains similar detection latency guarantees, and limits energy and performance overheads to less than 2%.
ISSN:	1063-6897 2575-713X
DOI:	10.1145/2000064.2000089