Runtime Techniques to Mitigate Soft Errors in Network-on-Chip (NoC) Architectures

Bibliographic details
Published in: IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 2018-03, Vol. 37 (3), p. 682-695
Main authors: Boraten, Travis; Kodi, Avinash Karanth
Format: Article
Language: English
Description
Summary: As aggressive scaling continues to push multiprocessor systems-on-chip (MPSoCs) to new limits, complex hardware structures combined with stringent area and power constraints will continue to diminish reliability. Waning reliability in integrated circuits will increase susceptibility to transient and permanent faults. There is an urgent demand for adaptive error correction coding (ECC) schemes in networks-on-chip to provide fault tolerance and improve the overall resiliency of MPSoC architectures. The goal of adaptive ECC schemes should be to maximize power savings when faults are infrequent and to increase application speedup by boosting fault coverage when faults are frequent. In this paper, we propose runtime adaptive scrubbing (RAS), a novel multilayered error correction and detection scheme with three modes of operation enabled by an area-efficient configurable encoder for encoding packets on the switch-to-switch (s2s) layer, thus preventing faults from accumulating up the network stack and onto the end-to-end layer. As fault rates fluctuate, we propose a dynamic methodology that improves fault localization and intelligently adapts fault coverage on demand to sustain graceful network degradation. RAS successfully improves network resiliency, fault localization, and fault coverage compared to traditional static s2s schemes. Simulation results demonstrate that static RAS improves network speedup by 10% for Splash-2/PARSEC benchmarks on an 8 × 8 mesh network while reducing area overhead by 14% and incurring a 6.6% power penalty on average by boosting fault tolerance when fault rates increase. Further, our dynamic RAS scheme maintains 97.88% of network performance for real applications while incurring a 20% power penalty.
ISSN: 0278-0070, 1937-4151
DOI: 10.1109/TCAD.2017.2664066
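
Illustrative sketch (not part of the catalog record, and not the authors' implementation): the summary describes an ECC scheme with three modes whose fault coverage adapts as fault rates fluctuate. The record does not say what the three modes are or how the switch is triggered, so everything below is an assumption made for illustration: the mode names, the sliding window of packet outcomes, the threshold values, and the class name AdaptiveEccController. It only shows the general idea of choosing a cheaper mode when faults are rare and a stronger one when they are frequent.

```python
from collections import deque
from enum import Enum


class EccMode(Enum):
    # Hypothetical mode names; the summary only states that there are three modes.
    LOW = 1     # assumed: cheapest coding, lowest power
    MEDIUM = 2  # assumed: intermediate coverage
    HIGH = 3    # assumed: strongest coverage, highest power


class AdaptiveEccController:
    """Picks a mode from the fault rate observed over a sliding window of packets."""

    def __init__(self, window=100, raise_at=(0.01, 0.05)):
        # raise_at holds assumed thresholds: a fault rate at or above
        # raise_at[0] selects MEDIUM, at or above raise_at[1] selects HIGH.
        self.history = deque(maxlen=window)
        self.raise_at = raise_at
        self.mode = EccMode.LOW

    def record(self, faulty):
        """Record one packet outcome (True if a fault was seen) and return the mode."""
        self.history.append(1 if faulty else 0)
        rate = sum(self.history) / len(self.history)
        if rate >= self.raise_at[1]:
            self.mode = EccMode.HIGH
        elif rate >= self.raise_at[0]:
            self.mode = EccMode.MEDIUM
        else:
            self.mode = EccMode.LOW
        return self.mode


if __name__ == "__main__":
    ctrl = AdaptiveEccController(window=10, raise_at=(0.15, 0.35))
    for outcome in [False] * 10 + [True] * 4 + [False] * 10:
        ctrl.record(outcome)
    print(ctrl.mode)
```

A per-link controller of this kind would let each switch-to-switch hop settle on its own mode. Whether RAS works this way is not stated in the record.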