Estimating Silent Data Corruption Rates Using a Two-Level Model
Main authors: | , , , , , , , , , |
---|---|
Format: | Article |
Language: | English |
Summary: | High-performance and safety-critical system architects must accurately
evaluate the application-level silent data corruption (SDC) rates of processors
due to soft errors. Such an evaluation requires tracking error propagation all the way from
particle strikes on low-level state up to the program output. Existing
approaches that rely on low-level simulations with fault injection cannot
evaluate full applications because of their slow speeds, while
application-level fault testing in accelerated particle beams is
often impractical. We present a new two-level methodology for application
resilience evaluation that overcomes these challenges. The proposed approach
decomposes application failure rate estimation into (1) identifying how
particle strikes in low-level unprotected state manifest at the
architecture-level, and (2) measuring how such architecture-level
manifestations propagate to the program output. We demonstrate the
effectiveness of this approach on GPU architectures. We also show that using
just one of the two steps can overestimate SDC rates and produce different
trends; the composition of the two is needed for accurate reliability
modeling. |
---|---|
DOI: | 10.48550/arxiv.2005.01445 |
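
The two-step decomposition described in the summary can be illustrated with a minimal sketch. It assumes the two levels compose as a sum over architecture-level manifestation types of P(manifestation | strike) × P(SDC | manifestation); the record does not give the paper's actual formulation. The class name, function name, manifestation categories, and all numbers below are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Manifestation:
    """One architecture-level error pattern (hypothetical category)."""
    name: str
    p_manifest: float  # level 1: P(pattern | particle strike on low-level state)
    p_sdc: float       # level 2: P(SDC at program output | pattern)


def sdc_probability(manifestations):
    """Compose both levels: sum of P(pattern | strike) * P(SDC | pattern)."""
    total = sum(m.p_manifest for m in manifestations)
    if total > 1.0 + 1e-9:
        raise ValueError("level-1 probabilities must sum to at most 1")
    return sum(m.p_manifest * m.p_sdc for m in manifestations)


# Illustrative numbers only; they are not taken from the paper.
patterns = [
    Manifestation("single-bit register flip", 0.10, 0.30),
    Manifestation("multi-bit register corruption", 0.02, 0.60),
    Manifestation("masked (no architectural effect)", 0.88, 0.00),
]

two_level = sdc_probability(patterns)

# Using level 2 alone, i.e. assuming every strike becomes a single-bit flip,
# ignores the masking captured by level 1.
level2_only = patterns[0].p_sdc

print(f"two-level estimate: {two_level:.3f}")
print(f"level-2-only estimate: {level2_only:.3f}")
```

With these made-up inputs the level-2-only figure is several times larger than the composed one, which mirrors the summary's point that a single step can overestimate SDC rates. Multiplying the composed probability by a raw particle-strike rate would turn it into a failure rate, again under the assumptions stated above.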