Reliability-Aware Runtime Adaption Through a Statically Generated Task Schedule

Bibliographic details
Published in: IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 2018-01, Vol. 26 (1), pp. 11-22
Main authors: Rozo, Laura, Landwehr, Aaron Myles, Yan Zheng, Chengmo Yang, Guang Gao
Format: Article
Language: English
Description
Abstract: Device scaling, the increasing number of components on a single chip, varying environmental issues, and aging effects have brought severe reliability challenges that impose tight constraints on the operation of a system. To cope with these challenges, this paper proposes a reliability-aware scheduling framework that combines static and dynamic analyses to improve the overall system resiliency to different kinds of faults (i.e., intermittent, transient, and permanent). The static analysis technique employs genetic algorithms to optimize the overall system reliability by treating the reliability level (RL) as an intermediate scheduling dimension and creating a task-to-RL mapping. This allows the RL-to-core mapping to be adapted efficiently at runtime as fault rates vary, while the task-to-RL mapping is reused unchanged. The dynamic analysis tracks the faults appearing in each core and measures the time correlation of those faults to update the RL-to-core mapping. The proposed reliability-aware framework is implemented in a state-of-the-art runtime system, the Delaware Adaptive Run-Time System, to show quantitatively the advantages of using the overall framework on existing multicore platforms. Experimental results show that the proposed technique delivers up to 30% improvement in application execution time and up to 72% improvement in faults occurring at runtime.
ISSN: 1063-8210, 1557-9999
DOI: 10.1109/TVLSI.2017.2753242
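
The two-level mapping described in the abstract (a static task-to-RL mapping that is reused, and an RL-to-core mapping that is updated at runtime from observed faults) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the names `CoreFaultTracker`, `remap_rl_to_core`, and `place_tasks` are hypothetical, the exponentially decayed fault score is only an assumed stand-in for the paper's time-correlation measure, the sketch assumes one RL per core with RL 0 as the most reliable level, and the task-to-RL mapping is written by hand rather than produced by a genetic algorithm.

```python
from dataclasses import dataclass


@dataclass
class CoreFaultTracker:
    """Hypothetical per-core fault score; recent faults weigh more than old ones."""
    half_life: float = 10.0   # assumed decay half-life, in the same unit as `now`
    score: float = 0.0
    last_time: float = 0.0

    def _decay_to(self, now: float) -> None:
        # Halve the score for every `half_life` time units that have elapsed.
        dt = max(0.0, now - self.last_time)
        self.score *= 0.5 ** (dt / self.half_life)
        self.last_time = max(now, self.last_time)

    def record_fault(self, now: float) -> None:
        self._decay_to(now)
        self.score += 1.0

    def current_score(self, now: float) -> float:
        self._decay_to(now)
        return self.score


def remap_rl_to_core(trackers: dict, now: float) -> dict:
    """Assign RL 0 (assumed most reliable) to the core with the lowest fault score."""
    ranked = sorted(trackers, key=lambda core: (trackers[core].current_score(now), core))
    return {rl: core for rl, core in enumerate(ranked)}


def place_tasks(task_to_rl: dict, rl_to_core: dict) -> dict:
    """Compose the static task-to-RL mapping with the current RL-to-core mapping."""
    return {task: rl_to_core[rl] for task, rl in task_to_rl.items()}


# Static part: fixed once, reused across all runtime remappings.
task_to_rl = {"t0": 0, "t1": 1, "t2": 2, "t3": 0}

# Dynamic part: faults observed on cores 0 and 1 change the RL-to-core mapping.
trackers = {core: CoreFaultTracker() for core in range(3)}
trackers[0].record_fault(now=1.0)
trackers[0].record_fault(now=2.0)
trackers[1].record_fault(now=1.5)

rl_to_core = remap_rl_to_core(trackers, now=3.0)
print(place_tasks(task_to_rl, rl_to_core))
```

Only `rl_to_core` is recomputed when fault behavior changes; `task_to_rl` is never touched, which is the property the abstract highlights as making runtime adaptation cheap.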