ReIPE: Recycling Idle PEs in CNN Accelerator for Vulnerable Filters Soft-Error Detection

Bibliographic Details
Published in: ACM Transactions on Architecture and Code Optimization, 2024-09, Vol. 21 (3), p. 1-26, Article 61
Main authors: Wei, Xiaohui; Wang, Chenyang; Yue, Hengshan; Tan, Jingweijia; Guan, Zeyu; Jiang, Nan; Zheng, Xinyang; Zhao, Jianpeng; Qiu, Meikang
Format: Article
Language: English
Description
Summary: To satisfy the massive computational requirements of current deep Convolutional Neural Networks (CNNs), CNN-specific accelerators are widely deployed in large-scale systems. Soft errors, caused by high-energy neutron and α-particle strikes, may lead to catastrophic failures when a CNN is deployed on an accelerator with high integration density. As CNNs become ubiquitous in mission-critical domains, ensuring the reliable execution of CNN accelerators in the presence of soft errors is increasingly essential. In this article, we propose ReIPE, which Recycles Idle Processing Elements (PEs) in the CNN accelerator for soft-error detection on vulnerable filters. Considering the error sensitivity of filters, ReIPE first performs a filter-level gradient analysis, which replaces fault injection, to estimate filter-wise error resilience quickly. Then, to maximize the reliability benefit, ReIPE combines the hardware-level idleness of the systolic array with the software-level filter-wise error resilience profile of the CNN: it preferentially loads duplicate copies of the most vulnerable filters onto the systolic array, recycling idle-column PEs for opportunistic redundant execution (error detection). Exploiting the data reuse properties of accelerators, ReIPE incorporates error detection into the accelerator's original computation flow so that errors are detected in real time. Once an error is detected, ReIPE triggers a correction round to rectify the erroneous output. Experimental results on LeNet-5, Cifar-10-CNN, AlexNet, ResNet-20, VGG-16, and ResNet-50 show that ReIPE covers 96.40% of errors while reducing the performance degradation and energy consumption of baseline dual modular redundancy by 75.06% and 67.79% on average, respectively. Moreover, to satisfy the reliability requirements of various application scenarios, ReIPE is also applicable to pruned, quantized, and Transformer-based models, and is portable to other accelerator architectures.
ISSN: 1544-3566, 1544-3973
DOI: 10.1145/3674909
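
Illustration: The summary describes three steps: ranking filters by a gradient-based vulnerability estimate, loading duplicates of the most vulnerable filters onto idle systolic-array columns, and comparing primary and duplicate outputs to detect errors. The following is a minimal software sketch of that idea, not the authors' implementation. Everything in it is an assumption made for illustration: the function names, the use of mean absolute gradient as the vulnerability score, the model of one array column as one matrix-vector product, and the way a fault is injected.

```python
import numpy as np

def rank_filters_by_vulnerability(grads_per_filter):
    """Return filter indices ordered from most to least vulnerable.

    Assumption: the mean absolute gradient of the loss with respect to a
    filter's weights stands in for its error sensitivity.
    """
    scores = np.array([np.abs(g).mean() for g in grads_per_filter])
    return [int(i) for i in np.argsort(-scores, kind="stable")]

def assign_columns(ranking, num_filters, array_cols):
    """Map filters to columns; idle columns receive duplicates of the
    most vulnerable filters (hypothetical mapping policy)."""
    if num_filters > array_cols:
        raise ValueError("this sketch handles a single mapping round only")
    idle = array_cols - num_filters
    return list(range(num_filters)) + ranking[:idle]

def run_with_detection(inputs, filters, columns, num_filters, fault=None):
    """Compute every column, optionally corrupt one value, and compare each
    duplicate column against its primary column.

    inputs: (n, k) array, filters: (k, num_filters) array.
    fault: optional (column, row, delta) tuple, a toy stand-in for a soft error.
    Returns the primary outputs and the set of filters flagged as erroneous.
    """
    outs = np.stack([inputs @ filters[:, f] for f in columns], axis=1)
    if fault is not None:
        col, row, delta = fault
        outs[row, col] += delta
    flagged = set()
    for col in range(num_filters, len(columns)):
        primary = columns[col]  # filter index equals its primary column index
        if not np.array_equal(outs[:, primary], outs[:, col]):
            flagged.add(primary)
    return outs[:, :num_filters], flagged

# Example: three filters on a four-column array leaves one idle column.
grads = [np.full((3, 3), 0.1), np.full((3, 3), 2.0), np.full((3, 3), 0.5)]
ranking = rank_filters_by_vulnerability(grads)
columns = assign_columns(ranking, num_filters=3, array_cols=4)
```

In this sketch a fault is detected only if it lands on a filter that has a duplicate, which mirrors the opportunistic, partial coverage the summary describes. A flagged filter would then be recomputed in a correction round, which is left out here.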