CAP: Communication-Aware Automated Parallelization for Deep Learning Inference on CMP Architectures

Detailed description

Bibliographic details
Published in:	IEEE Transactions on Computers, 2022-07, Vol. 71 (7), pp. 1626-1639
Main authors:	Zou, Kaiwei; Wang, Ying; Cheng, Long; Qu, Songyun; Li, Huawei; Li, Xiaowei
Format:	Article
Language:	English
Subjects:	Algorithms; Artificial intelligence; Artificial neural networks; Computer architecture; Deep learning; Energy efficiency; Inference; Kernel; Machine learning; Multicore processing; Multiprocessing; Neural networks; Noise tolerance; Parallel processing; Performance enhancement; Real time; real-time and embedded systems; Real-time systems; reinforcement learning; single-chip multiprocessors; structured sparsity; System-on-chip

Description
Abstract:	Real-time inference of deep learning models on embedded and energy-efficient devices is becoming increasingly desirable with the rapid growth of artificial intelligence at the edge. Specifically, to achieve high energy efficiency and scalability, efficient parallelization of single-pass deep neural network (DNN) inference on chip multiprocessor (CMP) architectures is urgently required by many time-sensitive applications. However, as the number of processing cores scales up and per-core performance grows much faster, on-chip inter-core data movement is prone to become a performance bottleneck for computation. To remedy this problem and further improve the performance of network inference, in this work we introduce a communication-aware DNN parallelization technique called CAP, which exploits the elasticity and noise tolerance of deep learning algorithms on CMPs. Moreover, in the hope that these studies can provide new design insights for real-time neural network inference on embedded chips, we have also evaluated the proposed approach on both multi-core Neural Network Accelerator (NNA) chips and general-purpose chip multiprocessors. Our experimental results show that, compared to baseline approaches, CAP achieves 1.12× to 1.65× system speedups and 1.14× to 2.70× higher energy efficiency across different neural networks while maintaining inference accuracy.
ISSN:	0018-9340, 1557-9956
DOI:	10.1109/TC.2021.3099688