Remarn: A Reconfigurable Multi-threaded Multi-core Accelerator for Recurrent Neural Networks

Bibliographic Details
Published in: ACM Transactions on Reconfigurable Technology and Systems, 2023-03, Vol. 16 (1), p. 1-26, Article 4
Authors: Que, Zhiqiang; Nakahara, Hiroki; Fan, Hongxiang; Li, He; Meng, Jiuxi; Tsoi, Kuen Hung; Niu, Xinyu; Nurvitadhi, Eriko; Luk, Wayne
Format: Article
Language: English
Online access: Full text

Abstract: This work introduces Remarn, a reconfigurable multi-threaded multi-core accelerator supporting both spatial and temporal co-execution of Recurrent Neural Network (RNN) inferences. It increases the processing capability and quality of service of cloud-based neural processing units (NPUs) by improving their hardware utilization and reducing design latency, through two innovations. First, a custom coarse-grained multi-threaded RNN/Long Short-Term Memory (LSTM) hardware architecture that switches tasks among threads when an RNN compute engine encounters a data hazard. Second, the partitioning of this architecture into multiple full-fledged sub-accelerator cores, enabling spatial co-execution of multiple RNN/LSTM inferences. Together, these innovations exploit the available parallelism to increase runtime hardware utilization and boost design throughput. Evaluation shows that a dual-threaded quad-core Remarn NPU achieves 2.91 times higher performance while occupying only 5.0% more area than a single-threaded one on a Stratix 10 FPGA. Compared with a Tesla V100 GPU implementation, our design achieves 6.5 times better performance and 15.6 times higher power efficiency, showing that our approach yields high-performance, energy-efficient FPGA-based multi-RNN inference designs for datacenters.
ISSN: 1936-7406, 1936-7414
DOI: 10.1145/3534969
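
To illustrate the coarse-grained multi-threading idea described in the abstract, here is a minimal cycle-level sketch in Python. It is our own toy model under stated assumptions, not the paper's actual hardware design: PIPELINE_DEPTH, TIMESTEPS, and the thread-selection policy are all assumed values. Each thread is an independent LSTM inference whose step t depends on the hidden state of step t-1, which only becomes available several cycles after the step issues; a single-threaded engine stalls on this recurrent data hazard, while a multi-threaded engine switches to another ready thread instead.

```python
# Toy cycle-level model of coarse-grained multi-threading on one RNN engine.
# Assumptions (not from the paper): a 4-cycle pipeline before a step's hidden
# state is ready, 8 LSTM timesteps per inference, first-ready thread selection.

PIPELINE_DEPTH = 4   # cycles until an issued step's hidden state is ready
TIMESTEPS = 8        # LSTM timesteps per inference thread

def run(num_threads: int) -> tuple[int, int]:
    """Return (total_cycles, busy_cycles) for num_threads inferences."""
    next_step = [0] * num_threads   # next timestep each thread will issue
    ready_at = [0] * num_threads    # cycle when each thread's hazard clears
    cycle = busy = 0
    while any(s < TIMESTEPS for s in next_step):
        for t in range(num_threads):
            # Switch to the first thread whose recurrent dependency is ready.
            if next_step[t] < TIMESTEPS and ready_at[t] <= cycle:
                next_step[t] += 1                    # issue one LSTM step
                ready_at[t] = cycle + PIPELINE_DEPTH # hidden state in flight
                busy += 1
                break
        cycle += 1   # if no thread was ready, the engine stalls this cycle
    return cycle, busy

for n in (1, 2, 4):
    cycles, busy = run(n)
    print(f"{n} thread(s): {cycles} cycles for {n * TIMESTEPS} LSTM steps, "
          f"engine utilization {busy / cycles:.0%}")
```

Running the sketch shows utilization climbing from roughly 1/PIPELINE_DEPTH with one thread toward 100% with four, which is the effect the paper's multi-threading exploits temporally and its multi-core partitioning exploits spatially.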