Efficient CORDIC-Based Activation Functions for RNN Acceleration on FPGAs


Bibliographic Details
Published in:	IEEE Transactions on Artificial Intelligence 2024-10, p.1-11
Main authors:	Shen, Wan; Jiang, Junye; Li, Minghan; Liu, Shuanglong
Format:	Article
Language:	English
Subjects:	Accuracy; activation function; Approximation algorithms; Artificial intelligence; Computer architecture; Coordinate Rotation Digital Computer (CORDIC); Digital computers; Field programmable gate arrays; Field Programmable Gate Arrays (FPGAs); Hardware; Hardware Acceleration; Long short term memory; Recurrent Neural Networks (RNNs); Table lookup; Vectors

Description
Summary:	Recurrent Neural Networks (RNNs), particularly Long Short-Term Memory (LSTM) networks, have emerged as standard tools for tackling a wide range of time series applications, such as natural language processing. However, deploying these models on edge devices poses significant challenges due to limited computational resources. Additionally, the implementation of RNN activation functions on low-end hardware devices significantly impacts the overall network performance, as activations constitute the dominant part of execution time. In this work, we propose an efficient approach for implementing commonly used RNN activations, leveraging an optimized Coordinate Rotation Digital Computer (CORDIC) algorithm. Moreover, we propose a unified hardware architecture for mapping the CORDIC-based method onto Field-Programmable Gate Arrays (FPGAs), which can be configured to implement multiple non-linear activation functions. Our architecture reduces computation time by using fewer CORDIC iterations than existing methods, making it particularly suitable for resource-constrained edge devices. Our design is implemented on a Xilinx Zynq-7000 device and evaluated across three RNNs and benchmark datasets. Experimental results demonstrate that our design achieves up to a 2× speedup over state-of-the-art designs while maintaining model accuracy.
ISSN:	2691-4581
DOI:	10.1109/TAI.2024.3474648
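
Illustrative note: the summary mentions computing RNN activations with CORDIC but this record does not describe the paper's optimized algorithm. The sketch below shows only the textbook hyperbolic CORDIC in rotation mode, in floating-point Python, as one way tanh and sigmoid can be obtained from shift-and-add style iterations. The function names (`hyperbolic_shift_schedule`, `cordic_tanh`, `cordic_sigmoid`), the default of 16 iterations, and the halving/double-angle range extension are assumptions made for illustration, not the authors' design.

```python
import math

def hyperbolic_shift_schedule(n_iter):
    """Shift indices 1, 2, 3, 4, 4, 5, ...; indices 4, 13, 40, ... are
    repeated, which standard hyperbolic CORDIC needs in order to converge."""
    shifts, i, next_repeat = [], 1, 4
    while len(shifts) < n_iter:
        shifts.append(i)
        if i == next_repeat and len(shifts) < n_iter:
            shifts.append(i)                 # repeated iteration
            next_repeat = 3 * next_repeat + 1
        i += 1
    return shifts

def cordic_tanh(theta, n_iter=16):
    """Approximate tanh(theta) with rotation-mode hyperbolic CORDIC."""
    # Assumed range extension: halve the argument until it lies inside the
    # convergence range (about 1.118), then undo with the double-angle identity.
    halvings = 0
    while abs(theta) > 1.0:
        theta /= 2.0
        halvings += 1

    # x tracks cosh and y tracks sinh up to a common gain; the gain cancels
    # in the ratio y / x, so no scale-factor correction is needed for tanh.
    x, y, z = 1.0, 0.0, theta
    for i in hyperbolic_shift_schedule(n_iter):
        d = 1.0 if z >= 0 else -1.0
        t = 2.0 ** -i                        # a right shift by i in hardware
        x, y = x + d * y * t, y + d * x * t
        z -= d * math.atanh(t)               # a small lookup table in hardware
    result = y / x

    # tanh(2a) = 2 tanh(a) / (1 + tanh(a)^2), applied once per halving.
    for _ in range(halvings):
        result = 2.0 * result / (1.0 + result * result)
    return result

def cordic_sigmoid(v, n_iter=16):
    """Sigmoid via the identity sigmoid(v) = 0.5 * (1 + tanh(v / 2))."""
    return 0.5 * (1.0 + cordic_tanh(v / 2.0, n_iter))
```

Sharing one tanh kernel between both activations loosely mirrors the idea of a single configurable unit mentioned in the summary. A fixed-point hardware version would replace the multiplications by `t` with shifts and the final division with a divider or a vectoring-mode CORDIC pass.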