V-LSTM: An Efficient LSTM Accelerator using Fixed Nonzero-Ratio Viterbi-Based Pruning

Bibliographic Details
Published in: IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 2023-10, Vol. 42 (10), p. 1-1
Authors: Kim, Taesu; Ahn, Daehyun; Lee, Dongsoo; Kim, Jae-Joon
Format: Article
Language: English
Description
Abstract: Long Short-Term Memory (LSTM) has been widely adopted in tasks with sequence data, such as speech recognition and language modeling. LSTM brought significant accuracy improvement by introducing additional parameters to the Recurrent Neural Network (RNN). However, the increased number of parameters and computations also led to inefficiency when computing LSTM on edge devices with limited on-chip memory size and DRAM bandwidth. To reduce the latency and energy of LSTM computations, there has been a pressing need for model compression schemes and suitable hardware accelerators. In this paper, we first propose the Fixed Nonzero-Ratio Viterbi-Based Pruning, which can reduce the memory footprint of LSTM models by 96% with negligible accuracy loss. By applying additional constraints on the distribution of surviving weights in Viterbi-based pruning, the proposed pruning scheme mitigates the load-imbalance problem and thereby increases the processing-engine utilization rate. We then propose V-LSTM, an efficient sparse LSTM accelerator based on the proposed pruning scheme. The high compression ratio of the proposed pruning scheme allows the proposed accelerator to achieve 24.9% lower per-sample latency than that of state-of-the-art accelerators. The proposed accelerator is implemented on a Xilinx VC-709 FPGA evaluation board running at 200 MHz for evaluation.
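To make the load-balancing idea in the abstract concrete, the sketch below prunes every block of a weight row down to an identical number of surviving weights, so processing engines that each handle one block receive the same amount of nonzero work. This is only a minimal illustration under assumed parameters (the names prune_fixed_nonzero_ratio, keep_ratio, and block_size are hypothetical); it is not the paper's actual Viterbi-based pruning, which additionally restricts which positions may survive so that the sparse indices can be stored compactly.

```python
import numpy as np

def prune_fixed_nonzero_ratio(weights, keep_ratio=0.04, block_size=64):
    """Keep the same number of largest-magnitude weights in every block
    of every row, so all rows (and the processing engines assigned to
    their blocks) end up with identical nonzero counts."""
    pruned = np.zeros_like(weights)
    keep_per_block = max(1, int(round(block_size * keep_ratio)))
    rows, cols = weights.shape
    for r in range(rows):
        for start in range(0, cols, block_size):
            block = weights[r, start:start + block_size]
            # Indices of the largest-magnitude entries within this block.
            k = min(keep_per_block, block.size)
            top = np.argpartition(np.abs(block), -k)[-k:]
            pruned[r, start + top] = block[top]
    return pruned

# Example: a 4x128 weight matrix pruned to roughly 4% density per 64-wide block.
w = np.random.randn(4, 128).astype(np.float32)
w_sparse = prune_fixed_nonzero_ratio(w, keep_ratio=0.04, block_size=64)
print((w_sparse != 0).sum(axis=1))  # identical nonzero count in every row
```

Because every block keeps exactly the same number of weights, no processing engine sits idle waiting for a more densely populated block to finish, which is the load-imbalance effect the fixed nonzero-ratio constraint is meant to remove.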
ISSN: 0278-0070, 1937-4151
DOI: 10.1109/TCAD.2023.3243879