Multipurpose Deep-Learning Accelerator for Arbitrary Quantization With Reduction of Storage, Logic, and Latency Waste

Bibliographic Details
Published in:	IEEE journal of solid-state circuits 2024-01, Vol.59 (1), p.143-156
Main authors:	Moon, Seunghyun; Mun, Han-Gyeol; Son, Hyunwoo; Sim, Jae-Yoon
Format:	Article
Language:	English
Subjects:	Accelerator architectures; Arbitrary quantization (AQ); bit-serial processing; Decoding; Deep learning; deep neural network (DNN) accelerator; Hardware; Logic; lookup table (LUT); Lookup tables; Multiplication; Optimization techniques; Power efficiency; precision scalability; Quantization (signal); Reconfiguration; Run time (computers); run-length compression (RLC); Scalability; Table lookup

Description
Abstract:	Various pruning and quantization heuristics have been proposed to compress recent deep-learning models. However, the rapid development of new optimization techniques makes it difficult for domain-specific accelerators to efficiently process various models showing irregularly stored parameters or nonlinear quantization. This article presents a scalable-precision deep-learning accelerator that supports multiply-and-accumulate operations (MACs) with two arbitrarily quantized data sequences. The proposed accelerator includes three main features. To minimize logic overhead when processing arbitrarily quantized 8-bit precision data, a lookup table (LUT)-based runtime reconfiguration is proposed. The use of bit-serial execution without unnecessary computations enables the multiplication of data with non-equal precision while minimizing logic and latency waste. Furthermore, two distinct data formats, raw and run-length compressed, are supported by a zero-eliminator (ZE) and runtime-density detector (RDD) that are compatible with both formats, delivering enhanced storage and performance. For a precision range of 1-8 bit and fixed sparsity of 30%, the accelerator implemented in 28 nm low-power (LP) CMOS shows a peak performance of 0.87-5.55 TOPS and a power efficiency of 15.1-95.9 TOPS/W. The accelerator supports processing with arbitrary quantization (AQ) while achieving state-of-the-art (SOTA) power efficiency.
ISSN:	0018-9200, 1558-173X
DOI:	10.1109/JSSC.2023.3312615
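
Illustrative sketch (not from the article): the abstract names three ideas, LUT-based decoding of arbitrarily quantized data, bit-serial multiplication that skips unnecessary work, and run-length compressed storage with zero elimination. The Python below is a minimal software model of those ideas under simplifying assumptions. It does not reproduce the accelerator's architecture, its zero-eliminator (ZE), or its runtime-density detector (RDD). All function names, the (zero_run, value) pair format, and the example tables are hypothetical.

```python
from typing import List, Sequence, Tuple


def lut_decode(codes: Sequence[int], lut: Sequence[int]) -> List[int]:
    """Map stored quantization codes to integer levels through a lookup table.

    The levels in the table need not be evenly spaced (assumed model of
    arbitrary quantization).
    """
    return [lut[c] for c in codes]


def bit_serial_multiply(a: int, b: int) -> Tuple[int, int]:
    """Shift-and-add multiply that walks the bits of |b| one at a time.

    Returns (product, steps). steps equals the bit length of |b|, so a
    narrower operand finishes in fewer steps.
    """
    sign = -1 if b < 0 else 1
    magnitude = abs(b)
    product = 0
    steps = 0
    while magnitude:
        if magnitude & 1:          # only set bits contribute a partial product
            product += a << steps
        magnitude >>= 1
        steps += 1
    return sign * product, steps


def rlc_encode(values: Sequence[int]) -> List[Tuple[int, int]]:
    """Run-length compress zeros into (zero_run, nonzero_value) pairs.

    Trailing zeros are stored as a final (zero_run, 0) pair (hypothetical format).
    """
    pairs: List[Tuple[int, int]] = []
    run = 0
    for v in values:
        if v == 0:
            run += 1
        else:
            pairs.append((run, v))
            run = 0
    if run:
        pairs.append((run, 0))
    return pairs


def rlc_decode(pairs: Sequence[Tuple[int, int]]) -> List[int]:
    """Expand (zero_run, value) pairs back into the raw sequence."""
    out: List[int] = []
    for run, v in pairs:
        out.extend([0] * run)
        if v != 0:
            out.append(v)
    return out


def sparse_mac(acts: Sequence[int], weights: Sequence[int]) -> Tuple[int, int]:
    """Multiply-accumulate that skips any pair containing a zero operand.

    Returns (accumulated sum, total bit-serial steps spent).
    """
    acc = 0
    steps = 0
    for a, w in zip(acts, weights):
        if a == 0 or w == 0:
            continue
        p, s = bit_serial_multiply(a, w)
        acc += p
        steps += s
    return acc, steps


# Hypothetical 2-bit codebooks with non-uniform levels.
weight_lut = [0, 1, 3, -8]
act_lut = [0, 1, 2, 5]

weights = lut_decode([3, 0, 0, 2, 1, 0], weight_lut)   # [-8, 0, 0, 3, 1, 0]
acts = lut_decode([1, 2, 0, 3, 0, 3], act_lut)         # [1, 2, 0, 5, 0, 5]

compressed = rlc_encode(weights)      # [(0, -8), (2, 3), (0, 1), (1, 0)]
print(sparse_mac(acts, rlc_decode(compressed)))         # (7, 6)
```

In this toy run only two of the six operand pairs reach the multiplier, and the step count depends on the bit length of each weight rather than on a fixed word width. That is the kind of logic and latency saving the abstract describes, shown here only in software terms.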