LUT‐DSP usage trade‐off for re‐configurable convolution acceleration core based on small logarithmic floating point representation
Published in: | International journal of circuit theory and applications 2024-04, Vol.52 (4), p.1864-1871 |
---|---|
Main authors: | , , , , , , , |
Format: | Article |
Language: | English |
Subjects: | |
Online access: | Full text |
Abstract: | The challenge in designing the high‐performance field‐programmable gate array (FPGA)‐based convolution accelerator is to take full advantage of the on‐chip computing resources. The reported CNN accelerators always exhaust the on‐chip DSPs and leave other computing resources under‐utilized. Hence, this brief presents a novel convolution acceleration core based on the small logarithmic floating‐point (SLFP) format, which results in three contributions. (1) The SLFP multiplier is implemented with only 13× LUT6s, operates at 650 MHz with the aid of the carry chain, and provides sufficient accuracy for most CNNs. In addition, a similar structure can be used to implement an SLFP divider. (2) The DSPs in the TWO24 SIMD mode are cascaded to implement a 9‐input adder tree. The sum of a multiple of 9× elements (e.g., 18×, 27×) is easily obtained by configuring the last DSP in the 9‐input adder tree in the accumulation mode, which supports more kernels (e.g., 5×5, 128×1×1) with a high utilization rate (≈90%). (3) The convolution core based on the SLFP format uses only 654× LUT6s and 7× DSPs to achieve 1300 MOPS, 433 MOPS, and 81 MOPS for 3×3, 5×5, and 128×1×1 kernels, respectively. In summary, the proposed convolution accelerator not only balances the resource usage of LUT6s and DSPs but also quantizes most CNN models using several simple scaling operations instead of a computing‐intensive retraining algorithm because the distribution of SLFP numbers is very similar to that of FP32 numbers.
The challenge in designing the high‐performance field‐programmable gate array (FPGA)‐based convolution accelerator is to take full advantage of the on‐chip computing resources. This article presents a novel convolution acceleration core based on the small logarithmic floating‐point (SLFP) format, which not only balances the resource usage of LUT6s and DSPs but also quantizes most CNN models using several simple scaling operations instead of a computing‐intensive retraining algorithm. |
---|---|
ISSN: | 0098-9886 1097-007X |
DOI: | 10.1002/cta.3834 |
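
The abstract describes summing kernel products nine at a time and accumulating the partial sums in the last DSP of the adder tree. The Python sketch below is a simplified software model of that idea, not the authors' hardware design. It assumes that each pass consumes up to nine products and that unused inputs in the final pass are zero-padded; the function names `passes_and_utilization` and `accumulate_in_passes` are hypothetical, and pipeline or control overhead is ignored.

```python
from math import ceil

TREE_INPUTS = 9  # adder-tree width stated in the abstract


def passes_and_utilization(num_products: int) -> tuple[int, float]:
    """Return (passes, utilization) for a kernel with num_products products.

    Assumption (not from the source): each pass consumes up to nine
    products, and unused inputs in the final pass are zero-padded.
    """
    passes = ceil(num_products / TREE_INPUTS)
    utilization = num_products / (passes * TREE_INPUTS)
    return passes, utilization


def accumulate_in_passes(products: list[float]) -> float:
    """Sum the products nine at a time, accumulating the partial sums."""
    total = 0.0
    for start in range(0, len(products), TREE_INPUTS):
        chunk = products[start:start + TREE_INPUTS]
        total += sum(chunk)  # models one pass through the 9-input adder tree
    return total


if __name__ == "__main__":
    # Kernel sizes named in the abstract: 3x3, 5x5, and 128x1x1.
    for name, n in [("3x3", 9), ("5x5", 25), ("128x1x1", 128)]:
        passes, util = passes_and_utilization(n)
        print(f"{name}: {passes} pass(es), {util:.1%} of adder inputs used")
```

Under this model a 5×5 kernel needs three passes, which is in line with the reported 433 MOPS being roughly one third of the 1300 MOPS reported for 3×3. The model is too coarse to reproduce the reported 128×1×1 figure, so it should be read as an illustration of the accumulation scheme only.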