Low Latency YOLOv3-Tiny Accelerator for Low-Cost FPGA Using General Matrix Multiplication Principle
Published in: | IEEE Access 2021, Vol.9, p.141890-141913 |
---|---|
Main authors: | , , , , |
Format: | Article |
Language: | English |
Subjects: | |
Online access: | Full text |
Abstract: | This paper presents a comprehensive hardware accelerator architecture for YOLOv3-Tiny, targeting low-cost FPGAs with a high frame rate, high accuracy, and low latency. The proposed accelerator implements all YOLO layers in hardware, including the zero-padding, convolution, leaky ReLU, batch normalization, max-pooling, and up-sampling layers. The architecture is a hybrid of the data flow and control flow paradigms. Data preparation and computation run asynchronously following the data flow paradigm, while the overall process is governed by a proposed custom instruction set that follows the control flow paradigm. Convolution is computed using the principle of General Matrix Multiplication (GEMM). We designed a GEMM processor based on a systolic array architecture of optimum size. The systolic core is small and the overall architecture supports a multicore system, making it scalable to larger FPGAs. We also propose a hardware architecture for mapping feature maps into matrix form for GEMM convolution, which saves on-chip memory space. Lastly, we modified the original YOLO algorithm to further optimize it for our hardware. The modifications include reducing the bit precision to lower the memory space and bandwidth requirements, merging the normalization layer with the convolution layer to reduce arithmetic complexity, and adding a new DLQ layer to keep the bit size small while maintaining accuracy. Based on the experimental results, our proposed design achieves a frame rate of 8.3 FPS with a throughput of 31.5 GOPS, outperforming the same convolution computation performed on a Ryzen 5 3600 CPU by up to 69.3× in latency. Moreover, our proposed design has a smaller clock cycle ratio than other commercial accelerators, by up to 1.75×. The system is useful and suitable for edge computing applications. |
---|---|
ISSN: | 2169-3536 |
DOI: | 10.1109/ACCESS.2021.3120629 |
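
The abstract states that convolution is computed using the GEMM principle, with feature maps mapped into matrix form. As a rough software illustration of that general idea only, the sketch below unrolls each sliding window of a feature map into a matrix column (the common "im2col" approach) and then computes all filter outputs with one matrix multiplication. It is not the paper's design: the function names (`im2col`, `matmul`, `conv2d_gemm`) are hypothetical, the input is assumed to be a single-channel feature map without padding, and the sketch does not model the paper's hardware mapping unit, systolic array, or reduced bit precision.

```python
def im2col(fmap, k, stride=1):
    """Unroll every k x k window of a 2-D feature map into one matrix column.

    Assumption: single input channel, no padding (hypothetical helper).
    Returns the (k*k) x (out_h*out_w) matrix and the output height and width.
    """
    h, w = len(fmap), len(fmap[0])
    out_h = (h - k) // stride + 1
    out_w = (w - k) // stride + 1
    windows = []
    for i in range(0, out_h * stride, stride):
        for j in range(0, out_w * stride, stride):
            # Flatten one window in row-major order.
            windows.append([fmap[i + di][j + dj]
                            for di in range(k) for dj in range(k)])
    # 'windows' has one window per row; transpose so each window is a column.
    cols = [list(row) for row in zip(*windows)]
    return cols, out_h, out_w


def matmul(a, b):
    """Plain matrix product of two list-of-lists matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def conv2d_gemm(fmap, kernels, stride=1):
    """Convolve (CNN-style, without kernel flipping) via one matrix product.

    'kernels' is a list of square k x k filters; the result holds one
    output feature map per filter.
    """
    k = len(kernels[0])
    cols, out_h, out_w = im2col(fmap, k, stride)
    # Each filter becomes one row of the weight matrix.
    weights = [[v for row in kern for v in row] for kern in kernels]
    flat = matmul(weights, cols)
    # Reshape each flat output row back into an out_h x out_w map.
    return [[r[i * out_w:(i + 1) * out_w] for i in range(out_h)]
            for r in flat]


if __name__ == "__main__":
    fmap = [[1, 2, 3],
            [4, 5, 6],
            [7, 8, 9]]
    kernels = [[[1, 0], [0, 1]],   # sums the main diagonal of each window
               [[1, 1], [1, 1]]]   # sums all four values of each window
    print(conv2d_gemm(fmap, kernels))
```

Note that this sketch materializes the full unrolled matrix in memory, which duplicates overlapping input values; the abstract indicates that the paper's mapping architecture is designed to save on-chip memory, so the real design presumably avoids that cost.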