Fast and Scalable Multicore YOLOv3-Tiny Accelerator Using Input Stationary Systolic Architecture
Published in: | IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 2023-11, Vol. 31 (11), p. 1-14 |
---|---|
Main authors: | , , , , , |
Format: | Article |
Language: | eng |
Subjects: | |
Abstract: | This article proposes a scalable accelerator for deep learning (DL) inference on edge computing devices, which are often limited by power, storage, and computation speed. The accelerator is based on systolic array cores with 126 processing elements (PEs) and is optimized for YOLOv3-Tiny with 448 \(\times\) 448 input images. Two multicast (MC) network architectures, feature map multicasting and weight multicasting, are introduced to control data stream distribution among the cores. Results show that the proposed weight multicast (W-MC) systems outperform the feature map multicast (FMAP-MC) systems in multicore scenarios, achieving up to 2.23\(\times\) the frames per second (FPS). The 4-core W-MC system achieved the best efficiency, with an overall frame rate efficiency of 13.73 FPS/W and an overall throughput efficiency of 35.83 GOPS/W. The 8-core W-MC system delivered the best performance, with a frame rate of 38.50 FPS after normalization to the standard YOLOv3-Tiny network. Compared with previous state-of-the-art works, the proposed accelerator offers better computational efficiency and greater accelerator utilization in real-world inference scenarios. |
---|---|
ISSN: | 1063-8210, 1557-9999 |
DOI: | 10.1109/TVLSI.2023.3305937 |