Fast and Scalable Multicore YOLOv3-Tiny Accelerator Using Input Stationary Systolic Architecture

Bibliographic Details
Published in:	IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 2023-11, Vol. 31 (11), p. 1-14
Main authors:	Adiono, Trio; Ramadhan, Rhesa Muhammad; Sutisna, Nana; Syafalni, Infall; Mulyawan, Rahmat; Lin, Chang-Hong
Format:	Article
Language:	English
Subjects:	Computer architecture; Control data (computers); Convolutional neural networks (CNNs); Data transmission; Edge computing; Feature maps; Frames per second; Input stationary (IS) dataflow; Multicasting; Scalable accelerator; Systolic array; YOLOv3-Tiny

Description
Summary:	This article proposes a scalable accelerator for deep learning (DL) implementation on edge computing, which is often limited by power, storage, and computation speed. The accelerator is based on systolic array cores with 126 processing elements (PEs) and is optimized for YOLOv3-Tiny with \(448 \times 448\) input images. Two multicast (MC) network architectures, feature map multicasting and weight multicasting, are introduced to control data stream distribution within the multicores. Results show that the proposed weight multicast (W-MC) systems outperformed the feature map multicast (FMAP-MC) systems in multicore scenarios, achieving up to \(2.23\times\) the frames per second (FPS). The 4-core W-MC system achieved the best efficiency, with an overall frame rate of 13.73 FPS/W and an overall throughput of 35.83 GOPS/W. The 8-core W-MC system delivered the best performance, with a frame rate of 38.50 FPS after normalization to the standard YOLOv3-Tiny network. Compared to previous state-of-the-art works, the proposed accelerator offers better computational efficiency and greater accelerator utilization in real-world inference scenarios.
ISSN:	1063-8210 1557-9999
DOI:	10.1109/TVLSI.2023.3305937