Improving Utilization of Dataflow Unit for Multi-Batch Processing

Bibliographic details
Published in:	ACM Transactions on Architecture and Code Optimization, 2024-02, Vol. 21 (1), p. 1-26, Article 17
Main authors:	Fan, Zhihua; Li, Wenming; Wang, Zhen; Yang, Yu; Ye, Xiaochun; Fan, Dongrui; Sun, Ninghui; An, Xuejun
Format:	Article
Language:	English
Subjects:	Computer systems organization; Data flow architectures
Online access:	Full text

Description
Abstract:	Dataflow architectures can achieve much better performance and higher efficiency than general-purpose cores, approaching the performance of a specialized design while retaining programmability. However, advanced application scenarios place higher demands on the hardware in terms of cross-domain and multi-batch processing. In this article, we propose a unified scale-vector architecture that can work in multiple modes and adapt efficiently to diverse algorithms and requirements. First, a novel reconfigurable interconnection structure is proposed, which can organize execution units into different cluster topologies to accommodate different degrees of data-level parallelism. Second, we decouple the threads within each DFG node into consecutive pipeline stages and provide architectural support for them. By time-multiplexing these stages, dataflow hardware can achieve much higher utilization and performance. In addition, the task-based programming model can exploit multi-level parallelism and deploy applications efficiently. Evaluated on a wide range of benchmarks, including digital signal processing algorithms, CNNs, and scientific computing algorithms, our design attains up to 11.95× energy efficiency (performance-per-watt) improvement over a GPU (V100) and 2.01× energy efficiency improvement over state-of-the-art dataflow architectures.
ISSN:	1544-3566, 1544-3973
DOI:	10.1145/3637906