System-Level Communication Performance Estimation for DMA-Controlled Accelerators

Bibliographic details
Published in: IEEE Access, 2021, Vol. 9, pp. 141389-141402
Main authors: Kim, Sunwoo; Park, Sungkyung; Park, Chester Sungchung
Format: Article
Language: English
Online access: Full text
Description
Summary: The performance of a hardware accelerator is often limited by the communication bandwidth between local on-chip memories and DRAM across the on-chip bus. In this paper, a system-level performance estimation algorithm is proposed for evaluating the communication performance of direct memory access (DMA) controlled accelerators. The proposed algorithm can estimate the communication performance accurately for both DRAM-limited and bus-limited cases. In detail, the communication performance for the DRAM-limited case is estimated using dynamic prediction of DRAM command patterns, whereas the communication performance for the bus-limited case is estimated based on the maximum bus burst latency. Depending on whether the communication bandwidth is limited by the bus protocol overhead or the DRAM latency, the proposed algorithm estimates the communication bandwidth of a DMA-controlled accelerator according to the performance bottleneck. It is shown that the proposed algorithm significantly improves the estimation accuracy when it is applied to CNNs and wireless communications. Also, when the proposed algorithm is used together with a full-system simulator to explore a design space defined by a set of tile sizes and bus-related parameters, it speeds up conventional algorithms by more than a factor of 100 by filtering out a large number of unpromising design points. It is also shown that the proposed algorithm alone can approach the maximum accelerator performance with a performance degradation of less than 5%. An ablation study is conducted to demonstrate the efficacy of the individual steps of the proposed algorithm.
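The bottleneck idea in the summary (take whichever of the bus-limited and DRAM-limited estimates is lower, then use that estimate to discard unpromising design points before full-system simulation) can be illustrated with a minimal sketch. This is not the paper's algorithm: the record does not give its formulas, so the two bandwidth models below are toy stand-ins. In particular, the fixed row-miss penalty replaces the paper's dynamic prediction of DRAM command patterns, and every name and default value (DesignPoint, bus_width_bytes, burst_overhead_cycles, row_bytes, row_miss_cycles, keep_fraction) is a hypothetical assumption chosen for illustration.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class DesignPoint:
    """One candidate configuration (both fields are hypothetical parameters)."""
    tile_bytes: int   # bytes moved per DMA transfer of one tile
    burst_beats: int  # data beats per bus burst


def bus_limited_bw(p, bus_width_bytes=8, burst_overhead_cycles=4):
    """Toy bus model: each burst pays a fixed protocol overhead in cycles."""
    burst_bytes = p.burst_beats * bus_width_bytes
    return burst_bytes / (p.burst_beats + burst_overhead_cycles)


def dram_limited_bw(p, peak_bytes_per_cycle=8.0, row_bytes=2048,
                    row_miss_cycles=30):
    """Toy DRAM model: every row touched by the tile costs a fixed penalty.

    Assumption: a constant penalty per row stands in for real command-pattern
    prediction, which this sketch does not attempt.
    """
    rows_touched = -(-p.tile_bytes // row_bytes)  # ceiling division
    cycles = p.tile_bytes / peak_bytes_per_cycle + rows_touched * row_miss_cycles
    return p.tile_bytes / cycles


def estimate_bw(p):
    """Estimated bandwidth in bytes/cycle: the tighter of the two limits."""
    return min(bus_limited_bw(p), dram_limited_bw(p))


def bottleneck(p):
    """Name the limit that determines the estimate for this design point."""
    return "bus" if bus_limited_bw(p) <= dram_limited_bw(p) else "dram"


def prune(points, keep_fraction=0.5):
    """Keep only the best-ranked fraction of points for detailed simulation."""
    ranked = sorted(points, key=estimate_bw, reverse=True)
    keep = max(1, int(len(ranked) * keep_fraction))
    return ranked[:keep]


if __name__ == "__main__":
    space = [DesignPoint(t, b) for t, b in product([256, 1024, 4096], [4, 16])]
    for point in prune(space):
        print(point, bottleneck(point), round(estimate_bw(point), 3))
```

In this sketch only the points that survive prune would be passed on to a slower, more accurate simulator, which is the general mechanism by which a fast estimator can shorten design space exploration.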
ISSN: 2169-3536
DOI: 10.1109/ACCESS.2021.3119516