Software-managed automatic data sharing for Coarse-Grained Reconfigurable coprocessors

Bibliographic Details
Main Authors:	Mai, T. X., Jongeun Lee
Format:	Conference Proceeding
Language:	English
Subjects:	Arrays; Coprocessors; Customer relationship management; Data transfer; Kernel

Description
Summary:	A Coarse-Grained Reconfigurable Architecture (CGRA) in a hybrid system can significantly accelerate the compute-intensive kernels of applications. However, the data communication overhead between the main processor (MP) and the CGRA can be large enough to negate the CGRA's speedup. In this paper we address the problem of reducing the data communication overhead in a hybrid system by offering a partially automatic data sharing technique using a special shared memory called Configurable Range Memory (CRM). Unlike previous work, the CRM architecture we use here is based on comparators, which gives much higher flexibility in where an array can be placed within a CRM but makes the runtime software management of a CRM much more challenging. We present an efficient runtime algorithm based on a first-fit heuristic. Our experimental results demonstrate that our CRM-based system can reduce the amount of data transfer between an MP and a CGRA by up to 89.5% compared to ScratchPad Memory (SPM)-based systems, while the software management overhead is only 1.20% to 1.34% on average (depending on CRM architecture parameters) of the kernel cycles in the MP-only execution. Overall, our CRM-based system achieves an average kernel speedup of 3.47 times over MP-only execution, which is about a 20% improvement over the SPM-based system.
DOI:	10.1109/FPT.2012.6412148
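
The summary mentions a runtime algorithm based on a first-fit heuristic for placing arrays in the CRM but gives no details. The following is only a generic sketch of first-fit placement, not the authors' algorithm: it assumes the memory can be modeled as a flat address range of fixed size in which an array may start at any free offset. The names `RangeMemory`, `first_fit`, and `release` are hypothetical and chosen purely for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class RangeMemory:
    """Toy model of a range-based shared memory (hypothetical, for illustration)."""
    size: int
    # Maps an array name to its (start offset, length) inside the memory.
    ranges: dict = field(default_factory=dict)

    def first_fit(self, name, length):
        """Place `name` in the lowest-addressed free gap that is large enough.

        Returns the start offset, or None if no gap can hold `length`.
        """
        if length <= 0:
            raise ValueError("length must be positive")
        if name in self.ranges:
            raise ValueError(f"{name!r} is already placed")
        cursor = 0  # lowest address not yet ruled out
        for start, used in sorted(self.ranges.values()):
            if start - cursor >= length:
                break  # the gap before this range is big enough
            cursor = start + used  # skip past the occupied range
        if cursor + length > self.size:
            return None  # no gap, including the tail, can hold the array
        self.ranges[name] = (cursor, length)
        return cursor

    def release(self, name):
        """Free the range held by `name` so later placements can reuse it."""
        del self.ranges[name]


# Usage: fill the memory, free a middle range, then reuse the gap.
mem = RangeMemory(size=100)
a = mem.first_fit("a", 40)
b = mem.first_fit("b", 30)
c = mem.first_fit("c", 20)
mem.release("b")
d = mem.first_fit("d", 25)  # reuses the gap left by "b"
```

First-fit keeps each placement decision to a single pass over the occupied ranges, which is the usual reason for choosing it when management overhead matters; the trade-off is that it can leave small unusable gaps between ranges.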