BL-PIM: Varying the Burst Length to Realize the All-bank Performance and Minimize the Multi-Workload Interference for in-DRAM PIM
As the demand for transformer applications increases rapidly, technologies to solve memory bottlenecks are attracting attention. One of them is an in-DRAM Processing-In-Memory (PIM) architecture to perform the computation inside DRAM. Major DRAM makers introduce the PIM samples, executing all banks&...
Gespeichert in:
Veröffentlicht in: | IEEE access 2023-01, Vol.11, p.1-1 |
---|---|
Hauptverfasser: | , , , , |
Format: | Artikel |
Sprache: | eng |
Schlagworte: | |
Online-Zugang: | Volltext |
Tags: |
Tag hinzufügen
Keine Tags, Fügen Sie den ersten Tag hinzu!
|
Zusammenfassung: | As the demand for transformer applications increases rapidly, technologies to solve memory bottlenecks are attracting attention. One of them is an in-DRAM Processing-In-Memory (PIM) architecture to perform the computation inside DRAM. Major DRAM makers introduce the PIM samples, executing all banks' computations simultaneously to maximize the internal DRAM bandwidth for achieving the highest performance. However, the realization as a commercial product is problematic since the all-bank execution does not concurrently perform non-PIM applications during the PIM execution with PIM memory, thus separating their memory space. This paper proposes a BL-PIM architecture to increase the burst length (BL) of memory requests inside a bank to maximize internal bandwidth and overlap the computation across banks, thus achieving all-bank performance. On the other hand, outside a bank, it seems not to increase the BL, thus allowing us to preserve the data consistency in memory hierarchy and execute non-PIM and PIM applications together with PIM memory. Also, the memory-intensive PIM computation using larger BL significantly reduces the number of memory requests, thus minimizing the performance interference with other applications. We carefully extend the DRAM timing diagram and develop the cooperation mechanism between a memory controller and a PIM device. We implemented the BL-PIM architecture on FPGA and compared the performance with real machines using four transformer models and eight compute and memory-bound SPEC benchmarks.We achieved the BL-PIM performance up to 28.9x and 12.0x faster than the CPU single-thread and multi-threaded execution in the transformer models. Also, when we increased the burst length by 16 times as the maximum, the BL-PIM was 1.2x faster than the ideal all-bank PIM execution. We also experimented with the multi-workload execution using the SPEC benchmarks, showing that our architecture can minimize performance interference. To our knowledge, the study of the PIM's multi-workload execution is the first in public. |
---|---|
ISSN: | 2169-3536 2169-3536 |
DOI: | 10.1109/ACCESS.2023.3300893 |