Logic-Base Interconnect Design for Near Memory Computing in the Smart Memory Cube
Published in: | IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 2017-01, Vol. 25 (1), pp. 210-223 |
---|---|
Format: | Article |
Language: | English |
Abstract: | The hybrid memory cube (HMC) promises to improve bandwidth, power consumption, and density for next-generation main memory systems. In addition, 3-D integration offers a second chance to revisit near memory computation and narrow the gap between processors and memories. In this paper, we study the infrastructure required inside the HMC to support near memory computation in a modular and flexible fashion. We propose a fully backward-compatible extension to the standard HMC called the smart memory cube, and design a high-bandwidth, low-latency, Advanced eXtensible Interface (AXI) 4.0-compatible logic base (LoB) interconnect to serve the huge bandwidth demand of the HMC's serial links and to provide extra bandwidth to a generic processor-in-memory (PIM) device embedded in the LoB. This interconnect features a novel address scrambling mechanism that reduces vault/bank conflicts and keeps operation robust even in the presence of pathological traffic patterns. Our cycle-accurate simulation results demonstrate that this interconnect can easily meet the demands of the latest HMC specifications (up to 205 GB/s read bandwidth with 4 serial links and 32 memory vaults for injected random traffic). It is further shown that the default addressing scheme of the HMC (low interleaving) is not reliable enough and performs poorly in the presence of specific traffic patterns from real applications, whereas the proposed scrambling mechanism operates robustly even in those cases. The interference between the PIM traffic and the main links is shown to be negligible when the number of PIM ports is limited to 2, requesting up to 64 GB/s without pushing the system into saturation. Finally, logic synthesis with Synopsys Design Compiler confirms that our interconnect is implementable and effective in terms of power, area, and timing (power consumption less than 5 mW up to 1 GHz and area less than 0.4 mm²). |
---|---|
ISSN: | 1063-8210, 1557-9999 |
DOI: | 10.1109/TVLSI.2016.2570283 |
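The abstract contrasts the HMC's default low-interleaved addressing with an address scrambling mechanism but does not describe how the scrambling works. The sketch below only illustrates the general problem, under stated assumptions: a 32-byte block offset (hypothetical), 32 vaults (from the abstract), and a generic XOR-folding hash as a stand-in scrambler. The function names are hypothetical, and the XOR fold is not the mechanism proposed in the paper.

```python
# Illustration only: how a strided access pattern can collapse onto one vault
# under low interleaving, and how a generic XOR-based hash spreads it out.
# OFFSET_BITS and the XOR fold are assumptions, not taken from the paper.

OFFSET_BITS = 5                 # assumed 32-byte block size
VAULT_BITS = 5                  # 32 vaults, as stated in the abstract
NUM_VAULTS = 1 << VAULT_BITS


def vault_low_interleaved(addr: int) -> int:
    """Vault index taken from the address bits just above the block offset."""
    return (addr >> OFFSET_BITS) & (NUM_VAULTS - 1)


def vault_xor_scrambled(addr: int) -> int:
    """Hypothetical scrambler: XOR-fold all block-address bits into the vault index."""
    block = addr >> OFFSET_BITS
    index = 0
    while block:
        index ^= block & (NUM_VAULTS - 1)
        block >>= VAULT_BITS
    return index


def vaults_touched(mapping, stride_bytes: int, accesses: int = 1024) -> int:
    """Count the distinct vaults hit by `accesses` requests at a fixed stride."""
    return len({mapping(i * stride_bytes) for i in range(accesses)})


# A stride equal to (number of vaults * block size) keeps the low vault bits constant.
PATHOLOGICAL_STRIDE = NUM_VAULTS << OFFSET_BITS   # 1024 bytes

if __name__ == "__main__":
    for name, mapping in [("low interleaving", vault_low_interleaved),
                          ("XOR scrambling", vault_xor_scrambled)]:
        n = vaults_touched(mapping, PATHOLOGICAL_STRIDE)
        print(f"{name}: {n} of {NUM_VAULTS} vaults used")
```

With the 1024-byte stride, every request maps to the same vault under low interleaving, so the accesses serialize on a single vault, while the XOR fold distributes them across all 32 vaults. This is the kind of pathological pattern the abstract refers to, reproduced here with a simplified model rather than the authors' simulator.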