OSM: Off-Chip Shared Memory for GPUs

Detailed description

Bibliographic details
Published in:	IEEE Transactions on Parallel and Distributed Systems, 2022-12, Vol. 33 (12), p. 1-1
Main authors:	Darabi, Sina; Yousefzadeh-Asl-Miandoab, Ehsan; Akbarzadeh, Negar; Falahati, Hajar; Lotfi-Kamran, Pejman; Sadrosadati, Mohammad; Sarbazi-Azad, Hamid
Format:	Article
Language:	English
Subjects:	Bandwidth; Cache Memory; Chips (memory devices); Data retrieval; GPUs; Graphics processing units; Instruction sets; Lifetime; Memory management; Multiprocessing; Off-chip Memory; Proposals; Registers; Shared Memory; System-on-chip; Workload; Workloads

Description
Abstract:	Graphics Processing Units (GPUs) employ a shared memory, a software-managed cache for programmers, in each streaming multiprocessor to accelerate data sharing among the threads in a thread block. Although, on average, 60% of the shared memory space is underutilized, some workloads demand higher shared memory capacities. Improving shared memory utilization while satisfying the needs of shared-memory-intensive workloads is therefore challenging. We make the key observation that the lifetime of each shared memory address is significantly shorter than the execution time of a thread block. In this paper, we propose Off-Chip Shared Memory (OSM), which allocates shared memory space in off-chip memory and accelerates accesses to it via a small on-chip cache. Using an 8KB cache for shared memory addresses, OSM provides almost the same performance as the baseline GPU, which uses 96KB of on-chip shared memory. OSM improves GPU performance in two ways. First, it allows higher shared memory capacities to be allocated in off-chip memory, which improves thread-level parallelism (TLP). Second, it uses a unified cache for the shared memory and global address spaces, providing more caching space for the global memory address space even for workloads with high shared memory utilization. Our experimental results show average IPC improvements of 21% and 18% over the baseline and the state-of-the-art architectures, respectively.
ISSN:	1045-9219 1558-2183
DOI:	10.1109/TPDS.2022.3154315
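
The abstract's first claim, that lifting the on-chip shared memory capacity limit can raise thread-level parallelism, can be illustrated with a rough capacity model. The sketch below is not taken from the paper: the function name, the per-SM block limit of 16, and the 40KB per-block request are hypothetical values chosen for illustration, and other real occupancy limits (registers, threads per SM) are deliberately ignored. Only the 96KB baseline figure comes from the abstract.

```python
def resident_blocks(shared_capacity_bytes, shared_per_block_bytes, hw_block_limit):
    """Thread blocks that fit on one SM under a simple capacity-only model.

    shared_capacity_bytes=None models shared memory that is not bounded by
    on-chip capacity (an assumption standing in for off-chip allocation).
    """
    if shared_capacity_bytes is None or shared_per_block_bytes == 0:
        return hw_block_limit
    return min(hw_block_limit, shared_capacity_bytes // shared_per_block_bytes)


KB = 1024
HW_BLOCK_LIMIT = 16   # hypothetical per-SM limit on resident thread blocks
PER_BLOCK = 40 * KB   # hypothetical kernel requesting 40KB of shared memory per block

# Baseline from the abstract: 96KB of on-chip shared memory per SM.
on_chip = resident_blocks(96 * KB, PER_BLOCK, HW_BLOCK_LIMIT)

# Capacity limit lifted (assumed model of allocating shared memory off-chip).
off_chip = resident_blocks(None, PER_BLOCK, HW_BLOCK_LIMIT)

print(on_chip, off_chip)  # prints "2 16"
```

Under these assumed numbers, shared memory capacity alone caps the baseline at two resident blocks per SM, while the capacity-free case is bounded only by the hypothetical block limit. Whether that extra parallelism translates into performance depends on how well the on-chip cache absorbs the shared memory accesses, which this model does not capture.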