Not All GPUs Are Created Equal: Characterizing Variability in Large-Scale, Accelerator-Rich Systems
Saved in:
Main authors:
Format: Article
Language: English
Online access: Order full text
Abstract: Scientists are increasingly exploring and utilizing the massive parallelism
of general-purpose accelerators such as GPUs for scientific breakthroughs. As a
result, datacenters, hyperscalers, national computing centers, and
supercomputers have procured hardware to support this evolving application
paradigm. These systems contain hundreds to tens of thousands of accelerators,
enabling peta- and exa-scale levels of compute for scientific workloads. Recent
work demonstrated that power management (PM) can impact application performance
in CPU-based HPC systems, even when machines have the same architecture and SKU
(stock keeping unit). This variation occurs due to manufacturing variability
and the chip's PM. However, while modern HPC systems widely employ accelerators
such as GPUs, it is unclear how much this variability affects applications.
Accordingly, we seek to characterize the extent of variation due to GPU PM in
modern HPC and supercomputing systems. We study a variety of applications that
stress different GPU components on five large-scale computing centers with
modern GPUs: Oak Ridge's Summit, Sandia's Vortex, TACC's Frontera and Longhorn,
and Livermore's Corona. These clusters use a variety of cooling methods and GPU
vendors. In total, we collect over 18,800 hours of data across more than 90% of
the GPUs in these clusters. Regardless of the application, cluster, GPU vendor,
and cooling method, our results show significant variation: 8% (max 22%)
average performance variation even though the GPU architecture and vendor SKU
are identical within each cluster, with outliers up to 1.5X slower than the
median GPU. These results highlight the difficulty in efficiently using
existing GPU clusters for modern HPC and scientific workloads, and the need to
embrace variability in future accelerator-based systems.
DOI: 10.48550/arxiv.2208.11035
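As a rough illustration of the kind of variation metric the abstract reports (performance relative to the median GPU in a cluster), the sketch below computes an average deviation and the worst-case slowdown from per-GPU runtimes. The GPU names, runtime values, and the exact definition of "average variation" used here are assumptions for illustration only, not the authors' actual measurement or analysis pipeline.

```python
import statistics

# Hypothetical per-GPU runtimes (seconds) for one benchmark run on
# identically specced GPUs within a single cluster; real data would
# come from a measurement campaign like the one the abstract describes.
runtimes = {
    "gpu-000": 101.2,
    "gpu-001": 98.7,
    "gpu-002": 104.9,
    "gpu-003": 99.5,
    "gpu-004": 131.0,  # a slow outlier
}

# Use the median GPU as the reference point.
median_rt = statistics.median(runtimes.values())

# Slowdown of each GPU relative to the median (1.0 == median speed).
slowdown = {gpu: rt / median_rt for gpu, rt in runtimes.items()}

# One possible "average variation" metric: mean absolute deviation
# from the median runtime, expressed as a percentage.
avg_variation = statistics.mean(
    abs(rt - median_rt) / median_rt for rt in runtimes.values()
) * 100

worst = max(slowdown, key=slowdown.get)
print(f"average variation vs. median: {avg_variation:.1f}%")
print(f"worst GPU: {worst} at {slowdown[worst]:.2f}x the median runtime")
```

Run across many applications and thousands of GPUs, statistics of this form would yield cluster-level figures comparable in spirit to the 8% average (22% maximum) variation and 1.5X outlier slowdowns reported in the abstract.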