Optimizing High-Performance Linpack for Exascale Accelerated Architectures
Main authors: | , , , |
---|---|
Format: | Article |
Language: | eng |
Subjects: | |
Online access: | Order full text |
Summary: | We detail the performance optimizations made in rocHPL, AMD's open-source
implementation of the High-Performance Linpack (HPL) benchmark targeting
accelerated node architectures designed for exascale systems such as the
Frontier supercomputer. The implementation leverages the high-throughput GPU
accelerators on the node via highly optimized linear algebra libraries, as well
as the entire CPU socket to perform latency-sensitive factorization phases. We
detail novel performance improvements such as a multi-threaded approach to
computing the panel factorization phase on the CPU, time-sharing of CPU cores
between processes on the node, as well as several optimizations which hide MPI
communication. We present some performance results of this implementation of
the HPL benchmark on a single node of the Frontier early access cluster at Oak
Ridge National Laboratory, as well as scaling to multiple nodes. |
---|---|
DOI: | 10.48550/arxiv.2304.10397 |