A hierarchical parallel implementation for heterogeneous computing. Application to algebra-based CFD simulations on hybrid supercomputers
Published in: | Computers & fluids 2021-01, Vol.214, p.104768, Article 104768 |
---|---|
Main authors: | , , |
Format: | Article |
Language: | eng |
Subjects: | |
Online access: | Full text |
Summary: | • Algebra-based simulation approach for incompressible turbulent flows with heat transfer. • Efficient heterogeneous execution of computing kernels with halo update on CPU+GPU. • Overlap of computations and communications, multithreaded data exchange processing. • NUMA-aware OpenMP parallelization for computing on manycore CPUs and managing devices. • Detailed performance study of the SpMV kernel on various supercomputer architectures.
The quest for new portable implementations of simulation algorithms is motivated by the increasing variety of computing architectures. Moreover, the hybridization of high-performance computing systems imposes additional constraints, since heterogeneous computations are needed to efficiently engage both processors and massively parallel accelerators. This, in turn, involves different parallel paradigms and computing frameworks and requires complex data exchanges between computing units. Typically, simulation codes rely on sophisticated data structures and computing subroutines, so-called kernels, which makes porting them very cumbersome. Thus, a natural way to achieve portability is to dramatically reduce the complexity of both data structures and computing kernels. In our algebra-based approach, the scale-resolving simulation of incompressible turbulent flows on unstructured meshes relies on three fundamental kernels: the sparse matrix-vector product, the linear combination of vectors, and the dot product. It is noteworthy that this approach is not limited to a particular kind of numerical method or set of governing equations. In our code, an auto-balanced multilevel partitioning distributes the workload among computing devices of various architectures. The overlap of computations and multistage communications efficiently hides the data exchange overhead in large-scale supercomputer simulations. In addition to computing on accelerators, special attention is paid to efficiency on manycore processors in multiprocessor nodes with a significant non-uniform memory access (NUMA) factor. Parallel efficiency and performance are studied in detail for different execution modes on various supercomputers using up to 9,600 processor cores and up to 256 graphics processing units. The heterogeneous implementation model described in this work is a general-purpose approach that is well suited for various subroutines in numerical simulation codes. |
---|---|
ISSN: | 0045-7930 1879-0747 |
DOI: | 10.1016/j.compfluid.2020.104768 |
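
The summary names three kernels as the basis of the algebra-based approach: the sparse matrix-vector product (SpMV), the linear combination of vectors, and the dot product. The following is a minimal serial sketch of those kernels in plain Python, not the authors' implementation. The CSR storage layout, the names `CSRMatrix`, `spmv`, `axpby`, `dot`, and the use of a conjugate gradient solver as the consumer of the kernels are all assumptions made for illustration; the record does not say which storage format or solver the paper uses.

```python
from typing import List, Sequence


class CSRMatrix:
    """Sparse matrix in compressed sparse row (CSR) form (assumed layout)."""

    def __init__(self, row_ptr: Sequence[int], col_idx: Sequence[int],
                 values: Sequence[float]) -> None:
        self.row_ptr = list(row_ptr)    # start offset of each row, length n_rows + 1
        self.col_idx = list(col_idx)    # column index of each stored entry
        self.values = list(values)      # value of each stored entry
        self.n_rows = len(self.row_ptr) - 1


def spmv(A: CSRMatrix, x: Sequence[float]) -> List[float]:
    """Kernel 1: sparse matrix-vector product y = A x."""
    y = [0.0] * A.n_rows
    for i in range(A.n_rows):
        acc = 0.0
        for k in range(A.row_ptr[i], A.row_ptr[i + 1]):
            acc += A.values[k] * x[A.col_idx[k]]
        y[i] = acc
    return y


def axpby(a: float, x: Sequence[float], b: float, y: Sequence[float]) -> List[float]:
    """Kernel 2: linear combination of vectors z = a x + b y."""
    return [a * xi + b * yi for xi, yi in zip(x, y)]


def dot(x: Sequence[float], y: Sequence[float]) -> float:
    """Kernel 3: dot product."""
    return sum(xi * yi for xi, yi in zip(x, y))


def conjugate_gradient(A: CSRMatrix, b: Sequence[float],
                       tol: float = 1e-10, max_iter: int = 100) -> List[float]:
    """Illustrative solver for symmetric positive definite A,
    written using only the three kernels above."""
    x = [0.0] * len(b)
    r = list(b)                 # residual b - A x for the zero initial guess
    p = list(r)                 # first search direction
    rr = dot(r, r)
    for _ in range(max_iter):
        if rr ** 0.5 < tol:
            break
        Ap = spmv(A, p)
        alpha = rr / dot(p, Ap)
        x = axpby(1.0, x, alpha, p)
        r = axpby(1.0, r, -alpha, Ap)
        rr_new = dot(r, r)
        p = axpby(1.0, r, rr_new / rr, p)
        rr = rr_new
    return x


# Usage: 3x3 one-dimensional Laplacian [[2,-1,0],[-1,2,-1],[0,-1,2]]
A = CSRMatrix(row_ptr=[0, 2, 5, 7],
              col_idx=[0, 1, 0, 1, 2, 1, 2],
              values=[2.0, -1.0, -1.0, 2.0, -1.0, -1.0, 2.0])
solution = conjugate_gradient(A, [1.0, 0.0, 1.0])
```

Because the solver touches the matrix and vectors only through `spmv`, `axpby`, and `dot`, porting it to another architecture would in principle require reimplementing just those three functions, which is the portability argument the summary makes.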