A hierarchical parallel implementation for heterogeneous computing. Application to algebra-based CFD simulations on hybrid supercomputers

Detailed description

Bibliographic details
Published in:	Computers & fluids 2021-01, Vol.214, p.104768, Article 104768
Main authors:	Álvarez-Farré, Xavier, Gorobets, Andrey, Trias, F. Xavier
Format:	Article
Language:	English
Subjects:	Accelerators; Algorithms; Complexity; Computational efficiency; Computational fluid dynamics; Computer simulation; CPU+GPU; CUDA; Data exchange; Data structures; Fluid flow; Heterogeneous computing; Hybrid supercomputer; Incompressible flow; Kernels; Mathematical analysis; Mathematical models; Matrix algebra; Matrix methods; Microprocessors; MPI+OpenMP+OpenCL; Multiprocessing; Numerical methods; Parallel CFD; Portability; Processors; Simulation; SpMV; Subroutines; Supercomputers; Vectors (mathematics)
Online access:	Full text

Description
Summary:
• Algebra-based simulation approach for incompressible turbulent flows with heat transfer.
• Efficient heterogeneous execution of computing kernels with halo update on CPU+GPU.
• Overlap of computations and communications, multithreaded data exchange processing.
• NUMA-aware OpenMP parallelization for computing on manycore CPUs and managing devices.
• Detailed performance study of the SpMV kernel on various supercomputer architectures.

The quest for new portable implementations of simulation algorithms is motivated by the increasing variety of computing architectures. Moreover, the hybridization of high-performance computing systems imposes additional constraints, since heterogeneous computations are needed to efficiently engage processors and massively parallel accelerators. This, in turn, involves different parallel paradigms and computing frameworks and requires complex data exchanges between computing units. Typically, simulation codes rely on sophisticated data structures and computing subroutines, so-called kernels, which makes portability terribly cumbersome. Thus, a natural way to achieve portability is to dramatically reduce the complexity of both data structures and computing kernels. In our algebra-based approach, the scale-resolving simulation of incompressible turbulent flows on unstructured meshes relies on three fundamental kernels: the sparse matrix-vector product, the linear combination of vectors and the dot product. It is noteworthy that this approach is not limited to a particular kind of numerical method or a set of governing equations. In our code, an auto-balanced multilevel partitioning distributes workload among computing devices of various architectures. The overlap of computations and multistage communications efficiently hides the data exchange overhead in large-scale supercomputer simulations.
In addition to computing on accelerators, special attention is paid to efficiency on manycore processors in multiprocessor nodes with a significant non-uniform memory access (NUMA) factor. Parallel efficiency and performance are studied in detail for different execution modes on various supercomputers using up to 9,600 processor cores and up to 256 graphics processing units. The heterogeneous implementation model described in this work is a general-purpose approach that is well suited for various subroutines in numerical simulation codes.
ISSN:	0045-7930 1879-0747
DOI:	10.1016/j.compfluid.2020.104768
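
Illustrative note: the summary names three fundamental kernels (sparse matrix-vector product, linear combination of vectors, dot product). The following is a minimal serial sketch of what such kernels compute, not the authors' implementation. The CSR storage layout, the names CSRMatrix, spmv, axpy and dot, and the small tridiagonal test matrix are all assumptions chosen for illustration; the record does not describe the article's data structures, and the sketch leaves out the partitioning, halo updates and CPU+GPU execution.

```python
from typing import List, Sequence


class CSRMatrix:
    """Sparse matrix in compressed sparse row (CSR) form (assumed layout)."""

    def __init__(self, row_ptr: Sequence[int], col_idx: Sequence[int],
                 values: Sequence[float]) -> None:
        self.row_ptr = list(row_ptr)   # start of each row in col_idx/values
        self.col_idx = list(col_idx)   # column index of each stored entry
        self.values = list(values)     # value of each stored entry
        self.n_rows = len(self.row_ptr) - 1


def spmv(A: CSRMatrix, x: Sequence[float]) -> List[float]:
    """Sparse matrix-vector product y = A x."""
    y = [0.0] * A.n_rows
    for i in range(A.n_rows):
        acc = 0.0
        for k in range(A.row_ptr[i], A.row_ptr[i + 1]):
            acc += A.values[k] * x[A.col_idx[k]]
        y[i] = acc
    return y


def axpy(a: float, x: Sequence[float], y: Sequence[float]) -> List[float]:
    """Linear combination of vectors z = a*x + y."""
    return [a * xi + yi for xi, yi in zip(x, y)]


def dot(x: Sequence[float], y: Sequence[float]) -> float:
    """Dot product of two vectors."""
    return sum(xi * yi for xi, yi in zip(x, y))


# Hypothetical 3x3 tridiagonal matrix [[2,-1,0],[-1,2,-1],[0,-1,2]].
A = CSRMatrix(
    row_ptr=[0, 2, 5, 7],
    col_idx=[0, 1, 0, 1, 2, 1, 2],
    values=[2.0, -1.0, -1.0, 2.0, -1.0, -1.0, 2.0],
)
x = [1.0, 2.0, 3.0]
b = [1.0, 1.0, 1.0]

# Residual r = b - A x and its squared norm, built only from the three kernels.
Ax = spmv(A, x)
r = axpy(-1.0, Ax, b)
r_norm_sq = dot(r, r)
```

The residual computation at the end shows how a typical solver step can be composed from these three operations alone. That composition is what makes reducing a code to a small kernel set attractive for portability.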