Energy efficiency of finite difference algorithms on multicore CPUs, GPUs, and Intel Xeon Phi processors
Saved in:
Main authors:
Format: Article
Language: English
Subjects:
Online access: Order full text
Abstract: In addition to hardware wall-time restrictions commonly seen in
high-performance computing systems, it is likely that future systems will also
be constrained by energy budgets. In the present work, finite difference
algorithms of varying computational and memory intensity are evaluated with
respect to both energy efficiency and runtime on an Intel Ivy Bridge CPU node,
an Intel Xeon Phi Knights Landing processor, and an NVIDIA Tesla K40c GPU. The
conventional approach of storing the discretised derivatives in global arrays
for solution advancement is found to be inefficient in terms of both energy
consumption and runtime. In contrast, a class of algorithms in which the
discretised derivatives are evaluated on-the-fly or stored as
thread-/process-local variables (yielding high compute intensity) is optimal
with respect to both energy consumption and runtime. On all three hardware
architectures considered, a speed-up of ~2 and an energy saving of ~2 are
observed for the highly compute-intensive algorithms compared to the
memory-intensive algorithm. The energy consumption is found to be proportional
to runtime, irrespective of the power consumed, and the GPU offers an energy
saving of ~5 compared to the same algorithm on a CPU node.
DOI: 10.48550/arxiv.1709.09713
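
The abstract contrasts two classes of finite difference algorithms: one that
stores the discretised derivatives in global arrays before advancing the
solution, and one that evaluates the derivatives on-the-fly in
thread-/process-local variables. The sketch below is a minimal C illustration
of that distinction for a 1D linear advection equation; the grid size N, the
wave speed c, the time step, and the second-order central-difference scheme
are illustrative assumptions and are not taken from the paper.

```c
#include <stdio.h>

#define N 1024

static double u[N], u_new[N], dudx[N];   /* solution, update, derivative work array */

/* Memory-intensive variant: central differences are first written to the
 * global array dudx, then read back in a second loop to advance the
 * solution, incurring extra global-memory traffic. */
static void advect_stored(double c, double dt, double dx)
{
    for (int i = 1; i < N - 1; ++i)
        dudx[i] = (u[i + 1] - u[i - 1]) / (2.0 * dx);
    for (int i = 1; i < N - 1; ++i)
        u_new[i] = u[i] - c * dt * dudx[i];
}

/* Compute-intensive variant: the derivative is evaluated on-the-fly as a
 * local temporary inside the update loop, so no derivative array is ever
 * written to or read back from global memory. */
static void advect_on_the_fly(double c, double dt, double dx)
{
    for (int i = 1; i < N - 1; ++i) {
        double d = (u[i + 1] - u[i - 1]) / (2.0 * dx);
        u_new[i] = u[i] - c * dt * d;
    }
}

int main(void)
{
    for (int i = 0; i < N; ++i)
        u[i] = (double)i / N;              /* arbitrary initial field */

    /* Both variants produce the same u_new; they differ only in the
     * amount of global-memory traffic per grid point. */
    advect_stored(1.0, 1e-3, 1.0 / N);
    advect_on_the_fly(1.0, 1e-3, 1.0 / N);

    printf("u_new[N/2] = %f\n", u_new[N / 2]);
    return 0;
}
```

In this toy form the on-the-fly variant simply fuses the two loops; in the
paper's setting the same idea removes whole derivative arrays from global
memory, which is what drives the reported runtime and energy savings.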