Optimizing the performance of streaming numerical kernels on the IBM Blue Gene/P PowerPC 450 processor

Bibliographic details
Published in:	The international journal of high performance computing applications 2013-05, Vol.27 (2), p.193-209
Main authors:	Malas, Tareq; Ahmadia, Aron J.; Brown, Jed; Gunnels, John A.; Keyes, David E.
Format:	Article
Language:	English
Subjects:	Applied sciences; Assembly; Cache; Central processing units; Computation; Computer science, control theory, systems; Computer systems and distributed systems. User interface; Construction; CPUs; Energy efficiency; Exact sciences and technology; Integrated circuits; Kernels; Mathematical models; Microprocessors; Optimization; Optimization techniques; Partial differential equations; Programming languages; Simulation; Software; Studies; Three dimensional

Description
Abstract:	Several emerging petascale architectures use energy-efficient processors with vectorized computational units and in-order thread processing. On these architectures the sustained performance of streaming numerical kernels, ubiquitous in the solution of partial differential equations, represents a challenge despite the regularity of memory access. Sophisticated optimization techniques are required to fully utilize the CPU. We propose a new method for constructing streaming numerical kernels using a high-level assembly synthesis and optimization framework. We describe an implementation of this method in Python targeting the IBM® Blue Gene®/P supercomputer’s PowerPC® 450 core. This paper details the high-level design, construction, simulation, verification, and analysis of these kernels utilizing a subset of the CPU’s instruction set. We demonstrate the effectiveness of our approach by implementing several three-dimensional stencil kernels over a variety of cached memory scenarios and analyzing the mechanically scheduled variants, including a 27-point stencil achieving a 1.7× speedup over the best previously published results.
ISSN:	1094-3420 1741-2846
DOI:	10.1177/1094342012444795