Performance Optimization of Multithreaded 2D Fast Fourier Transform on Multicore Processors Using Load Imbalancing Parallel Computing Method

Bibliographic details
Published in: IEEE Access, 2018, Vol. 6, pp. 64202-64224
Main authors: Khokhriakov, Semyon; Manumachu, Ravi Reddy; Lastovetsky, Alexey
Format: Article
Language: English
Description
Abstract: Fast Fourier transform (FFT) is a key routine employed in application domains such as molecular dynamics, computational fluid dynamics, signal processing, image processing, and condition monitoring systems. Its performance on modern multicore platforms is therefore of paramount concern to the high-performance computing community. The inherent complexities of these platforms, such as severe resource contention and non-uniform memory access, however, pose formidable challenges. We study the performance profiles of multithreaded 2D FFTs provided in three highly optimized packages, FFTW-2.1.5, FFTW-3.3.7, and Intel Math Kernel Library (Intel MKL) FFT, on a modern Intel Haswell multicore processor consisting of 36 cores. We show that all three routines exhibit drastic performance variations, and hence their average performances are considerably lower than their peak performances. The ratios of average-to-peak performance for the 2D FFT routines from the three packages are 40%, 30%, and 24%. We conclude that improving the average performance of 2D FFT on modern multicore processors by removing performance variations constitutes a tremendous research challenge. To address this challenge, we propose two novel optimization methods, PFFT-FPM and PFFT-FPM-PAD, specifically designed and implemented for 2D FFT. The methods employ model-based parallel computing using a load-imbalancing technique. They take as input the discrete 3D functions of processor performance against problem size, compute the 2D DFT of a complex N × N signal matrix using p abstract processors, and output the transformed signal matrix. Based on our experiments on a modern Intel Haswell multicore server consisting of 36 physical cores, the average and maximum speedups observed for PFFT-FPM using FFTW-3.3.7 are 1.9× and 6.8×, and the average and maximum speedups observed using Intel MKL FFT are 1.3× and 2×.
The average and maximum speedups observed for PFFT-FPM-PAD using FFTW-3.3.7 are
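The abstract describes the method only at a high level. The Python code below is a minimal sketch built on two assumptions. First, it uses the standard row-column decomposition of a 2D DFT. Second, it replaces the paper's discrete 3D performance functions with a hypothetical per-processor table of predicted times against the number of rows, for a fixed N.

The code has two parts:

* It distributes the N rows among p abstract processors so that the maximum predicted time is as small as possible. This is done with a simple dynamic program, which is not necessarily the partitioning algorithm used in PFFT-FPM.
* It then computes the 2D DFT block by block.

The naive `dft_1d` stands in for an optimized 1D FFT such as FFTW or Intel MKL. The names `partition_rows`, `fft2d_partitioned`, and `dft_2d_direct`, and the time tables, are all illustrative.

```python
import cmath


def dft_1d(x):
    """Naive O(n^2) 1D DFT; a stand-in for an optimized 1D FFT (e.g. FFTW, Intel MKL)."""
    n = len(x)
    return [sum(x[k] * cmath.exp(-2j * cmath.pi * j * k / n) for k in range(n))
            for j in range(n)]


def partition_rows(n, time_tables):
    """Distribute n rows over p processors, minimizing the maximum predicted time.

    time_tables[i][d] is a hypothetical predicted time for processor i to
    transform d rows, for d = 0..n. Returns (distribution, makespan).
    O(p * n^2) dynamic program; it does not assume the profiles are monotone.
    """
    p = len(time_tables)
    inf = float("inf")
    # best[k][r]: smallest achievable makespan when the first k processors get r rows
    best = [[inf] * (n + 1) for _ in range(p + 1)]
    choice = [[0] * (n + 1) for _ in range(p + 1)]
    best[0][0] = 0.0
    for k in range(1, p + 1):
        t = time_tables[k - 1]
        for r in range(n + 1):
            for d in range(r + 1):
                cand = max(best[k - 1][r - d], t[d])
                if cand < best[k][r]:
                    best[k][r], choice[k][r] = cand, d
    # Walk the recorded choices backwards to recover each processor's row count
    dist, r = [], n
    for k in range(p, 0, -1):
        dist.append(choice[k][r])
        r -= choice[k][r]
    return dist[::-1], best[p][n]


def transpose(m):
    return [list(col) for col in zip(*m)]


def fft2d_partitioned(matrix, dist):
    """Row-column 2D DFT of an N x N matrix; processor i owns a contiguous block of dist[i] rows."""
    def row_phase(m):
        out, start = [], 0
        for d in dist:  # each block would run on its own abstract processor
            out.extend(dft_1d(row) for row in m[start:start + d])
            start += d
        return out

    m = row_phase(matrix)          # 1D DFTs along the rows
    m = row_phase(transpose(m))    # 1D DFTs along the original columns
    return transpose(m)


def dft_2d_direct(m):
    """Reference 2D DFT computed straight from the definition."""
    n = len(m)
    return [[sum(m[x][y] * cmath.exp(-2j * cmath.pi * (u * x + v * y) / n)
                 for x in range(n) for y in range(n))
             for v in range(n)] for u in range(n)]


# Hypothetical time profiles for p = 3 processors and N = 8 rows
N = 8
t_fast = [1.0 * d for d in range(N + 1)]
t_slow = [2.0 * d for d in range(N + 1)]
t_var = [2.0 * d for d in range(N + 1)]
t_var[2] = 20.0  # hypothetical performance drop at exactly 2 rows

dist, makespan = partition_rows(N, [t_fast, t_slow, t_var])
print(dist, makespan)  # [5, 2, 1] 5.0

signal = [[complex((3 * x + y) % 5, x - y) for y in range(N)] for x in range(N)]
out = fft2d_partitioned(signal, dist)
ref = dft_2d_direct(signal)
print(max(abs(out[u][v] - ref[u][v]) for u in range(N) for v in range(N)) < 1e-9)  # True
```

In these made-up profiles, the third processor's time spikes at two rows. The best distribution is therefore [5, 2, 1], with predicted times of 5, 4, and 2. The load is deliberately unbalanced, because giving the third processor two rows would cost far more than shifting that work elsewhere. This is the kind of distribution a load-imbalancing method can select when performance varies with problem size. The final check confirms that transforming uneven row blocks still yields the exact 2D DFT.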
ISSN: 2169-3536
DOI: 10.1109/ACCESS.2018.2878271