Pushing the limits for medical image reconstruction on recent standard multicore processors

Bibliographic details
Published in:	The international journal of high performance computing applications 2013-05, Vol.27 (2), p.162-177
Main authors:	Treibig, Jan; Hager, Georg; Hofmann, Hannes G.; Hornegger, Joachim; Wellein, Gerhard
Format:	Article
Language:	English
Subjects:	Algorithms; Applied sciences; Artificial intelligence; Assembly language; Benchmarking; Biological and medical sciences; Central processing units; Computation; Computer science control theory systems; Computer systems and distributed systems. User interface; CPUs; Exact sciences and technology; Hardware; High performance computing; Integrated circuits; Investigative techniques, diagnostic techniques (general aspects); Language processing and microprogramming; Medical sciences; Microprocessors; Optimization; Optimization algorithms; Pattern recognition. Digital image processing. Computational geometry; Processors; Software; Studies; Tomography
Online access:	Full text

Description
Abstract:	Volume reconstruction by backprojection is the computational bottleneck in many interventional clinical computed tomography (CT) applications. Today vendors in this field replace special purpose hardware accelerators with standard hardware such as multicore chips and GPGPUs. Medical imaging algorithms are on the verge of employing high-performance computing (HPC) technology, and are therefore an interesting new candidate for optimization. This paper presents low-level optimizations for the backprojection algorithm, guided by a thorough performance analysis on four generations of Intel multicore processors (Harpertown, Westmere, Westmere EX, and Sandy Bridge). We choose the RabbitCT benchmark, a standardized testcase well supported in industry, to ensure transparent and comparable results. Our aim is to provide not only the fastest possible implementation but also compare with performance models and hardware counter data in order to fully understand the results. We separate the influence of algorithmic optimizations, parallelization, SIMD vectorization, and microarchitectural issues and pinpoint problems with current SIMD instruction set extensions on standard CPUs (SSE, AVX). The use of assembly language is mandatory for best performance. Finally, we compare our results to the best GPGPU implementations available for this open competition benchmark.
ISSN:	1094-3420, 1741-2846
DOI:	10.1177/1094342012442424