Optimizing the performance of streaming numerical kernels on the IBM Blue Gene/P PowerPC 450 processor

Several emerging petascale architectures use energy-efficient processors with vectorized computational units and in-order thread processing. On these architectures the sustained performance of streaming numerical kernels, ubiquitous in the solution of partial differential equations, represents a cha...

Bibliographic details
Published in: The international journal of high performance computing applications 2013-05, Vol.27 (2), p.193-209
Main authors: Malas, Tareq; Ahmadia, Aron J.; Brown, Jed; Gunnels, John A.; Keyes, David E.
Format: Article
Language: English
Subjects:
Online access: Full text
container_end_page 209
container_issue 2
container_start_page 193
container_title The international journal of high performance computing applications
container_volume 27
creator Malas, Tareq
Ahmadia, Aron J.
Brown, Jed
Gunnels, John A.
Keyes, David E.
description Several emerging petascale architectures use energy-efficient processors with vectorized computational units and in-order thread processing. On these architectures the sustained performance of streaming numerical kernels, ubiquitous in the solution of partial differential equations, represents a challenge despite the regularity of memory access. Sophisticated optimization techniques are required to fully utilize the CPU. We propose a new method for constructing streaming numerical kernels using a high-level assembly synthesis and optimization framework. We describe an implementation of this method in Python targeting the IBM® Blue Gene®/P supercomputer’s PowerPC® 450 core. This paper details the high-level design, construction, simulation, verification, and analysis of these kernels utilizing a subset of the CPU’s instruction set. We demonstrate the effectiveness of our approach by implementing several three-dimensional stencil kernels over a variety of cached memory scenarios and analyzing the mechanically scheduled variants, including a 27-point stencil achieving a 1.7 × speedup over the best previously published results.
doi_str_mv 10.1177/1094342012444795
format Article
fullrecord <record><control><sourceid>proquest_cross</sourceid><recordid>TN_cdi_proquest_miscellaneous_1777997627</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><sage_id>10.1177_1094342012444795</sage_id><sourcerecordid>1365157121</sourcerecordid><originalsourceid>FETCH-LOGICAL-c405t-a64e7f097bfeb08a32da99936a421f245d78435c57e8e33ff56869a4289e99243</originalsourceid><addsrcrecordid>eNqFkc1LAzEQxRdRsFbvHgMieFlNsvk82qK1UGkPel7S7aRu3d3UZBfRv97UFpGCeErg_ebNzJskOSf4mhApbwjWLGMUE8oYk5ofJD0iGUmpYuIw_qOcbvTj5CSEFcZYsIz3Ejtdt2VdfpbNErUvgNbgrfO1aQpAzqLQejD1Rmy6GnxZmAq9gm-gCsg13xXjwSMaVB2gETRwM0Mz9w5-NkSMY7T2roAQnD9NjqypApzt3n7yfH_3NHxIJ9PReHg7SQuGeZsawUBarOXcwhwrk9GF0VpnwjBKLGV8IVUcu-ASFGSZtVwooaOoNGhNWdZPrra-sfNbB6HN6zIUUFWmAdeFPCYltZaCyv_RTHDCJaEkohd76Mp1vomLRIppqeJQOlJ4SxXeheDB5mtf1sZ_5ATnmxvl-zeKJZc7YxNitNbH3MvwUxenpEQoFbl0ywWzhF_N__L9AtkFms0</addsrcrecordid><sourcetype>Aggregation Database</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>1349788439</pqid></control><display><type>article</type><title>Optimizing the performance of streaming numerical kernels on the IBM Blue Gene/P PowerPC 450 processor</title><source>SAGE Journals</source><source>Alma/SFX Local Collection</source><creator>Malas, Tareq ; Ahmadia, Aron J. ; Brown, Jed ; Gunnels, John A. ; Keyes, David E.</creator><creatorcontrib>Malas, Tareq ; Ahmadia, Aron J. ; Brown, Jed ; Gunnels, John A. ; Keyes, David E.</creatorcontrib><description>Several emerging petascale architectures use energy-efficient processors with vectorized computational units and in-order thread processing. On these architectures the sustained performance of streaming numerical kernels, ubiquitous in the solution of partial differential equations, represents a challenge despite the regularity of memory access. Sophisticated optimization techniques are required to fully utilize the CPU. 
We propose a new method for constructing streaming numerical kernels using a high-level assembly synthesis and optimization framework. We describe an implementation of this method in Python targeting the IBM® Blue Gene®/P supercomputer’s PowerPC® 450 core. This paper details the high-level design, construction, simulation, verification, and analysis of these kernels utilizing a subset of the CPU’s instruction set. We demonstrate the effectiveness of our approach by implementing several three-dimensional stencil kernels over a variety of cached memory scenarios and analyzing the mechanically scheduled variants, including a 27-point stencil achieving a 1.7 × speedup over the best previously published results.</description><identifier>ISSN: 1094-3420</identifier><identifier>EISSN: 1741-2846</identifier><identifier>DOI: 10.1177/1094342012444795</identifier><language>eng</language><publisher>London, England: SAGE Publications</publisher><subject>Applied sciences ; Assembly ; Cache ; Central processing units ; Computation ; Computer science; control theory; systems ; Computer systems and distributed systems. User interface ; Construction ; CPUs ; Energy efficiency ; Exact sciences and technology ; Integrated circuits ; Kernels ; Mathematical models ; Microprocessors ; Optimization ; Optimization techniques ; Partial differential equations ; Programming languages ; Simulation ; Software ; Studies ; Three dimensional</subject><ispartof>The international journal of high performance computing applications, 2013-05, Vol.27 (2), p.193-209</ispartof><rights>The Author(s) 2012</rights><rights>2014 INIST-CNRS</rights><rights>Copyright SAGE PUBLICATIONS, INC. 
May 2013</rights><lds50>peer_reviewed</lds50><woscitedreferencessubscribed>false</woscitedreferencessubscribed><citedby>FETCH-LOGICAL-c405t-a64e7f097bfeb08a32da99936a421f245d78435c57e8e33ff56869a4289e99243</citedby><cites>FETCH-LOGICAL-c405t-a64e7f097bfeb08a32da99936a421f245d78435c57e8e33ff56869a4289e99243</cites></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><linktopdf>$$Uhttps://journals.sagepub.com/doi/pdf/10.1177/1094342012444795$$EPDF$$P50$$Gsage$$H</linktopdf><linktohtml>$$Uhttps://journals.sagepub.com/doi/10.1177/1094342012444795$$EHTML$$P50$$Gsage$$H</linktohtml><link.rule.ids>314,776,780,21798,27901,27902,43597,43598</link.rule.ids><backlink>$$Uhttp://pascal-francis.inist.fr/vibad/index.php?action=getRecordDetail&amp;idt=27321688$$DView record in Pascal Francis$$Hfree_for_read</backlink></links><search><creatorcontrib>Malas, Tareq</creatorcontrib><creatorcontrib>Ahmadia, Aron J.</creatorcontrib><creatorcontrib>Brown, Jed</creatorcontrib><creatorcontrib>Gunnels, John A.</creatorcontrib><creatorcontrib>Keyes, David E.</creatorcontrib><title>Optimizing the performance of streaming numerical kernels on the IBM Blue Gene/P PowerPC 450 processor</title><title>The international journal of high performance computing applications</title><description>Several emerging petascale architectures use energy-efficient processors with vectorized computational units and in-order thread processing. On these architectures the sustained performance of streaming numerical kernels, ubiquitous in the solution of partial differential equations, represents a challenge despite the regularity of memory access. Sophisticated optimization techniques are required to fully utilize the CPU. We propose a new method for constructing streaming numerical kernels using a high-level assembly synthesis and optimization framework. 
We describe an implementation of this method in Python targeting the IBM® Blue Gene®/P supercomputer’s PowerPC® 450 core. This paper details the high-level design, construction, simulation, verification, and analysis of these kernels utilizing a subset of the CPU’s instruction set. We demonstrate the effectiveness of our approach by implementing several three-dimensional stencil kernels over a variety of cached memory scenarios and analyzing the mechanically scheduled variants, including a 27-point stencil achieving a 1.7 × speedup over the best previously published results.</description><subject>Applied sciences</subject><subject>Assembly</subject><subject>Cache</subject><subject>Central processing units</subject><subject>Computation</subject><subject>Computer science; control theory; systems</subject><subject>Computer systems and distributed systems. User interface</subject><subject>Construction</subject><subject>CPUs</subject><subject>Energy efficiency</subject><subject>Exact sciences and technology</subject><subject>Integrated circuits</subject><subject>Kernels</subject><subject>Mathematical models</subject><subject>Microprocessors</subject><subject>Optimization</subject><subject>Optimization techniques</subject><subject>Partial differential equations</subject><subject>Programming languages</subject><subject>Simulation</subject><subject>Software</subject><subject>Studies</subject><subject>Three 
dimensional</subject><issn>1094-3420</issn><issn>1741-2846</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2013</creationdate><recordtype>article</recordtype><recordid>eNqFkc1LAzEQxRdRsFbvHgMieFlNsvk82qK1UGkPel7S7aRu3d3UZBfRv97UFpGCeErg_ebNzJskOSf4mhApbwjWLGMUE8oYk5ofJD0iGUmpYuIw_qOcbvTj5CSEFcZYsIz3Ejtdt2VdfpbNErUvgNbgrfO1aQpAzqLQejD1Rmy6GnxZmAq9gm-gCsg13xXjwSMaVB2gETRwM0Mz9w5-NkSMY7T2roAQnD9NjqypApzt3n7yfH_3NHxIJ9PReHg7SQuGeZsawUBarOXcwhwrk9GF0VpnwjBKLGV8IVUcu-ASFGSZtVwooaOoNGhNWdZPrra-sfNbB6HN6zIUUFWmAdeFPCYltZaCyv_RTHDCJaEkohd76Mp1vomLRIppqeJQOlJ4SxXeheDB5mtf1sZ_5ATnmxvl-zeKJZc7YxNitNbH3MvwUxenpEQoFbl0ywWzhF_N__L9AtkFms0</recordid><startdate>20130501</startdate><enddate>20130501</enddate><creator>Malas, Tareq</creator><creator>Ahmadia, Aron J.</creator><creator>Brown, Jed</creator><creator>Gunnels, John A.</creator><creator>Keyes, David E.</creator><general>SAGE Publications</general><general>Sage Publications</general><general>SAGE PUBLICATIONS, INC</general><scope>IQODW</scope><scope>AAYXX</scope><scope>CITATION</scope><scope>7SC</scope><scope>8FD</scope><scope>JQ2</scope><scope>L7M</scope><scope>L~C</scope><scope>L~D</scope><scope>FR3</scope><scope>P64</scope><scope>RC3</scope></search><sort><creationdate>20130501</creationdate><title>Optimizing the performance of streaming numerical kernels on the IBM Blue Gene/P PowerPC 450 processor</title><author>Malas, Tareq ; Ahmadia, Aron J. ; Brown, Jed ; Gunnels, John A. 
; Keyes, David E.</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-c405t-a64e7f097bfeb08a32da99936a421f245d78435c57e8e33ff56869a4289e99243</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2013</creationdate><topic>Applied sciences</topic><topic>Assembly</topic><topic>Cache</topic><topic>Central processing units</topic><topic>Computation</topic><topic>Computer science; control theory; systems</topic><topic>Computer systems and distributed systems. User interface</topic><topic>Construction</topic><topic>CPUs</topic><topic>Energy efficiency</topic><topic>Exact sciences and technology</topic><topic>Integrated circuits</topic><topic>Kernels</topic><topic>Mathematical models</topic><topic>Microprocessors</topic><topic>Optimization</topic><topic>Optimization techniques</topic><topic>Partial differential equations</topic><topic>Programming languages</topic><topic>Simulation</topic><topic>Software</topic><topic>Studies</topic><topic>Three dimensional</topic><toplevel>peer_reviewed</toplevel><toplevel>online_resources</toplevel><creatorcontrib>Malas, Tareq</creatorcontrib><creatorcontrib>Ahmadia, Aron J.</creatorcontrib><creatorcontrib>Brown, Jed</creatorcontrib><creatorcontrib>Gunnels, John A.</creatorcontrib><creatorcontrib>Keyes, David E.</creatorcontrib><collection>Pascal-Francis</collection><collection>CrossRef</collection><collection>Computer and Information Systems Abstracts</collection><collection>Technology Research Database</collection><collection>ProQuest Computer Science Collection</collection><collection>Advanced Technologies Database with Aerospace</collection><collection>Computer and Information Systems Abstracts – Academic</collection><collection>Computer and Information Systems Abstracts Professional</collection><collection>Engineering Research Database</collection><collection>Biotechnology and BioEngineering Abstracts</collection><collection>Genetics 
Abstracts</collection><jtitle>The international journal of high performance computing applications</jtitle></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>Malas, Tareq</au><au>Ahmadia, Aron J.</au><au>Brown, Jed</au><au>Gunnels, John A.</au><au>Keyes, David E.</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>Optimizing the performance of streaming numerical kernels on the IBM Blue Gene/P PowerPC 450 processor</atitle><jtitle>The international journal of high performance computing applications</jtitle><date>2013-05-01</date><risdate>2013</risdate><volume>27</volume><issue>2</issue><spage>193</spage><epage>209</epage><pages>193-209</pages><issn>1094-3420</issn><eissn>1741-2846</eissn><abstract>Several emerging petascale architectures use energy-efficient processors with vectorized computational units and in-order thread processing. On these architectures the sustained performance of streaming numerical kernels, ubiquitous in the solution of partial differential equations, represents a challenge despite the regularity of memory access. Sophisticated optimization techniques are required to fully utilize the CPU. We propose a new method for constructing streaming numerical kernels using a high-level assembly synthesis and optimization framework. We describe an implementation of this method in Python targeting the IBM® Blue Gene®/P supercomputer’s PowerPC® 450 core. This paper details the high-level design, construction, simulation, verification, and analysis of these kernels utilizing a subset of the CPU’s instruction set. 
We demonstrate the effectiveness of our approach by implementing several three-dimensional stencil kernels over a variety of cached memory scenarios and analyzing the mechanically scheduled variants, including a 27-point stencil achieving a 1.7 × speedup over the best previously published results.</abstract><cop>London, England</cop><pub>SAGE Publications</pub><doi>10.1177/1094342012444795</doi><tpages>17</tpages></addata></record>
fulltext fulltext
identifier ISSN: 1094-3420
ispartof The international journal of high performance computing applications, 2013-05, Vol.27 (2), p.193-209
issn 1094-3420
1741-2846
language eng
recordid cdi_proquest_miscellaneous_1777997627
source SAGE Journals; Alma/SFX Local Collection
subjects Applied sciences
Assembly
Cache
Central processing units
Computation
Computer science; control theory; systems
Computer systems and distributed systems. User interface
Construction
CPUs
Energy efficiency
Exact sciences and technology
Integrated circuits
Kernels
Mathematical models
Microprocessors
Optimization
Optimization techniques
Partial differential equations
Programming languages
Simulation
Software
Studies
Three dimensional
title Optimizing the performance of streaming numerical kernels on the IBM Blue Gene/P PowerPC 450 processor
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-02-19T00%3A43%3A13IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_cross&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Optimizing%20the%20performance%20of%20streaming%20numerical%20kernels%20on%20the%20IBM%20Blue%20Gene/P%20PowerPC%20450%20processor&rft.jtitle=The%20international%20journal%20of%20high%20performance%20computing%20applications&rft.au=Malas,%20Tareq&rft.date=2013-05-01&rft.volume=27&rft.issue=2&rft.spage=193&rft.epage=209&rft.pages=193-209&rft.issn=1094-3420&rft.eissn=1741-2846&rft_id=info:doi/10.1177/1094342012444795&rft_dat=%3Cproquest_cross%3E1365157121%3C/proquest_cross%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=1349788439&rft_id=info:pmid/&rft_sage_id=10.1177_1094342012444795&rfr_iscdi=true