Batched matrix computations on hardware accelerators based on GPUs

Scientific applications require solvers that work on many small-size problems that are independent of each other. At the same time, high-end hardware evolves rapidly and becomes ever more throughput-oriented, so there is an increasing need for an effective approach to developing energy-efficient, high-performance codes for these small matrix problems, which the authors call batched factorizations. The many applications that need this functionality could especially benefit from the use of GPUs, which currently are four to five times more energy-efficient than multicore CPUs on important scientific workloads. This paper, consequently, describes the development of the most common one-sided factorizations, Cholesky, LU, and QR, for a set of small dense matrices. The algorithms the authors present, together with their implementations, are by design inherently parallel. Their approach is more efficient than a combination of multicore CPUs and GPUs for the problem sizes of interest in the application use cases.
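The batched-factorization idea the abstract describes, a single operation that factors many small independent matrices at once rather than one at a time, can be sketched on the CPU with NumPy's stacked linear algebra. This is a minimal illustration of the concept only, not the authors' GPU implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# A "batch" of many small, independent matrices (here 1000 of size 8x8),
# the kind of workload batched factorizations target.
n, batch = 8, 1000
A = rng.standard_normal((batch, n, n))

# Make each matrix in the batch symmetric positive definite so that a
# Cholesky factorization exists: A @ A^T is PSD, and adding n*I makes it PD.
spd = A @ A.transpose(0, 2, 1) + n * np.eye(n)

# NumPy applies Cholesky across the leading batch dimension in one call,
# mirroring (on the CPU) the batched-kernel idea used on GPUs.
L = np.linalg.cholesky(spd)

# Verify the factorization for the whole batch: L @ L^T == spd, elementwise.
assert np.allclose(L @ L.transpose(0, 2, 1), spd)
```

On a GPU, the analogous pattern is one kernel launch that processes all matrices in the batch concurrently, amortizing launch overhead that would dominate if each tiny factorization were dispatched separately.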

Bibliographic details
Published in: The international journal of high performance computing applications, 2015-06, Vol. 29 (2), p. 193
Authors: Haidar, Azzam; Dong, Tingxing; Luszczek, Piotr; Tomov, Stanimire; Dongarra, Jack
Format: Article
Language: English
Online access: Full text
Publisher: London: SAGE Publications
DOI: 10.1177/1094342014567546
ISSN: 1094-3420
EISSN: 1741-2846
Subjects: Algorithms; batched factorization; Computer peripherals; hardware accelerators; High performance computing; Integrated circuits; MATHEMATICS AND COMPUTING; numerical linear algebra; numerical software libraries; one-sided factorization algorithms; Problem solving; Studies
Source: Access via SAGE; Alma/SFX Local Collection