Batched matrix computations on hardware accelerators based on GPUs

Scientific applications require solvers that work on many small-size problems that are independent of each other. At the same time, high-end hardware evolves rapidly and becomes ever more throughput-oriented, so there is an increasing need for an effective approach to developing energy-efficient, high-performance codes for these small matrix problems, which the authors call batched factorizations. The many applications that need this functionality could especially benefit from the use of GPUs, which currently are four to five times more energy-efficient than multicore CPUs on important scientific workloads. This paper, consequently, describes the development of the most common one-sided factorizations, Cholesky, LU, and QR, for a set of small dense matrices. The algorithms the authors present, together with their implementations, are by design inherently parallel. Their approach is more efficient than a combination of multicore CPUs and GPUs for the problem sizes of interest in the application use cases.
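The batched-factorization idea the abstract describes, a single operation that factors many small independent matrices at once rather than one at a time, can be sketched on the CPU with NumPy's stacked linear algebra. This is a minimal illustration of the concept only, not the authors' GPU implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# A "batch" of many small, independent matrices (here 1000 of size 8x8),
# the kind of workload batched factorizations target.
n, batch = 8, 1000
A = rng.standard_normal((batch, n, n))

# Make each matrix in the batch symmetric positive definite so that a
# Cholesky factorization exists: A @ A^T is PSD, and adding n*I makes it PD.
spd = A @ A.transpose(0, 2, 1) + n * np.eye(n)

# NumPy applies Cholesky across the leading batch dimension in one call,
# mirroring (on the CPU) the batched-kernel idea used on GPUs.
L = np.linalg.cholesky(spd)

# Verify the factorization for the whole batch: L @ L^T == spd, elementwise.
assert np.allclose(L @ L.transpose(0, 2, 1), spd)
```

On a GPU, the analogous pattern is one kernel launch that processes all matrices in the batch concurrently, amortizing launch overhead that would dominate if each tiny factorization were dispatched separately.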

Bibliographic details
Published in: The international journal of high performance computing applications, 2015-06, Vol. 29 (2), p. 193
Authors: Haidar, Azzam; Dong, Tingxing; Luszczek, Piotr; Tomov, Stanimire; Dongarra, Jack
Format: Article
Language: English
Online access: Full text
Publisher: London: SAGE Publications
DOI: 10.1177/1094342014567546
ISSN: 1094-3420
EISSN: 1741-2846
Subjects: Algorithms; batched factorization; Computer peripherals; hardware accelerators; High performance computing; Integrated circuits; MATHEMATICS AND COMPUTING; numerical linear algebra; numerical software libraries; one-sided factorization algorithms; Problem solving; Studies
Source: Access via SAGE; Alma/SFX Local Collection