Analyzing the Performance Portability of Tensor Decomposition

We employ pressure point analysis and roofline modeling to identify performance bottlenecks and determine an upper bound on the performance of the Canonical Polyadic Alternating Poisson Regression Multiplicative Update (CP-APR MU) algorithm in the SparTen software library. Our analyses reveal that a...

Bibliographic details
Published in: arXiv.org 2023-07
Main authors: Geronimo Anderson, S Isaac; Teranishi, Keita; Dunlavy, Daniel M; Choi, Jee
Format: Article
Language: eng
Online access: Full text
container_end_page
container_issue
container_start_page
container_title arXiv.org
container_volume
creator Geronimo Anderson, S Isaac
Teranishi, Keita
Dunlavy, Daniel M
Choi, Jee
description We employ pressure point analysis and roofline modeling to identify performance bottlenecks and determine an upper bound on the performance of the Canonical Polyadic Alternating Poisson Regression Multiplicative Update (CP-APR MU) algorithm in the SparTen software library. Our analyses reveal that a particular matrix computation, \(\Phi^{(n)}\), is the critical performance bottleneck in the SparTen CP-APR MU implementation. Moreover, we find that atomic operations are not a critical bottleneck while higher cache reuse can provide a non-trivial performance improvement. We also utilize grid search on the Kokkos library parallel policy parameters to achieve 2.25x average speedup over the SparTen default for \(\Phi^{(n)}\) computation on CPU and 1.70x on GPU. We conclude our investigations by comparing Kokkos implementations of the STREAM benchmark and the matricized tensor times Khatri-Rao product (MTTKRP) benchmark from the Parallel Sparse Tensor Algorithm (PASTA) benchmark suite to implementations using vendor libraries. We show that with a single implementation Kokkos achieves performance comparable to hand-tuned code for fundamental operations that make up tensor decomposition kernels on a wide range of CPU and GPU systems. Overall, we conclude that Kokkos demonstrates good performance portability for simple data-intensive operations but requires tuning for algorithms with more complex dependencies and data access patterns.
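The roofline modeling mentioned in the description bounds attainable performance by the smaller of peak compute throughput and memory bandwidth times arithmetic intensity. The following is a minimal sketch of that general bound, not the authors' model: the function name `roofline_bound` and the machine figures (1000 GFLOP/s peak, 100 GB/s bandwidth) are hypothetical values chosen for illustration, and the STREAM triad intensity assumes 8-byte doubles with no write-allocate traffic.

```python
def roofline_bound(peak_gflops, peak_bandwidth_gbs, arithmetic_intensity):
    """Upper bound on attainable GFLOP/s under the basic roofline model.

    arithmetic_intensity is in FLOPs per byte moved to or from memory.
    """
    return min(peak_gflops, peak_bandwidth_gbs * arithmetic_intensity)


# Hypothetical machine parameters (assumptions, not taken from the paper).
PEAK_GFLOPS = 1000.0
PEAK_BANDWIDTH_GBS = 100.0

# STREAM triad, a[i] = b[i] + s * c[i]: 2 FLOPs per element and
# 3 * 8 bytes of traffic per element (two loads, one store).
triad_intensity = 2.0 / 24.0

# Intensity at which a kernel stops being bandwidth-bound.
ridge_point = PEAK_GFLOPS / PEAK_BANDWIDTH_GBS

bound = roofline_bound(PEAK_GFLOPS, PEAK_BANDWIDTH_GBS, triad_intensity)
print(f"triad bound: {bound:.2f} GFLOP/s, ridge point: {ridge_point:.1f} FLOP/byte")
```

A kernel whose intensity falls below the ridge point is bandwidth-bound in this model, which is the typical situation for data-intensive operations such as STREAM.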
format Article
fullrecord <record><control><sourceid>proquest</sourceid><recordid>TN_cdi_proquest_journals_2835324201</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><sourcerecordid>2835324201</sourcerecordid><originalsourceid>FETCH-proquest_journals_28353242013</originalsourceid><addsrcrecordid>eNqNjbsKwjAUQIMgWLT_EHAupDeNdnEQHzg6dC-xpJqS5tbcdKhfbwc_wOmc4cBZsASkzLOyAFixlKgTQsBuD0rJhB2OXrvpY_2Tx5fhdxNaDL32zewYon5YZ-PEseWV8YSBn02D_YBko0W_YctWOzLpj2u2vV6q0y0bAr5HQ7HucAzzgWoopZJQgMjlf9UXsvw4Xg</addsrcrecordid><sourcetype>Aggregation Database</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>2835324201</pqid></control><display><type>article</type><title>Analyzing the Performance Portability of Tensor Decomposition</title><source>Free E- Journals</source><creator>Geronimo Anderson, S Isaac ; Teranishi, Keita ; Dunlavy, Daniel M ; Choi, Jee</creator><creatorcontrib>Geronimo Anderson, S Isaac ; Teranishi, Keita ; Dunlavy, Daniel M ; Choi, Jee</creatorcontrib><description>We employ pressure point analysis and roofline modeling to identify performance bottlenecks and determine an upper bound on the performance of the Canonical Polyadic Alternating Poisson Regression Multiplicative Update (CP-APR MU) algorithm in the SparTen software library. Our analyses reveal that a particular matrix computation, \(\Phi^{(n)}\), is the critical performance bottleneck in the SparTen CP-APR MU implementation. Moreover, we find that atomic operations are not a critical bottleneck while higher cache reuse can provide a non-trivial performance improvement. We also utilize grid search on the Kokkos library parallel policy parameters to achieve 2.25x average speedup over the SparTen default for \(\Phi^{(n)}\) computation on CPU and 1.70x on GPU. 
We conclude our investigations by comparing Kokkos implementations of the STREAM benchmark and the matricized tensor times Khatri-Rao product (MTTKRP) benchmark from the Parallel Sparse Tensor Algorithm (PASTA) benchmark suite to implementations using vendor libraries. We show that with a single implementation Kokkos achieves performance comparable to hand-tuned code for fundamental operations that make up tensor decomposition kernels on a wide range of CPU and GPU systems. Overall, we conclude that Kokkos demonstrates good performance portability for simple data-intensive operations but requires tuning for algorithms with more complex dependencies and data access patterns.</description><identifier>EISSN: 2331-8422</identifier><language>eng</language><publisher>Ithaca: Cornell University Library, arXiv.org</publisher><subject>Algorithms ; Benchmarks ; Computation ; Decomposition ; Libraries ; Portability ; Tensors ; Upper bounds</subject><ispartof>arXiv.org, 2023-07</ispartof><rights>2023. This work is published under http://arxiv.org/licenses/nonexclusive-distrib/1.0/ (the “License”). 
Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.</rights><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><link.rule.ids>780,784</link.rule.ids></links><search><creatorcontrib>Geronimo Anderson, S Isaac</creatorcontrib><creatorcontrib>Teranishi, Keita</creatorcontrib><creatorcontrib>Dunlavy, Daniel M</creatorcontrib><creatorcontrib>Choi, Jee</creatorcontrib><title>Analyzing the Performance Portability of Tensor Decomposition</title><title>arXiv.org</title><description>We employ pressure point analysis and roofline modeling to identify performance bottlenecks and determine an upper bound on the performance of the Canonical Polyadic Alternating Poisson Regression Multiplicative Update (CP-APR MU) algorithm in the SparTen software library. Our analyses reveal that a particular matrix computation, \(\Phi^{(n)}\), is the critical performance bottleneck in the SparTen CP-APR MU implementation. Moreover, we find that atomic operations are not a critical bottleneck while higher cache reuse can provide a non-trivial performance improvement. We also utilize grid search on the Kokkos library parallel policy parameters to achieve 2.25x average speedup over the SparTen default for \(\Phi^{(n)}\) computation on CPU and 1.70x on GPU. We conclude our investigations by comparing Kokkos implementations of the STREAM benchmark and the matricized tensor times Khatri-Rao product (MTTKRP) benchmark from the Parallel Sparse Tensor Algorithm (PASTA) benchmark suite to implementations using vendor libraries. We show that with a single implementation Kokkos achieves performance comparable to hand-tuned code for fundamental operations that make up tensor decomposition kernels on a wide range of CPU and GPU systems. 
Overall, we conclude that Kokkos demonstrates good performance portability for simple data-intensive operations but requires tuning for algorithms with more complex dependencies and data access patterns.</description><subject>Algorithms</subject><subject>Benchmarks</subject><subject>Computation</subject><subject>Decomposition</subject><subject>Libraries</subject><subject>Portability</subject><subject>Tensors</subject><subject>Upper bounds</subject><issn>2331-8422</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2023</creationdate><recordtype>article</recordtype><sourceid>ABUWG</sourceid><sourceid>AFKRA</sourceid><sourceid>AZQEC</sourceid><sourceid>BENPR</sourceid><sourceid>CCPQU</sourceid><sourceid>DWQXO</sourceid><recordid>eNqNjbsKwjAUQIMgWLT_EHAupDeNdnEQHzg6dC-xpJqS5tbcdKhfbwc_wOmc4cBZsASkzLOyAFixlKgTQsBuD0rJhB2OXrvpY_2Tx5fhdxNaDL32zewYon5YZ-PEseWV8YSBn02D_YBko0W_YctWOzLpj2u2vV6q0y0bAr5HQ7HucAzzgWoopZJQgMjlf9UXsvw4Xg</recordid><startdate>20230706</startdate><enddate>20230706</enddate><creator>Geronimo Anderson, S Isaac</creator><creator>Teranishi, Keita</creator><creator>Dunlavy, Daniel M</creator><creator>Choi, Jee</creator><general>Cornell University Library, arXiv.org</general><scope>8FE</scope><scope>8FG</scope><scope>ABJCF</scope><scope>ABUWG</scope><scope>AFKRA</scope><scope>AZQEC</scope><scope>BENPR</scope><scope>BGLVJ</scope><scope>CCPQU</scope><scope>DWQXO</scope><scope>HCIFZ</scope><scope>L6V</scope><scope>M7S</scope><scope>PIMPY</scope><scope>PQEST</scope><scope>PQQKQ</scope><scope>PQUKI</scope><scope>PRINS</scope><scope>PTHSS</scope></search><sort><creationdate>20230706</creationdate><title>Analyzing the Performance Portability of Tensor Decomposition</title><author>Geronimo Anderson, S Isaac ; Teranishi, Keita ; Dunlavy, Daniel M ; Choi, 
Jee</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-proquest_journals_28353242013</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2023</creationdate><topic>Algorithms</topic><topic>Benchmarks</topic><topic>Computation</topic><topic>Decomposition</topic><topic>Libraries</topic><topic>Portability</topic><topic>Tensors</topic><topic>Upper bounds</topic><toplevel>online_resources</toplevel><creatorcontrib>Geronimo Anderson, S Isaac</creatorcontrib><creatorcontrib>Teranishi, Keita</creatorcontrib><creatorcontrib>Dunlavy, Daniel M</creatorcontrib><creatorcontrib>Choi, Jee</creatorcontrib><collection>ProQuest SciTech Collection</collection><collection>ProQuest Technology Collection</collection><collection>Materials Science &amp; Engineering Collection</collection><collection>ProQuest Central (Alumni Edition)</collection><collection>ProQuest Central UK/Ireland</collection><collection>ProQuest Central Essentials</collection><collection>ProQuest Central</collection><collection>Technology Collection</collection><collection>ProQuest One Community College</collection><collection>ProQuest Central Korea</collection><collection>SciTech Premium Collection</collection><collection>ProQuest Engineering Collection</collection><collection>Engineering Database</collection><collection>Publicly Available Content Database</collection><collection>ProQuest One Academic Eastern Edition (DO NOT USE)</collection><collection>ProQuest One Academic</collection><collection>ProQuest One Academic UKI Edition</collection><collection>ProQuest Central China</collection><collection>Engineering Collection</collection></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>Geronimo Anderson, S Isaac</au><au>Teranishi, Keita</au><au>Dunlavy, Daniel M</au><au>Choi, Jee</au><format>book</format><genre>document</genre><ristype>GEN</ristype><atitle>Analyzing the 
Performance Portability of Tensor Decomposition</atitle><jtitle>arXiv.org</jtitle><date>2023-07-06</date><risdate>2023</risdate><eissn>2331-8422</eissn><abstract>We employ pressure point analysis and roofline modeling to identify performance bottlenecks and determine an upper bound on the performance of the Canonical Polyadic Alternating Poisson Regression Multiplicative Update (CP-APR MU) algorithm in the SparTen software library. Our analyses reveal that a particular matrix computation, \(\Phi^{(n)}\), is the critical performance bottleneck in the SparTen CP-APR MU implementation. Moreover, we find that atomic operations are not a critical bottleneck while higher cache reuse can provide a non-trivial performance improvement. We also utilize grid search on the Kokkos library parallel policy parameters to achieve 2.25x average speedup over the SparTen default for \(\Phi^{(n)}\) computation on CPU and 1.70x on GPU. We conclude our investigations by comparing Kokkos implementations of the STREAM benchmark and the matricized tensor times Khatri-Rao product (MTTKRP) benchmark from the Parallel Sparse Tensor Algorithm (PASTA) benchmark suite to implementations using vendor libraries. We show that with a single implementation Kokkos achieves performance comparable to hand-tuned code for fundamental operations that make up tensor decomposition kernels on a wide range of CPU and GPU systems. Overall, we conclude that Kokkos demonstrates good performance portability for simple data-intensive operations but requires tuning for algorithms with more complex dependencies and data access patterns.</abstract><cop>Ithaca</cop><pub>Cornell University Library, arXiv.org</pub><oa>free_for_read</oa></addata></record>
fulltext fulltext
identifier EISSN: 2331-8422
ispartof arXiv.org, 2023-07
issn 2331-8422
language eng
recordid cdi_proquest_journals_2835324201
source Free E- Journals
subjects Algorithms
Benchmarks
Computation
Decomposition
Libraries
Portability
Tensors
Upper bounds
title Analyzing the Performance Portability of Tensor Decomposition
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-13T22%3A54%3A26IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest&rft_val_fmt=info:ofi/fmt:kev:mtx:book&rft.genre=document&rft.atitle=Analyzing%20the%20Performance%20Portability%20of%20Tensor%20Decomposition&rft.jtitle=arXiv.org&rft.au=Geronimo%20Anderson,%20S%20Isaac&rft.date=2023-07-06&rft.eissn=2331-8422&rft_id=info:doi/&rft_dat=%3Cproquest%3E2835324201%3C/proquest%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=2835324201&rft_id=info:pmid/&rfr_iscdi=true