NonGEMM Bench: Understanding the Performance Horizon of the Latest ML Workloads with NonGEMM Workloads
Machine Learning (ML) operators are the building blocks used to design ML models for various target applications. GEneral Matrix Multiplication (GEMM) operators are the backbone of ML models. They are notorious for being computationally expensive, requiring billions of multiply-and-accumulate (MAC) operations. Therefore,...
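The abstract distinguishes GEMM operators (dense matrix multiplications) from NonGEMM operators such as softmax, normalization, and data-movement ops. As a minimal illustration of that split, and not the paper's actual NonGEMM Bench harness, the sketch below uses the PyTorch profiler to bucket operator time into GEMM-like and NonGEMM categories for a toy Transformer layer; the model choice and the operator-name heuristic are assumptions made here purely for illustration.

```python
# Illustrative sketch only: NOT the paper's NonGEMM Bench code.
# It profiles a toy Transformer layer and splits self-time per operator
# into GEMM-like and NonGEMM buckets using a simple name heuristic.
import torch
import torch.nn as nn
from torch.profiler import profile, ProfilerActivity

# A toy Transformer block standing in for a "popular ML workload" (assumption).
model = nn.TransformerEncoderLayer(d_model=256, nhead=4, batch_first=True)
x = torch.randn(8, 128, 256)

# Heuristic operator-name hints treated as "GEMM-like" (assumption).
GEMM_HINTS = ("gemm", "addmm", "matmul", "mm", "bmm", "linear", "conv")

with profile(activities=[ProfilerActivity.CPU]) as prof:
    with torch.no_grad():
        model(x)

gemm_us, nongemm_us = 0.0, 0.0
for evt in prof.key_averages():
    # Anything not matching the GEMM hints (softmax, layer_norm,
    # dropout, reshapes, ...) is counted as NonGEMM.
    if any(h in evt.key.lower() for h in GEMM_HINTS):
        gemm_us += evt.self_cpu_time_total
    else:
        nongemm_us += evt.self_cpu_time_total

total = gemm_us + nongemm_us
print(f"GEMM-like time: {gemm_us / 1e3:.2f} ms ({100 * gemm_us / total:.1f}%)")
print(f"NonGEMM time:   {nongemm_us / 1e3:.2f} ms ({100 * nongemm_us / total:.1f}%)")
```

On a GPU system one would additionally pass ProfilerActivity.CUDA and read the CUDA self-times; the point here is only to make the GEMM/NonGEMM distinction concrete, not to reproduce the paper's methodology.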
Saved in:
Main authors: | Karami, Rachid; Moar, Chakshu; Kao, Sheng-Chun; Kwon, Hyoukjun |
---|---|
Format: | Article |
Language: | eng |
Subjects: | Computer Science - Hardware Architecture; Computer Science - Learning; Computer Science - Performance |
Online access: | Order full text |
creator | Karami, Rachid; Moar, Chakshu; Kao, Sheng-Chun; Kwon, Hyoukjun |
---|---|
description | Machine Learning (ML) operators are the building blocks used to design ML models
for various target applications. GEneral Matrix Multiplication (GEMM)
operators are the backbone of ML models. They are notorious for being
computationally expensive, requiring billions of multiply-and-accumulate (MAC)
operations. Therefore, significant effort has been put into studying and
optimizing GEMM operators in order to speed up the execution of ML models.
GPUs and accelerators are widely deployed to accelerate ML workloads by
optimizing the execution of GEMM operators. Nonetheless, the performance of
NonGEMM operators has not been studied as thoroughly as that of GEMMs.
Therefore, this paper describes NonGEMM Bench, a benchmark for studying NonGEMM
operators. We first construct NonGEMM Bench using popular ML workloads from
different domains, then perform case studies on GPU platforms of various grades
to analyze the behavior of NonGEMM operators in GPU-accelerated systems.
Finally, we present key takeaways to bridge the gap between GEMM and NonGEMM
operators and to offer the community potential new optimization directions. |
doi_str_mv | 10.48550/arxiv.2404.11788 |
format | Article |
creationdate | 2024-04-17 |
link | https://arxiv.org/abs/2404.11788 |
rights | http://creativecommons.org/licenses/by/4.0 |
fulltext | fulltext_linktorsrc |
identifier | DOI: 10.48550/arxiv.2404.11788 |
language | eng |
recordid | cdi_arxiv_primary_2404_11788 |
source | arXiv.org |
subjects | Computer Science - Hardware Architecture; Computer Science - Learning; Computer Science - Performance |
title | NonGEMM Bench: Understanding the Performance Horizon of the Latest ML Workloads with NonGEMM Workloads |
url | https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2024-12-25T10%3A35%3A26IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-arxiv_GOX&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=NonGEMM%20Bench:%20Understanding%20the%20Performance%20Horizon%20of%20the%20Latest%20ML%20Workloads%20with%20NonGEMM%20Workloads&rft.au=Karami,%20Rachid&rft.date=2024-04-17&rft_id=info:doi/10.48550/arxiv.2404.11788&rft_dat=%3Carxiv_GOX%3E2404_11788%3C/arxiv_GOX%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_id=info:pmid/&rfr_iscdi=true |