NonGEMM Bench: Understanding the Performance Horizon of the Latest ML Workloads with NonGEMM Workloads

Machine Learning (ML) operators are the building blocks for designing ML models targeting various applications. GEneral Matrix Multiplication (GEMM) operators are the backbone of ML models; they are notorious for being computationally expensive, requiring billions of multiply-and-accumulate operations. Therefore, significant effort has been put into studying and optimizing GEMM operators to speed up the execution of ML models, and GPUs and accelerators are widely deployed to accelerate ML workloads by optimizing the execution of GEMM operators. Nonetheless, the performance of NonGEMM operators has not been studied as thoroughly as that of GEMMs. This paper therefore describes NonGEMM Bench, a benchmark for studying NonGEMM operators. We first construct NonGEMM Bench from popular ML workloads across different domains, then perform case studies on GPU platforms of various grades to analyze the behavior of NonGEMM operators in GPU-accelerated systems. Finally, we present key takeaways for bridging the gap between GEMM and NonGEMM operators and offer the community potential new optimization directions.
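
To put the cost contrast in concrete terms: multiplying an M x K matrix by a K x N matrix takes M * N * K multiply-and-accumulate (MAC) operations, so a single 4096 x 4096 x 4096 GEMM already requires 4096^3, roughly 68.7 billion, MACs, whereas typical NonGEMM operators such as normalization, softmax, and activation functions touch each element only a few times. The sketch below illustrates the kind of measurement such a benchmark performs: splitting a model's GPU time into GEMM and NonGEMM buckets with PyTorch's torch.profiler. It is not the paper's actual benchmark code; the workload (torchvision's ResNet-50) and the keyword-based GEMM classifier are assumptions made here for illustration.

    # Illustrative sketch, not the paper's benchmark code: split a model's GPU
    # time into GEMM vs. NonGEMM operator buckets with torch.profiler.
    # Assumes a CUDA-capable machine and a recent PyTorch/torchvision.
    import torch
    import torchvision.models as models
    from torch.profiler import profile, ProfilerActivity

    # Operator-name substrings treated as GEMM-backed (a heuristic assumption,
    # not an exhaustive taxonomy).
    GEMM_KEYWORDS = ("gemm", "matmul", "mm", "linear", "conv")

    def is_gemm(op_name: str) -> bool:
        name = op_name.lower()
        return any(keyword in name for keyword in GEMM_KEYWORDS)

    model = models.resnet50().eval().cuda()          # stand-in workload
    x = torch.randn(8, 3, 224, 224, device="cuda")   # one inference batch

    with torch.no_grad(), profile(
            activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
        model(x)

    gemm_us = nongemm_us = 0.0
    for evt in prof.key_averages():
        gpu_us = evt.self_cuda_time_total  # self time avoids double-counting
        if gpu_us == 0:
            continue                       # skip CPU-only bookkeeping events
        if is_gemm(evt.key):
            gemm_us += gpu_us
        else:
            nongemm_us += gpu_us

    total_us = gemm_us + nongemm_us
    print(f"GEMM time:    {gemm_us / 1e3:9.2f} ms ({100 * gemm_us / total_us:5.1f}%)")
    print(f"NonGEMM time: {nongemm_us / 1e3:9.2f} ms ({100 * nongemm_us / total_us:5.1f}%)")

A breakdown like this shows how much of the remaining runtime NonGEMM operators account for once the GEMMs are accelerated, which is the question the benchmark is built to answer across workloads and GPU grades.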

Bibliographic Details
Main Authors: Karami, Rachid; Moar, Chakshu; Kao, Sheng-Chun; Kwon, Hyoukjun
Format: Article
Language: English
Subjects: Computer Science - Hardware Architecture; Computer Science - Learning; Computer Science - Performance
Online Access: https://arxiv.org/abs/2404.11788
DOI: 10.48550/arXiv.2404.11788
Published: 2024-04-17
Source: arXiv.org
Rights: CC BY 4.0 (http://creativecommons.org/licenses/by/4.0)