A resource-aware workload scheduling method for unbalanced GEMMs on GPUs

GEMM (General Matrix Multiplication) serves as a fundamental operator for deep learning computations. Especially in attention-based deep learning models, such as Bert, GPT, and SAM, the sizes of matrices involved in GEMMs exhibit an unbalanced distribution due to the variable input, resulting in the low utilization of hardware resources. To address the issue, this paper proposes inserting a novel GEMM processing layer into the deep learning inference stack and using an adaptive load balancing method to partition and schedule GEMM computation tasks. The method is implemented with hardware runtime resource information, such as the occupancy of computing units. Experiment results show the remarkable performance of our method in unbalanced input GEMM scenarios, achieving an average performance improvement of 2.3x. The method also performs well in attention-based models (GPT-2 and SAM), achieving an average inference speed improvement of 1.1x. These findings highlight the effectiveness of resource-aware algorithm optimization, especially for computation task scheduling.
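The abstract describes partitioning GEMM tasks based on runtime resource information such as compute-unit occupancy. The paper's actual algorithm is not reproduced in this record; the following is a minimal, hypothetical sketch of the general idea (all names, the tile size, and the split heuristic are assumptions, not the authors' method): when an unbalanced GEMM produces too few output tiles to cover the idle compute units, an additional split factor is chosen so more blocks can run concurrently.

```python
# Hypothetical sketch of occupancy-aware GEMM partitioning (illustrative only,
# not the paper's implementation). A GEMM with an (M, N) output is covered by
# TILE x TILE output tiles; if the tile count is far below the number of idle
# compute units (e.g. GPU SMs), the work is split further to raise occupancy.

TILE = 128  # illustrative output-tile edge length


def tile_count(m: int, n: int) -> int:
    """Number of TILE x TILE output tiles covering an (m, n) result matrix."""
    return -(-m // TILE) * -(-n // TILE)  # ceiling division


def choose_splits(m: int, n: int, free_units: int) -> int:
    """Pick a power-of-two split factor so tiles roughly cover idle units."""
    tiles = tile_count(m, n)
    splits = 1
    while tiles * splits < free_units and splits < 32:
        splits *= 2
    return splits
```

For a balanced 4096x4096 output, the tiles alone already saturate the hardware and no extra split is chosen; for a small 128x128 output on a GPU with many free units, the heuristic splits aggressively. This is only meant to make the "resource-aware" idea concrete.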

Detailed Description

Saved in:
Bibliographic Details
Published in: Computer journal, 2024-10
Main authors: Liu, Hangda, Diao, Boyu, Chen, Wenxin, Xu, Yongjun
Format: Article
Language: eng
Online access: Full text
description GEMM (General Matrix Multiplication) serves as a fundamental operator for deep learning computations. Especially in attention-based deep learning models, such as Bert, GPT, and SAM, the sizes of matrices involved in GEMMs exhibit an unbalanced distribution due to the variable input, resulting in the low utilization of hardware resources. To address the issue, this paper proposes inserting a novel GEMM processing layer into the deep learning inference stack and using an adaptive load balancing method to partition and schedule GEMM computation tasks. The method is implemented with hardware runtime resource information, such as the occupancy of computing units, etc. Experiment results show the remarkable performance of our method in unbalanced input GEMM scenarios, achieving an average performance improvement of 2.3x. The method also performs well in attention-based models (GPT-2 and SAM), achieving an average inference speed improvement of 1.1x. These findings highlight the effectiveness of resource-aware algorithm optimization, especially for computation task scheduling.
doi 10.1093/comjnl/bxae110
format Article
identifier ISSN: 0010-4620
EISSN: 1460-2067
source Oxford University Press Journals All Titles (1996-Current)
title A resource-aware workload scheduling method for unbalanced GEMMs on GPUs