A resource-aware workload scheduling method for unbalanced GEMMs on GPUs

GEMM (General Matrix Multiplication) serves as a fundamental operator for deep learning computations. Especially in attention-based deep learning models, such as Bert, GPT, and SAM, the sizes of matrices involved in GEMMs exhibit an unbalanced distribution due to the variable input, resulting in the low utilization of hardware resources. To address the issue, this paper proposes inserting a novel GEMM processing layer into the deep learning inference stack and using an adaptive load balancing method to partition and schedule GEMM computation tasks. The method is implemented with hardware runtime resource information, such as the occupancy of computing units. Experiment results show the remarkable performance of our method in unbalanced input GEMM scenarios, achieving an average performance improvement of 2.3x. The method also performs well in attention-based models (GPT-2 and SAM), achieving an average inference speed improvement of 1.1x. These findings highlight the effectiveness of resource-aware algorithm optimization, especially for computation task scheduling.
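The abstract describes partitioning GEMM tasks based on runtime resource information such as compute-unit occupancy. The paper's actual algorithm is not reproduced in this record; the following is a minimal, hypothetical sketch of the general idea (all names, the tile size, and the split heuristic are assumptions, not the authors' method): when an unbalanced GEMM produces too few output tiles to cover the idle compute units, an additional split factor is chosen so more blocks can run concurrently.

```python
# Hypothetical sketch of occupancy-aware GEMM partitioning (illustrative only,
# not the paper's implementation). A GEMM with an (M, N) output is covered by
# TILE x TILE output tiles; if the tile count is far below the number of idle
# compute units (e.g. GPU SMs), the work is split further to raise occupancy.

TILE = 128  # illustrative output-tile edge length


def tile_count(m: int, n: int) -> int:
    """Number of TILE x TILE output tiles covering an (m, n) result matrix."""
    return -(-m // TILE) * -(-n // TILE)  # ceiling division


def choose_splits(m: int, n: int, free_units: int) -> int:
    """Pick a power-of-two split factor so tiles roughly cover idle units."""
    tiles = tile_count(m, n)
    splits = 1
    while tiles * splits < free_units and splits < 32:
        splits *= 2
    return splits
```

For a balanced 4096x4096 output, the tiles alone already saturate the hardware and no extra split is chosen; for a small 128x128 output on a GPU with many free units, the heuristic splits aggressively. This is only meant to make the "resource-aware" idea concrete.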

Detailed Description

Saved in:
Bibliographic Details
Published in: Computer journal, 2024-10
Main authors: Liu, Hangda, Diao, Boyu, Chen, Wenxin, Xu, Yongjun
Format: Article
Language: eng
Online access: Full text
description GEMM (General Matrix Multiplication) serves as a fundamental operator for deep learning computations. Especially in attention-based deep learning models, such as Bert, GPT, and SAM, the sizes of matrices involved in GEMMs exhibit an unbalanced distribution due to the variable input, resulting in the low utilization of hardware resources. To address the issue, this paper proposes inserting a novel GEMM processing layer into the deep learning inference stack and using an adaptive load balancing method to partition and schedule GEMM computation tasks. The method is implemented with hardware runtime resource information, such as the occupancy of computing units, etc. Experiment results show the remarkable performance of our method in unbalanced input GEMM scenarios, achieving an average performance improvement of 2.3x. The method also performs well in attention-based models (GPT-2 and SAM), achieving an average inference speed improvement of 1.1x. These findings highlight the effectiveness of resource-aware algorithm optimization, especially for computation task scheduling.
doi 10.1093/comjnl/bxae110
format Article
identifier ISSN: 0010-4620
EISSN: 1460-2067
source Oxford University Press Journals All Titles (1996-Current)
title A resource-aware workload scheduling method for unbalanced GEMMs on GPUs