SCAR: Scheduling Multi-Model AI Workloads on Heterogeneous Multi-Chiplet Module Accelerators

Emerging multi-model workloads with heavy models such as recent large language models have significantly increased the compute and memory demands on hardware. To address these growing demands, designing a scalable hardware architecture has become a key problem. Among recent solutions, 2.5D silicon-interposer multi-chip module (MCM)-based AI accelerators have been actively explored as a promising scalable option due to their low engineering cost and composability. However, previous MCM accelerators are based on homogeneous architectures with fixed dataflow, which face major challenges from highly heterogeneous multi-model workloads due to their limited workload adaptivity. Therefore, in this work, we explore the opportunity in heterogeneous-dataflow MCM AI accelerators. We identify that scheduling a multi-model workload on a heterogeneous-dataflow MCM AI accelerator is an important and challenging problem due to its significance and scale, which reaches O(10^56) even for a two-model workload on 6x6 chiplets. We develop a set of heuristics to navigate this huge scheduling space and codify them into a scheduler, SCAR, with advanced techniques such as inter-chiplet pipelining. Our evaluation on ten multi-model workload scenarios for datacenter multitenancy and AR/VR use cases shows the efficacy of our approach, achieving on average 27.6% and 29.6% lower energy-delay product (EDP) for the respective application settings compared to homogeneous baselines.
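The O(10^56) scale claimed for the scheduling space can be sanity-checked with a rough combinatorial count. The counting below is an illustrative assumption (one independent scheduling choice per chiplet), not necessarily the paper's exact enumeration:

```python
import math

# Illustrative back-of-the-envelope count (assumption, not the paper's
# exact formulation): if each of the 36 chiplets in a 6x6 MCM can take
# one of ~36 independent scheduling choices (e.g., which model/layer
# partition and dataflow it serves), the space is 36^36.
chiplets = 6 * 6
choices_per_chiplet = 36
space = choices_per_chiplet ** chiplets

# 36^36 is on the order of 10^56, matching the abstract's O(10^56).
print(f"space ~ 10^{math.log10(space):.1f}")
```

Even this simplified count already lands at roughly 10^56, which is why exhaustive search is infeasible and heuristic navigation of the space is required.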

Detailed Description

Saved in:
Bibliographic Details
Main Authors: Odema, Mohanad; Chen, Luke; Kwon, Hyoukjun; Faruque, Mohammad Abdullah Al
Format: Article
Language: English
Subjects:
Online Access: Order full text
DOI: 10.48550/arxiv.2405.00790
Source: arXiv.org
Subjects:
Computer Science - Artificial Intelligence
Computer Science - Distributed, Parallel, and Cluster Computing
Computer Science - Hardware Architecture
Computer Science - Learning
Computer Science - Performance