SCAR: Scheduling Multi-Model AI Workloads on Heterogeneous Multi-Chiplet Module Accelerators

Emerging multi-model workloads with heavy models such as recent large language models have significantly increased the compute and memory demands on hardware. To address these growing demands, designing a scalable hardware architecture has become a key problem. Among recent solutions, 2.5D silicon-interposer multi-chip module (MCM)-based AI accelerators have been actively explored as a promising scalable option due to their low engineering cost and composability. However, previous MCM accelerators are based on homogeneous architectures with fixed dataflow, which face major challenges from highly heterogeneous multi-model workloads due to their limited workload adaptivity. Therefore, in this work, we explore the opportunity in heterogeneous-dataflow MCM AI accelerators. We identify that scheduling a multi-model workload on a heterogeneous-dataflow MCM AI accelerator is an important and challenging problem due to its significance and scale, which reaches O(10^56) even for a two-model workload on 6x6 chiplets. We develop a set of heuristics to navigate this huge scheduling space and codify them into a scheduler, SCAR, with advanced techniques such as inter-chiplet pipelining. Our evaluation on ten multi-model workload scenarios for datacenter multitenancy and AR/VR use cases shows the efficacy of our approach, achieving on average 27.6% and 29.6% lower energy-delay product (EDP) for the respective application settings compared to homogeneous baselines.
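The O(10^56) scale claimed for the scheduling space can be sanity-checked with a rough combinatorial count. The counting below is an illustrative assumption (one independent scheduling choice per chiplet), not necessarily the paper's exact enumeration:

```python
import math

# Illustrative back-of-the-envelope count (assumption, not the paper's
# exact formulation): if each of the 36 chiplets in a 6x6 MCM can take
# one of ~36 independent scheduling choices (e.g., which model/layer
# partition and dataflow it serves), the space is 36^36.
chiplets = 6 * 6
choices_per_chiplet = 36
space = choices_per_chiplet ** chiplets

# 36^36 is on the order of 10^56, matching the abstract's O(10^56).
print(f"space ~ 10^{math.log10(space):.1f}")
```

Even this simplified count already lands at roughly 10^56, which is why exhaustive search is infeasible and heuristic navigation of the space is required.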

Detailed Description

Saved in:
Bibliographic Details
Main Authors: Odema, Mohanad; Chen, Luke; Kwon, Hyoukjun; Faruque, Mohammad Abdullah Al
Format: Article
Language: English
Subjects:
Online Access: Order full text
DOI: 10.48550/arxiv.2405.00790
Source: arXiv.org
Subjects:
Computer Science - Artificial Intelligence
Computer Science - Distributed, Parallel, and Cluster Computing
Computer Science - Hardware Architecture
Computer Science - Learning
Computer Science - Performance