T3: Transparent Tracking & Triggering for Fine-grained Overlap of Compute & Collectives
Large Language Models increasingly rely on distributed techniques for their training and inference. These techniques require communication across devices, which can reduce scaling efficiency as the number of devices increases. While some distributed techniques can overlap, and thus hide, this communication with independent computations, techniques such as Tensor Parallelism (TP) inherently serialize communication with model execution. One approach to hide this serialized communication is to interleave it with the producer operation (of the communicated data) in a fine-grained manner. However, this fine-grained interleaving of communication and computation in software can be difficult. Furthermore, as with any concurrent execution, it requires compute and memory resources to be shared between computation and communication, causing resource contention that reduces overlapping efficacy. To overcome these challenges, we propose T3, which applies hardware-software co-design to transparently overlap serialized communication while minimizing resource contention with compute. T3 transparently fuses producer operations with the subsequent communication via a simple configuration of the producer's output address space and requires minor software changes. At the hardware level, T3 adds a lightweight track-and-trigger mechanism to orchestrate the producer's compute and communication. It further uses compute-enhanced memories for communication's attendant compute. As a result, T3 reduces resource contention and efficiently overlaps serialized communication with computation. For important Transformer models like T-NLG, T3 speeds up communication-heavy sublayers by 30% geomean (max 47%) and reduces data movement by 22% geomean (max 36%). Furthermore, T3's benefits persist as models scale: geomean 29% for sublayers in \(\sim\)500-billion parameter models, PaLM and MT-NLG.
Saved in:
Published in: | arXiv.org 2024-01 |
---|---|
Main authors: | Pati, Suchita ; Shaizeen Aga ; Islam, Mahzabeen ; Jayasena, Nuwan ; Sinclair, Matthew D |
Format: | Article |
Language: | eng |
Subjects: | Co-design ; Communication ; Hardware ; Large language models ; Software ; Tensors |
Online access: | Full text |
container_title | arXiv.org |
---|---|
creator | Pati, Suchita ; Shaizeen Aga ; Islam, Mahzabeen ; Jayasena, Nuwan ; Sinclair, Matthew D |
description | Large Language Models increasingly rely on distributed techniques for their training and inference. These techniques require communication across devices, which can reduce scaling efficiency as the number of devices increases. While some distributed techniques can overlap, and thus hide, this communication with independent computations, techniques such as Tensor Parallelism (TP) inherently serialize communication with model execution. One approach to hide this serialized communication is to interleave it with the producer operation (of the communicated data) in a fine-grained manner. However, this fine-grained interleaving of communication and computation in software can be difficult. Furthermore, as with any concurrent execution, it requires compute and memory resources to be shared between computation and communication, causing resource contention that reduces overlapping efficacy. To overcome these challenges, we propose T3, which applies hardware-software co-design to transparently overlap serialized communication while minimizing resource contention with compute. T3 transparently fuses producer operations with the subsequent communication via a simple configuration of the producer's output address space and requires minor software changes. At the hardware level, T3 adds a lightweight track-and-trigger mechanism to orchestrate the producer's compute and communication. It further uses compute-enhanced memories for communication's attendant compute. As a result, T3 reduces resource contention and efficiently overlaps serialized communication with computation. For important Transformer models like T-NLG, T3 speeds up communication-heavy sublayers by 30% geomean (max 47%) and reduces data movement by 22% geomean (max 36%). Furthermore, T3's benefits persist as models scale: geomean 29% for sublayers in \(\sim\)500-billion parameter models, PaLM and MT-NLG. |
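The abstract describes fusing a producer operation (e.g., a tensor-parallel GEMM) with its subsequent all-reduce by triggering communication on the producer's output in a fine-grained manner. The following is a minimal sketch of that dataflow, simulated sequentially in NumPy; the function name, the chunking over output rows, and the two-"device" shard layout are all illustrative assumptions, not the paper's actual hardware mechanism:

```python
import numpy as np

def fused_gemm_allreduce(A_shards, B_shards, n_chunks=4):
    """Sketch of fine-grained overlap of a tensor-parallel GEMM with its
    all-reduce: the producer GEMM is split into output-row chunks, and each
    chunk's reduction is "triggered" as soon as that chunk is produced,
    rather than after the full GEMM completes.

    A_shards[d] is device d's column shard of A and B_shards[d] its matching
    row shard of B, so sum_d A_shards[d] @ B_shards[d] == A @ B.
    This sequential simulation only shows the dataflow; a real system would
    run each chunk's reduction concurrently with the next chunk's GEMM."""
    n_rows = A_shards[0].shape[0]
    n_cols = B_shards[0].shape[1]
    out = np.zeros((n_rows, n_cols))
    bounds = np.linspace(0, n_rows, n_chunks + 1, dtype=int)
    for lo, hi in zip(bounds[:-1], bounds[1:]):
        # producer compute for this chunk: one partial result per "device"
        partials = [A[lo:hi] @ B for A, B in zip(A_shards, B_shards)]
        # communication triggered immediately for the finished chunk
        out[lo:hi] = np.sum(partials, axis=0)  # simulated all-reduce
    return out
```

The chunked result matches the monolithic `A @ B` followed by a full all-reduce, which is what makes the fine-grained interleaving transparent to the rest of the model.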
format | Article |
fulltext | fulltext |
identifier | EISSN: 2331-8422 |
ispartof | arXiv.org, 2024-01 |
issn | 2331-8422 |
language | eng |
recordid | cdi_proquest_journals_2920373093 |
source | Free E-Journals |
subjects | Co-design ; Communication ; Hardware ; Large language models ; Software ; Tensors |
title | T3: Transparent Tracking & Triggering for Fine-grained Overlap of Compute & Collectives |