T3: Transparent Tracking & Triggering for Fine-grained Overlap of Compute & Collectives
Large Language Models increasingly rely on distributed techniques for their training and inference. These techniques require communication across devices, which can reduce scaling efficiency as the number of devices increases. While some distributed techniques can overlap, and thus hide, this communication with independent computations, techniques such as Tensor Parallelism (TP) inherently serialize communication with model execution. One approach to hide this serialized communication is to interleave it with the producer operation (of the communicated data) in a fine-grained manner. However, this fine-grained interleaving of communication and computation in software can be difficult. Furthermore, as with any concurrent execution, it requires compute and memory resources to be shared between computation and communication, causing resource contention that reduces overlapping efficacy. To overcome these challenges, we propose T3, which applies hardware-software co-design to transparently overlap serialized communication while minimizing resource contention with compute. T3 transparently fuses producer operations with the subsequent communication via a simple configuration of the producer's output address space and requires minor software changes. At the hardware level, T3 adds a lightweight track-and-trigger mechanism to orchestrate the producer's compute and communication. It further uses compute-enhanced memories for communication's attendant compute. As a result, T3 reduces resource contention and efficiently overlaps serialized communication with computation. For important Transformer models like T-NLG, T3 speeds up communication-heavy sublayers by 30% geomean (max 47%) and reduces data movement by 22% geomean (max 36%). Furthermore, T3's benefits persist as models scale: geomean 29% for sublayers in \(\sim\)500-billion parameter models, PaLM and MT-NLG.
Saved in:
Published in: | arXiv.org 2024-01 |
---|---|
Main authors: | Pati, Suchita ; Shaizeen Aga ; Islam, Mahzabeen ; Jayasena, Nuwan ; Sinclair, Matthew D |
Format: | Article |
Language: | eng |
Subjects: | Co-design ; Communication ; Hardware ; Large language models ; Software ; Tensors |
Online access: | Full text |
container_title | arXiv.org |
---|---|
creator | Pati, Suchita ; Shaizeen Aga ; Islam, Mahzabeen ; Jayasena, Nuwan ; Sinclair, Matthew D |
description | Large Language Models increasingly rely on distributed techniques for their training and inference. These techniques require communication across devices, which can reduce scaling efficiency as the number of devices increases. While some distributed techniques can overlap, and thus hide, this communication with independent computations, techniques such as Tensor Parallelism (TP) inherently serialize communication with model execution. One approach to hide this serialized communication is to interleave it with the producer operation (of the communicated data) in a fine-grained manner. However, this fine-grained interleaving of communication and computation in software can be difficult. Furthermore, as with any concurrent execution, it requires compute and memory resources to be shared between computation and communication, causing resource contention that reduces overlapping efficacy. To overcome these challenges, we propose T3, which applies hardware-software co-design to transparently overlap serialized communication while minimizing resource contention with compute. T3 transparently fuses producer operations with the subsequent communication via a simple configuration of the producer's output address space and requires minor software changes. At the hardware level, T3 adds a lightweight track-and-trigger mechanism to orchestrate the producer's compute and communication. It further uses compute-enhanced memories for communication's attendant compute. As a result, T3 reduces resource contention and efficiently overlaps serialized communication with computation. For important Transformer models like T-NLG, T3 speeds up communication-heavy sublayers by 30% geomean (max 47%) and reduces data movement by 22% geomean (max 36%). Furthermore, T3's benefits persist as models scale: geomean 29% for sublayers in \(\sim\)500-billion parameter models, PaLM and MT-NLG. |
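The abstract describes fusing a producer operation (e.g., a tensor-parallel GEMM) with its subsequent all-reduce by triggering communication on the producer's output in a fine-grained manner. The following is a minimal sketch of that dataflow, simulated sequentially in NumPy; the function name, the chunking over output rows, and the two-"device" shard layout are all illustrative assumptions, not the paper's actual hardware mechanism:

```python
import numpy as np

def fused_gemm_allreduce(A_shards, B_shards, n_chunks=4):
    """Sketch of fine-grained overlap of a tensor-parallel GEMM with its
    all-reduce: the producer GEMM is split into output-row chunks, and each
    chunk's reduction is "triggered" as soon as that chunk is produced,
    rather than after the full GEMM completes.

    A_shards[d] is device d's column shard of A and B_shards[d] its matching
    row shard of B, so sum_d A_shards[d] @ B_shards[d] == A @ B.
    This sequential simulation only shows the dataflow; a real system would
    run each chunk's reduction concurrently with the next chunk's GEMM."""
    n_rows = A_shards[0].shape[0]
    n_cols = B_shards[0].shape[1]
    out = np.zeros((n_rows, n_cols))
    bounds = np.linspace(0, n_rows, n_chunks + 1, dtype=int)
    for lo, hi in zip(bounds[:-1], bounds[1:]):
        # producer compute for this chunk: one partial result per "device"
        partials = [A[lo:hi] @ B for A, B in zip(A_shards, B_shards)]
        # communication triggered immediately for the finished chunk
        out[lo:hi] = np.sum(partials, axis=0)  # simulated all-reduce
    return out
```

The chunked result matches the monolithic `A @ B` followed by a full all-reduce, which is what makes the fine-grained interleaving transparent to the rest of the model.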
format | Article |
fulltext | fulltext |
identifier | EISSN: 2331-8422 |
ispartof | arXiv.org, 2024-01 |
issn | 2331-8422 |
language | eng |
recordid | cdi_proquest_journals_2920373093 |
source | Free E-Journals |
subjects | Co-design ; Communication ; Hardware ; Large language models ; Software ; Tensors |
title | T3: Transparent Tracking & Triggering for Fine-grained Overlap of Compute & Collectives |