Computation vs. Communication Scaling for Future Transformers on Future Hardware
Scaling neural network models has delivered dramatic quality gains across ML problems. However, this scaling has increased the reliance on efficient distributed training techniques. Accordingly, as with other distributed computing scenarios, it is important to understand how compute and communication will scale relative to one another as models scale and hardware evolves...
Saved in:
Main authors: | Pati, Suchita; Aga, Shaizeen; Islam, Mahzabeen; Jayasena, Nuwan; Sinclair, Matthew D |
---|---|
Format: | Article |
Language: | eng |
Subjects: | Computer Science - Distributed, Parallel, and Cluster Computing; Computer Science - Hardware Architecture |
Online access: | Order full text |
creator | Pati, Suchita; Aga, Shaizeen; Islam, Mahzabeen; Jayasena, Nuwan; Sinclair, Matthew D
description | Scaling neural network models has delivered dramatic quality gains across ML
problems. However, this scaling has increased the reliance on efficient
distributed training techniques. Accordingly, as with other distributed
computing scenarios, it is important to understand how compute and
communication will scale relative to one another as models scale and hardware
evolves. A careful study that answers this question can better guide the
design of future systems that can efficiently train future large models.
To that end, this work provides a comprehensive multi-axial (algorithmic,
empirical, hardware evolution) analysis of compute vs. communication
(Comp-vs.-Comm) scaling for future Transformer models on future hardware.
First, our algorithmic analysis shows that compute generally enjoys an edge
over communication as models scale. However, since memory capacity scales
slower than compute, these trends are being stressed. Next, we quantify this
edge by empirically studying how Comp-vs.-Comm scales for future models on
future hardware. To avoid profiling numerous Transformer models across many
setups, we extract execution regions and project costs using operator models.
This allows a spectrum (hundreds) of future model/hardware scenarios to be
accurately studied ($<$15% error) and reduces profiling costs by 2100$\times$.
Our experiments show that communication will be a significant portion (40-75%)
of runtime as models and hardware evolve. Moreover, communication that is
hidden by overlapped computation in today's models often cannot be hidden in
future, larger models. Overall, this work highlights the increasingly large
role communication will play as models scale and discusses techniques and
upcoming technologies that can help address it.
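To make the abstract's algorithmic argument concrete, the following rough sketch (our illustration, not the paper's operator model or its results) estimates per-layer compute and tensor-parallel all-reduce traffic for a Transformer block using standard Megatron-style approximations: GEMM FLOPs grow roughly with the square of the hidden size, while the all-reduced activation bytes grow only linearly with it, so the compute-to-communication ratio tends to improve as models widen. The token count, byte width, and helper names below are assumptions made purely for illustration.

```python
# Illustrative back-of-the-envelope estimate (not the paper's operator model):
# per-layer compute vs. tensor-parallel all-reduce traffic for a Transformer
# block, using standard Megatron-style approximations. Constants and names
# here are assumptions for illustration only.

def per_layer_gemm_flops(tokens: int, hidden: int) -> float:
    """Approximate GEMM FLOPs for one layer's forward pass (attention
    projections ~4h^2 plus a 4x-expansion MLP ~8h^2 per token, times 2 for
    multiply-accumulate), ignoring the O(sequence^2) attention-score terms."""
    return 2.0 * tokens * 12 * hidden * hidden

def per_layer_allreduce_bytes(tokens: int, hidden: int, bytes_per_elem: int = 2) -> float:
    """Approximate forward-pass tensor-parallel traffic: two all-reduces of the
    (tokens x hidden) activation tensor (after attention and after the MLP)."""
    return 2.0 * tokens * hidden * bytes_per_elem

if __name__ == "__main__":
    tokens = 2048  # assumed micro-batch size in tokens
    for hidden in (4096, 8192, 16384):
        flops = per_layer_gemm_flops(tokens, hidden)
        comm = per_layer_allreduce_bytes(tokens, hidden)
        # FLOPs scale ~h^2 while bytes scale ~h, so the ratio grows with h.
        print(f"h={hidden:5d}: {flops:.2e} FLOPs, {comm:.2e} B, FLOPs/B = {flops/comm:.0f}")
```

Whether that growing ratio actually translates into hidden communication depends on achievable FLOP/s versus network bandwidth on a given system, which is the hardware-evolution axis the paper quantifies empirically.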
doi_str_mv | 10.48550/arxiv.2302.02825 |
format | Article |
fulltext | fulltext_linktorsrc |
identifier | DOI: 10.48550/arxiv.2302.02825 |
language | eng |
recordid | cdi_arxiv_primary_2302_02825 |
source | arXiv.org |
subjects | Computer Science - Distributed, Parallel, and Cluster Computing; Computer Science - Hardware Architecture
title | Computation vs. Communication Scaling for Future Transformers on Future Hardware |
url | https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2024-12-25T01%3A00%3A28IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-arxiv_GOX&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Computation%20vs.%20Communication%20Scaling%20for%20Future%20Transformers%20on%20Future%20Hardware&rft.au=Pati,%20Suchita&rft.date=2023-02-06&rft_id=info:doi/10.48550/arxiv.2302.02825&rft_dat=%3Carxiv_GOX%3E2302_02825%3C/arxiv_GOX%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_id=info:pmid/&rfr_iscdi=true |