Fold3D: Rethinking and Parallelizing Computational and Communicational Tasks in the Training of Large DNN Models

Training a large DNN (e.g., GPT3) efficiently on commodity clouds is challenging even with the latest 3D parallel training systems (e.g., Megatron v3.0). In particular, along the pipeline parallelism dimension, computational tasks that produce a whole DNN's gradients with multiple input batches should be concurrently activated...

Bibliographic Details
Published in: IEEE Transactions on Parallel and Distributed Systems, 2023-05, Vol. 34 (5), p. 1432-1449
Main authors: Li, Fanxin, Zhao, Shixiong, Qing, Yuhao, Chen, Xusheng, Guan, Xiuxian, Wang, Sen, Zhang, Gong, Cui, Heming
Format: Article
Language: English
Subjects:
Online access: Full text
container_end_page 1449
container_issue 5
container_start_page 1432
container_title IEEE transactions on parallel and distributed systems
container_volume 34
creator Li, Fanxin
Zhao, Shixiong
Qing, Yuhao
Chen, Xusheng
Guan, Xiuxian
Wang, Sen
Zhang, Gong
Cui, Heming
description Training a large DNN (e.g., GPT3) efficiently on commodity clouds is challenging even with the latest 3D parallel training systems (e.g., Megatron v3.0). In particular, along the pipeline parallelism dimension, computational tasks that produce a whole DNN's gradients with multiple input batches should be concurrently activated; along the data parallelism dimension, a set of heavy-weight communications (for aggregating the accumulated outputs of computational tasks) is inevitably serialized after the pipelined tasks, undermining the training performance (e.g., in Megatron, data parallelism caused all GPUs to be idle for over 44% of the training time) over commodity cloud networks. To deserialize these communicational and computational tasks, we propose AIAO scheduling (for 3D parallelism), which slices a DNN into multiple segments, so that the computational tasks processing the same DNN segment can be scheduled together, and the communicational tasks that synchronize this segment can be launched and overlapped (deserialized) with other segments' computational tasks. We realized this idea in our Fold3D training system. Extensive evaluation shows Fold3D eliminated most of the all-GPU 44% idle time in Megatron (caused by data parallelism), leading to 25.2%-42.1% training throughput improvement compared to four notable baselines over various settings; Fold3D's high performance scaled to many GPUs.
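To make the scheduling idea in the abstract concrete, the sketch below shows, in plain PyTorch, how one segment's gradient all-reduce can be launched asynchronously as soon as that segment's backward work is done, so the communication overlaps with the backward compute of the remaining segments. This is only an illustration under simplifying assumptions (a single data-parallel backward pass, no pipeline stages or microbatch folding, an already-initialized torch.distributed process group); the names train_step and model_segments are hypothetical, and this is not Fold3D's actual AIAO implementation.

```python
import torch
import torch.distributed as dist


def train_step(model_segments, batch, optimizer):
    """One step that overlaps per-segment gradient all-reduce with the
    backward compute of the remaining segments.

    model_segments: list of nn.Module applied in order (a hypothetical
    slicing of the DNN into segments, as AIAO assumes).
    """
    # Forward pass, keeping each segment's output so the backward pass
    # can be driven segment by segment.
    acts = [batch]
    for seg in model_segments:
        acts.append(seg(acts[-1]))
    loss = acts[-1].float().mean()  # placeholder loss

    handles = []
    # Gradient of the loss w.r.t. the last segment's output.
    grad = torch.autograd.grad(loss, acts[-1])[0]
    # Backward pass from the output-side segment towards the input side.
    for i in reversed(range(len(model_segments))):
        seg, seg_in, seg_out = model_segments[i], acts[i], acts[i + 1]
        params = list(seg.parameters())
        needs_input_grad = i > 0  # the raw input batch needs no gradient
        inputs = ([seg_in] if needs_input_grad else []) + params
        grads = torch.autograd.grad(seg_out, inputs, grad_outputs=grad)
        if needs_input_grad:
            grad, param_grads = grads[0], grads[1:]
        else:
            param_grads = grads
        # As soon as this segment's gradients exist, launch their all-reduce
        # asynchronously (summing; averaging by world size is omitted for
        # brevity). The communication overlaps with the backward compute of
        # the segments still to come.
        for p, g in zip(params, param_grads):
            p.grad = g
            handles.append(dist.all_reduce(p.grad, async_op=True))
    # Drain all outstanding gradient synchronizations before stepping.
    for h in handles:
        h.wait()
    optimizer.step()
    optimizer.zero_grad()
```

In the full AIAO schedule described above, the computational tasks that process the same segment (across microbatches) are grouped so that the segment's synchronization can hide behind other segments' computation; the sketch only captures that overlap principle for a single backward pass on one worker group.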
doi_str_mv 10.1109/TPDS.2023.3247883
format Article
fulltext fulltext
identifier ISSN: 1045-9219
ispartof IEEE transactions on parallel and distributed systems, 2023-05, Vol.34 (5), p.1432-1449
issn 1045-9219
1558-2183
language eng
recordid cdi_proquest_journals_2789466134
source IEEE Electronic Library (IEL)
subjects 3D parallelism
Chlorophyll
Cloud computing
Commodities
Computational modeling
deep learning
distributed training
DNN
GPU
Graphics processing units
machine learning
Parallel processing
pipeline parallelism
Pipeline processing
Pipelining (computers)
Processor scheduling
Segments
Task analysis
Task scheduling
Three-dimensional displays
Training
title Fold3D: Rethinking and Parallelizing Computational and Communicational Tasks in the Training of Large DNN Models