vPipe: A Virtualized Acceleration System for Achieving Efficient and Scalable Pipeline Parallel DNN Training

DNNs of increasing computational complexity have achieved unprecedented successes in areas such as machine vision and natural language processing (NLP); recent advanced Transformer models, for example, have billions of parameters. However, because large-scale DNNs significantly exceed a GPU's physical memory limit, they cannot be trained by conventional methods such as data parallelism. Pipeline parallelism, which partitions a large DNN into small subnets and trains them on different GPUs, is a plausible solution. Unfortunately, the layer partitioning and memory management in existing pipeline parallel systems are fixed during training, making them easily impeded by out-of-memory errors and GPU under-utilization. These drawbacks are amplified when performing neural architecture search (NAS), as in the Evolved Transformer, where different Transformer architectures need to be trained repeatedly. vPipe is the first system that transparently provides dynamic layer partitioning and memory management for pipeline parallelism. vPipe makes two unique contributions: (1) an online algorithm for searching a near-optimal layer partitioning and memory management plan, and (2) a live layer migration protocol for re-balancing the layer distribution across a training pipeline. vPipe improved the training throughput of two notable baselines (PipeDream and GPipe) by 61.4-463.4 percent and 24.8-291.3 percent, respectively, on various large DNNs and training settings.
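
The abstract's central mechanism, splitting a chain of DNN layers into contiguous stages placed on different GPUs under per-GPU memory limits, can be made concrete with a small sketch. The Python below is a minimal, hypothetical illustration of such a static partitioning step; it is not vPipe's actual online algorithm, API, or migration protocol, and all layer names, cost figures, and the balancing heuristic are invented for illustration only.

# Hypothetical sketch (not vPipe's code): partition a chain of layers into
# contiguous pipeline stages, one stage per GPU, minimizing the largest
# per-stage compute time while respecting a per-GPU memory limit.
from dataclasses import dataclass
from typing import List

@dataclass
class Layer:
    name: str
    compute_ms: float  # estimated forward + backward time for one micro-batch
    memory_gb: float   # estimated footprint of weights + stashed activations

def partition(layers: List[Layer], num_stages: int, mem_limit_gb: float) -> List[List[Layer]]:
    """Return up to `num_stages` contiguous groups of layers whose maximum
    per-stage compute time is (approximately) minimized under the memory cap."""
    if any(l.memory_gb > mem_limit_gb for l in layers):
        raise ValueError("a single layer exceeds the per-GPU memory limit")

    def stages_needed(budget: float) -> int:
        # Greedy left-to-right packing: open a new stage whenever adding the
        # next layer would exceed the compute budget or the memory limit.
        stages, compute, memory = 1, 0.0, 0.0
        for l in layers:
            if compute + l.compute_ms > budget or memory + l.memory_gb > mem_limit_gb:
                stages, compute, memory = stages + 1, 0.0, 0.0
            compute += l.compute_ms
            memory += l.memory_gb
        return stages

    # Binary-search the smallest per-stage compute budget that fits in num_stages.
    lo, hi = max(l.compute_ms for l in layers), sum(l.compute_ms for l in layers)
    if stages_needed(hi) > num_stages:
        raise ValueError("memory limit forces more stages than GPUs available")
    for _ in range(60):
        mid = (lo + hi) / 2
        if stages_needed(mid) <= num_stages:
            hi = mid
        else:
            lo = mid

    # Materialize the stages using the found budget.
    result, current, compute, memory = [], [], 0.0, 0.0
    for l in layers:
        if current and (compute + l.compute_ms > hi or memory + l.memory_gb > mem_limit_gb):
            result.append(current)
            current, compute, memory = [], 0.0, 0.0
        current.append(l)
        compute += l.compute_ms
        memory += l.memory_gb
    result.append(current)
    return result

if __name__ == "__main__":
    # Toy 8-layer "Transformer-like" model split across 4 GPUs with 4 GB each.
    model = [Layer(f"block{i}", compute_ms=10.0 + 3.0 * i, memory_gb=1.5) for i in range(8)]
    for gpu, stage in enumerate(partition(model, num_stages=4, mem_limit_gb=4.0)):
        print(f"GPU {gpu}: {[l.name for l in stage]}")

In vPipe's setting, a plan like this would additionally be revisited while training runs, with layers migrated live between neighboring stages when the balance drifts; the sketch covers only the static planning step such a system would start from.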

Bibliographic details
Published in: IEEE Transactions on Parallel and Distributed Systems, 2022-01, Vol. 33 (3), p. 489
Main authors: Zhao, Shixiong; Li, Fanxin; Chen, Xusheng; Guan, Xiuxian; Jiang, Jianyu; Huang, Dong; Qing, Yuhao; Wang, Sen; Wang, Peng; Zhang, Gong; Cheng, Li; Luo, Ping; Cui, Heming
Format: Article
Language: English
Subjects: Computer architecture; Graphics processing units; Machine vision; Memory management; Natural language processing; Partitioning; Search algorithms; Training; Transformers
Online access: Full text
DOI: 10.1109/TPDS.2021.3094364
Publisher: The Institute of Electrical and Electronics Engineers, Inc. (IEEE), New York
ISSN: 1045-9219
EISSN: 1558-2183
Record ID: cdi_proquest_journals_2560131446
Source: IEEE Electronic Library (IEL)
URL: https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-26T22%3A40%3A36IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=v%20Pipe%20:%20A%20Virtualized%20Acceleration%20System%20for%20Achieving%20Efficient%20and%20Scalable%20Pipeline%20Parallel%20DNN%20Training&rft.jtitle=IEEE%20transactions%20on%20parallel%20and%20distributed%20systems&rft.au=Zhao,%20Shixiong&rft.date=2022-01-01&rft.volume=33&rft.issue=3&rft.spage=489&rft.pages=489-&rft.issn=1045-9219&rft.eissn=1558-2183&rft_id=info:doi/10.1109/TPDS.2021.3094364&rft_dat=%3Cproquest%3E2560131446%3C/proquest%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=2560131446&rft_id=info:pmid/&rfr_iscdi=true