λDNN: Achieving Predictable Distributed DNN Training With Serverless Architectures

Serverless computing is becoming a promising paradigm for Distributed Deep Neural Network (DDNN) training in the cloud, as it allows users to decompose complex model training into a number of functions without managing virtual machines or servers. Though provided with a simpler resource interface (i.e., function number and memory size), inadequate function resource provisioning (either under-provisioning or over-provisioning) easily leads to unpredictable DDNN training performance on serverless platforms. Our empirical studies on AWS Lambda indicate that such unpredictable performance of serverless DDNN training is mainly caused by the resource bottleneck of Parameter Servers (PS) and small local batch sizes. In this article, we design and implement λDNN, a cost-efficient function resource provisioning framework that provides predictable performance for serverless DDNN training workloads while saving the budget of provisioned functions. Leveraging the PS network bandwidth and function CPU utilization, we build a lightweight analytical DDNN training performance model that underpins the λDNN resource provisioning strategy, so as to guarantee DDNN training performance with serverless functions. Extensive prototype experiments on AWS Lambda and complementary trace-driven simulations demonstrate that λDNN can deliver predictable DDNN training performance and save the monetary cost of function resources by up to 66.7 percent compared with state-of-the-art resource provisioning strategies, with acceptable runtime overhead.
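The abstract describes an analytical performance model driven by PS network bandwidth and function CPU utilization, used to pick a function configuration that meets a training-performance target at minimal cost. As a rough illustration of that idea only, the Python sketch below encodes a toy synchronous-PS iteration-time model and a brute-force provisioning search; the paper's actual model, parameters, and strategy are not given in this record, so every function name, formula, and constant here (iteration_time, cost_per_iteration, provision, the 2x-model-size traffic term, the Lambda price) is an illustrative assumption.

```python
# Toy sketch of a PS-centric provisioning model, assuming a synchronous
# parameter-server architecture. All formulas and constants here are
# illustrative assumptions, not the paper's actual λDNN model.

def iteration_time(n_functions: int, global_batch: int, sample_time_s: float,
                   model_size_mb: float, ps_bandwidth_mbps: float) -> float:
    """Predicted wall-clock time of one synchronous training iteration."""
    # Compute: data parallelism splits the global batch across functions,
    # so each function's local batch is global_batch / n_functions.
    compute_s = (global_batch / n_functions) * sample_time_s
    # Communication: each function pushes gradients to and pulls parameters
    # from the PS, so the PS link carries ~2 * model_size per function per
    # iteration; the PS bandwidth is the shared bottleneck (assumption).
    comm_s = (2 * model_size_mb * 8 * n_functions) / ps_bandwidth_mbps
    return compute_s + comm_s

def cost_per_iteration(n_functions: int, memory_gb: float, iter_time_s: float,
                       price_per_gb_s: float = 0.0000166667) -> float:
    # AWS Lambda bills in GB-seconds; the default price is the published
    # x86 rate at the time of writing and varies by region.
    return n_functions * memory_gb * iter_time_s * price_per_gb_s

def provision(target_iter_s: float, memory_gb: float, **model):
    """Cheapest function count (1..128) whose predicted iteration time
    meets the target; returns (cost, n_functions) or None if infeasible."""
    feasible = [(cost_per_iteration(n, memory_gb, iteration_time(n, **model)), n)
                for n in range(1, 129)
                if iteration_time(n, **model) <= target_iter_s]
    return min(feasible) if feasible else None

# Example: more functions shrink per-function compute but add PS traffic,
# so the search balances the two under the performance target.
print(provision(target_iter_s=2.0, memory_gb=3.0, global_batch=256,
                sample_time_s=0.02, model_size_mb=25, ps_bandwidth_mbps=5000))
```

A real provisioning decision would also have to profile sample_time_s and the achievable PS bandwidth at each memory size, since Lambda allocates CPU power proportionally to configured memory; the sketch treats both as fixed inputs.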

Bibliographic Details
Published in: IEEE Transactions on Computers, 2022-02, Vol. 71 (2), pp. 450-463
Main authors: Xu, Fei; Qin, Yiling; Chen, Li; Zhou, Zhi; Liu, Fangming
Format: Article
Language: English
Publisher: New York: IEEE
DOI: 10.1109/TC.2021.3054656
ISSN: 0018-9340
EISSN: 1557-9956
Source: IEEE Electronic Library (IEL)
Subjects:
Artificial neural networks
Bandwidth
Central Processing Unit
Cloud computing
Computational modeling
Distributed DNN training
Empirical analysis
function resource provisioning
Performance prediction
predictable performance
Provisioning
Resource allocation
serverless computing
Servers
Throughput
Training
Virtual environments