GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding

Neural network scaling has been critical for improving the model quality in many real-world machine learning applications with vast amounts of training data and compute. Although this trend of scaling is affirmed to be a sure-fire approach for better model quality, there are challenges on the path such as the computation cost, ease of programming, and efficient implementation on parallel devices. GShard is a module composed of a set of lightweight annotation APIs and an extension to the XLA compiler. It provides an elegant way to express a wide range of parallel computation patterns with minimal changes to the existing model code. GShard enabled us to scale up a multilingual neural machine translation Transformer model with Sparsely-Gated Mixture-of-Experts beyond 600 billion parameters using automatic sharding. We demonstrate that such a giant model can efficiently be trained on 2048 TPU v3 accelerators in 4 days to achieve far superior quality for translation from 100 languages to English compared to the prior art.
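GShard's own annotation APIs are built on TensorFlow and the XLA compiler and are not reproduced here. As a rough, non-authoritative illustration of the idea the abstract describes (annotate how tensors are split across devices and let the compiler partition the computation), the sketch below uses JAX's GSPMD-based sharding, a later system in the same XLA SPMD-partitioning line of work. The mesh axis name, tensor shapes, and the forward function are illustrative assumptions, not the paper's API.

    import jax
    import jax.numpy as jnp
    from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

    # Hypothetical 1-D device mesh: a single CPU device locally, a TPU slice at scale.
    mesh = Mesh(jax.devices(), axis_names=("model",))

    # Annotate only the weight's layout: split its output dimension across the
    # "model" axis and leave activations replicated. The compiler partitions the
    # matmul and inserts any required collectives automatically.
    w = jax.device_put(jnp.zeros((1024, 4096)),
                       NamedSharding(mesh, P(None, "model")))

    @jax.jit
    def forward(x, w):
        # Ordinary, unannotated model code; sharding is carried by the inputs.
        return x @ w

    y = forward(jnp.ones((8, 1024)), w)
    print(y.sharding)  # columns of y end up sharded across the "model" axis

The design point mirrored here is the one the abstract emphasizes: parallelism is expressed through lightweight layout annotations rather than by rewriting the model, so the same model code can run on one device or on thousands.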

Bibliographic Details

Published in: arXiv.org, 2020-06
Authors: Lepikhin, Dmitry; Lee, HyoukJoong; Xu, Yuanzhong; Chen, Dehao; Firat, Orhan; Huang, Yanping; Krikun, Maxim; Shazeer, Noam; Chen, Zhifeng
Format: Article
Language: English
EISSN: 2331-8422
Subjects: Accelerators; Annotations; Machine learning; Machine translation; Neural networks; Parallel processing
Online access: Full text