Transcending Scaling Laws with 0.1% Extra Compute
Scaling language models improves performance but comes with significant computational costs. This paper proposes UL2R, a method that substantially improves existing language models and their scaling curves with a relatively tiny amount of extra compute. The key idea is to continue training a state-of-the-art large language model (e.g., PaLM) on a few more steps with UL2's mixture-of-denoiser objective. We show that, with almost negligible extra computational costs and no new sources of data, we are able to substantially improve the scaling properties of large language models on downstream metrics. In this paper, we continue training PaLM with UL2R, introducing a new set of models at 8B, 62B, and 540B scale which we call U-PaLM. Impressively, at 540B scale, we show an approximately 2x computational savings rate where U-PaLM achieves the same performance as the final PaLM 540B model at around half its computational budget (i.e., saving \(\sim\)4.4 million TPUv4 hours). We further show that this improved scaling curve leads to 'emergent abilities' on challenging BIG-Bench tasks -- for instance, U-PaLM does much better than PaLM on some tasks or demonstrates better quality at much smaller scale (62B as opposed to 540B). Overall, we show that U-PaLM outperforms PaLM on many few-shot setups, i.e., English NLP tasks (e.g., commonsense reasoning, question answering), reasoning tasks with chain-of-thought (e.g., GSM8K), multilingual tasks (MGSM, TydiQA), MMLU and challenging BIG-Bench tasks. Finally, we provide qualitative examples showing the new capabilities of U-PaLM for single and multi-span infilling.
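The continued-training recipe in the abstract relies on UL2's mixture-of-denoisers: each training example is processed by one of several denoising objectives, signalled by a paradigm token. The sketch below is a minimal, illustrative rendering of that idea; the paradigm tokens ([NLU], [NLG], [S2S]) follow UL2's naming, but the corruption rates, span lengths, and sentinel format here are hypothetical placeholders, not the configuration actually used for U-PaLM.

```python
import random

# Illustrative configs only -- the real UL2R denoiser hyperparameters differ.
DENOISERS = {
    "R": {"mode": "[NLU]", "corrupt_rate": 0.15, "mean_span": 3},   # regular span corruption
    "X": {"mode": "[NLG]", "corrupt_rate": 0.50, "mean_span": 8},   # extreme corruption
    "S": {"mode": "[S2S]"},                                          # sequential (prefix-LM)
}

def make_denoising_example(tokens, denoiser, rng):
    """Turn a token sequence into an (input, target) pair for one denoiser."""
    cfg = DENOISERS[denoiser]
    if denoiser == "S":
        # Prefix-LM: predict the suffix of the sequence from its prefix.
        split = rng.randint(1, len(tokens) - 1)
        return [cfg["mode"]] + tokens[:split], tokens[split:]
    # Span corruption: replace random spans with sentinels; targets hold the spans.
    inp, tgt, i, sid = [cfg["mode"]], [], 0, 0
    while i < len(tokens):
        if rng.random() < cfg["corrupt_rate"] / cfg["mean_span"]:
            span = min(cfg["mean_span"], len(tokens) - i)
            sentinel = f"<extra_id_{sid}>"
            inp.append(sentinel)
            tgt.append(sentinel)
            tgt.extend(tokens[i:i + span])
            i += span
            sid += 1
        else:
            inp.append(tokens[i])
            i += 1
    return inp, tgt

rng = random.Random(0)
tokens = [f"t{i}" for i in range(20)]
# The "mixture": sample a denoiser per example during the extra training steps.
denoiser = rng.choice(sorted(DENOISERS))
inp, tgt = make_denoising_example(tokens, denoiser, rng)
```

Note how every objective reuses the same decoder-style (input, target) interface, which is what makes it cheap to bolt onto an already-trained causal LM for a few more steps.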
Saved in:
Published in: | arXiv.org 2022-11 |
---|---|
Main authors: | Tay, Yi; Wei, Jason; Chung, Hyung Won; Tran, Vinh Q; So, David R; Shakeri, Siamak; Garcia, Xavier; Zheng, Huaixiu Steven; Rao, Jinfeng; Chowdhery, Aakanksha; Zhou, Denny; Metzler, Donald; Petrov, Slav; Houlsby, Neil; Le, Quoc V; Dehghani, Mostafa |
Format: | Article |
Language: | eng |
Keywords: | Computing costs; Cost control; Performance enhancement; Reasoning; Scaling laws; Training |
Online access: | Full text |
container_title | arXiv.org |
---|---|
creator | Tay, Yi; Wei, Jason; Chung, Hyung Won; Tran, Vinh Q; So, David R; Shakeri, Siamak; Garcia, Xavier; Zheng, Huaixiu Steven; Rao, Jinfeng; Chowdhery, Aakanksha; Zhou, Denny; Metzler, Donald; Petrov, Slav; Houlsby, Neil; Le, Quoc V; Dehghani, Mostafa |
description | Scaling language models improves performance but comes with significant computational costs. This paper proposes UL2R, a method that substantially improves existing language models and their scaling curves with a relatively tiny amount of extra compute. The key idea is to continue training a state-of-the-art large language model (e.g., PaLM) on a few more steps with UL2's mixture-of-denoiser objective. We show that, with almost negligible extra computational costs and no new sources of data, we are able to substantially improve the scaling properties of large language models on downstream metrics. In this paper, we continue training PaLM with UL2R, introducing a new set of models at 8B, 62B, and 540B scale which we call U-PaLM. Impressively, at 540B scale, we show an approximately 2x computational savings rate where U-PaLM achieves the same performance as the final PaLM 540B model at around half its computational budget (i.e., saving \(\sim\)4.4 million TPUv4 hours). We further show that this improved scaling curve leads to 'emergent abilities' on challenging BIG-Bench tasks -- for instance, U-PaLM does much better than PaLM on some tasks or demonstrates better quality at much smaller scale (62B as opposed to 540B). Overall, we show that U-PaLM outperforms PaLM on many few-shot setups, i.e., English NLP tasks (e.g., commonsense reasoning, question answering), reasoning tasks with chain-of-thought (e.g., GSM8K), multilingual tasks (MGSM, TydiQA), MMLU and challenging BIG-Bench tasks. Finally, we provide qualitative examples showing the new capabilities of U-PaLM for single and multi-span infilling. |
format | Article |
publisher | Ithaca: Cornell University Library, arXiv.org |
creationdate | 2022-11-16 |
rights | 2022. This work is published under http://arxiv.org/licenses/nonexclusive-distrib/1.0/ (the "License"). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License. |
fulltext | fulltext |
identifier | EISSN: 2331-8422 |
ispartof | arXiv.org, 2022-11 |
issn | 2331-8422 |
language | eng |
recordid | cdi_proquest_journals_2727082420 |
source | Free E-Journals |
subjects | Computing costs; Cost control; Performance enhancement; Reasoning; Scaling laws; Training |
title | Transcending Scaling Laws with 0.1% Extra Compute |