Scaling Law for Language Models Training Considering Batch Size
Large language models (LLMs) have made remarkable advances in recent years, with scaling laws playing a critical role in this rapid progress. In this paper, we empirically investigate how a critical hyper-parameter, i.e., the global batch size, influences the LLM training process. We begin by training language models ranging from 125 million to 2.6 billion parameters, using up to 300 billion high-quality tokens. Through these experiments, we establish a basic scaling law on model size and training data amount. We then examine how varying batch sizes and learning rates affect the convergence and generalization of these models. Our analysis yields batch size scaling laws under two different cases: with a fixed compute budget, and with a fixed amount of training data. Extrapolation experiments on models of increasing sizes validate our predicted laws, providing guidance for optimizing LLM training strategies under specific resource constraints.
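The record does not reproduce the paper's fitted coefficients, so the sketch below only illustrates the kind of fixed-compute-budget reasoning the abstract describes: a generic Chinchilla-style loss surface L(N, D) = E + A/N^α + B/D^β combined with the common approximation C ≈ 6·N·D to pick a model size under a given compute budget. The functional form, the placeholder coefficients, and the helper names (`predicted_loss`, `best_model_size`) are assumptions for illustration only, not the authors' reported law; the batch-size term that the paper adds is omitted here.

```python
# Hedged sketch: scanning a Chinchilla-style scaling law L(N, D) = E + A/N^alpha + B/D^beta
# to choose a model size under a fixed compute budget C ~ 6*N*D (FLOPs).
# All coefficient values are illustrative placeholders, not this paper's fitted numbers.
import numpy as np

# Assumed (placeholder) coefficients of the loss surface.
E, A, B = 1.69, 406.4, 410.7
alpha, beta = 0.34, 0.28

def predicted_loss(N, D):
    """Predicted final training loss for N parameters trained on D tokens."""
    return E + A / N**alpha + B / D**beta

def best_model_size(C, sizes):
    """For compute budget C (FLOPs), scan candidate sizes; D is then fixed by C ~ 6*N*D."""
    losses = [predicted_loss(N, C / (6.0 * N)) for N in sizes]
    i = int(np.argmin(losses))
    return sizes[i], C / (6.0 * sizes[i]), losses[i]

if __name__ == "__main__":
    candidate_sizes = np.logspace(8, 10.5, 50)   # 100M to ~30B parameters
    for C in (1e21, 1e22, 1e23):                 # example compute budgets
        N_opt, D_opt, L_opt = best_model_size(C, candidate_sizes)
        print(f"C={C:.0e} FLOPs -> N~{N_opt:.2e} params, D~{D_opt:.2e} tokens, loss~{L_opt:.3f}")
```

Under the abstract's second regime (a fixed amount of training data), the same surface would instead be scanned over N with D held constant.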
Saved in:
Main authors: | Shuai, Xian; Wang, Yiding; Wu, Yimeng; Jiang, Xin; Ren, Xiaozhe |
---|---|
Format: | Article |
Language: | eng |
Subjects: | Computer Science - Computation and Language; Computer Science - Learning |
Online access: | Order full text |
creator | Shuai, Xian; Wang, Yiding; Wu, Yimeng; Jiang, Xin; Ren, Xiaozhe |
---|---|
description | Large language models (LLMs) have made remarkable advances in recent years, with scaling laws playing a critical role in this rapid progress. In this paper, we empirically investigate how a critical hyper-parameter, i.e., the global batch size, influences the LLM training process. We begin by training language models ranging from 125 million to 2.6 billion parameters, using up to 300 billion high-quality tokens. Through these experiments, we establish a basic scaling law on model size and training data amount. We then examine how varying batch sizes and learning rates affect the convergence and generalization of these models. Our analysis yields batch size scaling laws under two different cases: with a fixed compute budget, and with a fixed amount of training data. Extrapolation experiments on models of increasing sizes validate our predicted laws, providing guidance for optimizing LLM training strategies under specific resource constraints. |
doi_str_mv | 10.48550/arxiv.2412.01505 |
format | Article |
fullrecord | Source record for arXiv:2412.01505 (open access); creation date 2024-12-02; rights: http://arxiv.org/licenses/nonexclusive-distrib/1.0; title, creators, subjects, and abstract duplicate the surrounding fields |
fulltext | fulltext_linktorsrc |
identifier | DOI: 10.48550/arxiv.2412.01505 |
language | eng |
recordid | cdi_arxiv_primary_2412_01505 |
source | arXiv.org |
subjects | Computer Science - Computation and Language; Computer Science - Learning |
title | Scaling Law for Language Models Training Considering Batch Size |