Scaling Law for Language Models Training Considering Batch Size
Large language models (LLMs) have made remarkable advances in recent years, with scaling laws playing a critical role in this rapid progress. In this paper, we empirically investigate how a critical hyper-parameter, i.e., the global batch size, influences the LLM training process. We begin by training language models ranging from 125 million to 2.6 billion parameters, using up to 300 billion high-quality tokens. Through these experiments, we establish a basic scaling law on model size and training data amount. We then examine how varying batch sizes and learning rates affect the convergence and generalization of these models. Our analysis yields batch size scaling laws under two different cases: with a fixed compute budget, and with a fixed amount of training data. Extrapolation experiments on models of increasing sizes validate our predicted laws, providing guidance for optimizing LLM training strategies under specific resource constraints.
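The record does not reproduce the paper's fitted coefficients, so the sketch below only illustrates the kind of fixed-compute-budget reasoning the abstract describes: a generic Chinchilla-style loss surface L(N, D) = E + A/N^α + B/D^β combined with the common approximation C ≈ 6·N·D to pick a model size under a given compute budget. The functional form, the placeholder coefficients, and the helper names (`predicted_loss`, `best_model_size`) are assumptions for illustration only, not the authors' reported law; the batch-size term that the paper adds is omitted here.

```python
# Hedged sketch: scanning a Chinchilla-style scaling law L(N, D) = E + A/N^alpha + B/D^beta
# to choose a model size under a fixed compute budget C ~ 6*N*D (FLOPs).
# All coefficient values are illustrative placeholders, not this paper's fitted numbers.
import numpy as np

# Assumed (placeholder) coefficients of the loss surface.
E, A, B = 1.69, 406.4, 410.7
alpha, beta = 0.34, 0.28

def predicted_loss(N, D):
    """Predicted final training loss for N parameters trained on D tokens."""
    return E + A / N**alpha + B / D**beta

def best_model_size(C, sizes):
    """For compute budget C (FLOPs), scan candidate sizes; D is then fixed by C ~ 6*N*D."""
    losses = [predicted_loss(N, C / (6.0 * N)) for N in sizes]
    i = int(np.argmin(losses))
    return sizes[i], C / (6.0 * sizes[i]), losses[i]

if __name__ == "__main__":
    candidate_sizes = np.logspace(8, 10.5, 50)   # 100M to ~30B parameters
    for C in (1e21, 1e22, 1e23):                 # example compute budgets
        N_opt, D_opt, L_opt = best_model_size(C, candidate_sizes)
        print(f"C={C:.0e} FLOPs -> N~{N_opt:.2e} params, D~{D_opt:.2e} tokens, loss~{L_opt:.3f}")
```

Under the abstract's second regime (a fixed amount of training data), the same surface would instead be scanned over N with D held constant.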
Saved in:
Main authors: | Shuai, Xian; Wang, Yiding; Wu, Yimeng; Jiang, Xin; Ren, Xiaozhe |
---|---|
Format: | Article |
Language: | eng |
Subjects: | Computer Science - Computation and Language; Computer Science - Learning |
Online access: | Order full text |
creator | Shuai, Xian; Wang, Yiding; Wu, Yimeng; Jiang, Xin; Ren, Xiaozhe |
---|---|
description | Large language models (LLMs) have made remarkable advances in recent years, with scaling laws playing a critical role in this rapid progress. In this paper, we empirically investigate how a critical hyper-parameter, i.e., the global batch size, influences the LLM training process. We begin by training language models ranging from 125 million to 2.6 billion parameters, using up to 300 billion high-quality tokens. Through these experiments, we establish a basic scaling law on model size and training data amount. We then examine how varying batch sizes and learning rates affect the convergence and generalization of these models. Our analysis yields batch size scaling laws under two different cases: with a fixed compute budget, and with a fixed amount of training data. Extrapolation experiments on models of increasing sizes validate our predicted laws, providing guidance for optimizing LLM training strategies under specific resource constraints. |
doi_str_mv | 10.48550/arxiv.2412.01505 |
format | Article |
fullrecord | Source record for arXiv:2412.01505 (open access); creation date 2024-12-02; rights: http://arxiv.org/licenses/nonexclusive-distrib/1.0; title, creators, subjects, and abstract duplicate the surrounding fields |
fulltext | fulltext_linktorsrc |
identifier | DOI: 10.48550/arxiv.2412.01505 |
language | eng |
recordid | cdi_arxiv_primary_2412_01505 |
source | arXiv.org |
subjects | Computer Science - Computation and Language; Computer Science - Learning |
title | Scaling Law for Language Models Training Considering Batch Size |