AutoScale: Automatic Prediction of Compute-optimal Data Composition for Training LLMs
Format: Article
Language: English
Online access: Order full text
Abstract: Domain reweighting is an emerging research area aimed at adjusting the relative weights of different data sources to improve the effectiveness and efficiency of language model pre-training. This paper demonstrates that the optimal composition of training data from different domains is scale-dependent, challenging the existing practice of determining optimal mixtures through small-scale experiments and directly applying them at larger scales. We derive an analytical model for the dependence of optimal weights on data scale and introduce *AutoScale*, a novel, practical approach for optimizing data compositions at potentially large training data scales. *AutoScale* first uses a principled optimization framework to find optimal compositions at smaller, feasible scales, then predicts optimal compositions at larger scales using our derived model. Our evaluation on GPT-2 Large and BERT pre-training demonstrates *AutoScale*'s effectiveness in improving training convergence and downstream performance. In particular, for GPT-2 Large on RedPajama, *AutoScale* decreases validation perplexity 28% faster than baselines, with up to a 38% speed-up over unweighted training, and achieves the best performance across downstream tasks. This work provides insights into the varying benefits of data sources across training scales for language models, contributing to the burgeoning research on scale-dependent data curation. Code is open-sourced.
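The abstract describes a two-stage procedure: first find optimal domain weights at small, affordable training scales, then extrapolate them to a larger target scale with an analytical model of how optimal weights shift with scale. Below is a minimal sketch of that idea, assuming for illustration that each domain's optimal token count grows as a power law in total data scale; the function names, the two-scale fitting setup, and the power-law form are assumptions for this sketch, not the paper's exact formulation.

```python
import numpy as np

def fit_scale_model(n_small, n_large, N_small, N_large):
    """Fit a per-domain power-law growth model from optimal token
    counts observed at two small training scales.

    n_small, n_large: arrays of optimal per-domain token counts found
        (e.g., by small-scale weight optimization) at total budgets
        N_small and N_large.
    Illustrative assumption: optimal tokens for domain i follow
        n_i(N) = c_i * N ** b_i.
    """
    b = np.log(n_large / n_small) / np.log(N_large / N_small)
    c = n_small / N_small ** b
    return c, b

def predict_weights(c, b, N_target):
    """Extrapolate per-domain token counts to a larger target budget
    and renormalize them into domain sampling weights."""
    n_pred = c * N_target ** b
    return n_pred / n_pred.sum()

# Hypothetical usage: 3 domains, optimal mixes found at 1B and 2B tokens.
n1 = np.array([0.5e9, 0.3e9, 0.2e9])   # optimal split at N = 1B tokens
n2 = np.array([0.9e9, 0.7e9, 0.4e9])   # optimal split at N = 2B tokens
c, b = fit_scale_model(n1, n2, 1e9, 2e9)
print(predict_weights(c, b, 100e9))    # predicted weights at 100B tokens
```

The key point the sketch captures is that domains whose optimal share grows fastest with scale receive progressively more weight at the target budget, so the large-scale mixture can differ substantially from the small-scale one.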
DOI: 10.48550/arxiv.2407.20177