Irreducible Curriculum for Language Model Pretraining
Saved in:
Main authors:
Format: Article
Language: eng
Subjects:
Online access: Order full text
Abstract: Automatic data selection and curriculum design for training large language models is challenging, and only a few existing methods show improvements over standard training. Furthermore, current schemes focus on domain-level selection, overlooking the more fine-grained contributions of each individual training point. Traditional datapoint selection methods are difficult to apply to large language models: most online batch selection methods require two forward or backward passes per step, which introduces considerable extra cost for large-scale models. To mitigate these obstacles, we propose the irreducible curriculum, a curriculum learning algorithm for language model pretraining that prioritizes samples with higher learnability. Specifically, to avoid prohibitive extra computational overhead, we simulate the sample loss along the main model's training trajectory using a small-scale proxy model. Our experiments on the RedPajama-1B dataset demonstrate a consistent improvement in validation perplexity across all seven domains compared to the random uniform baseline and the anti-curriculum strategy. Our method also reduces the sharpness of the trained network and achieves better 5-shot accuracy on the MMLU benchmark.
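The abstract only outlines the selection idea, so the following is a minimal illustrative sketch, not the authors' released code: score each candidate sequence with a small proxy model's loss, treat that score as a learnability signal, and keep the highest-scoring fraction of each batch for the main model. The HuggingFace-style model interface (a causal LM returning `.logits`), the names `proxy_model`, `select_by_learnability`, `keep_ratio`, and the batch-dictionary keys are all assumptions for illustration; the paper's exact scoring along the training trajectory differs in detail.

```python
# Illustrative sketch of proxy-loss-based sample prioritization for
# online batch selection (assumptions noted in the text above).
import torch
import torch.nn.functional as F


def per_sample_loss(model, input_ids, attention_mask):
    """Mean next-token cross-entropy per sequence, computed without gradients."""
    with torch.no_grad():
        logits = model(input_ids=input_ids, attention_mask=attention_mask).logits
        shift_logits = logits[:, :-1, :]        # predict token t+1 from the prefix
        shift_labels = input_ids[:, 1:]
        loss = F.cross_entropy(
            shift_logits.reshape(-1, shift_logits.size(-1)),
            shift_labels.reshape(-1),
            reduction="none",
        ).view(shift_labels.size())
        mask = attention_mask[:, 1:].float()    # ignore padding positions
        return (loss * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1.0)


def select_by_learnability(proxy_model, batch, keep_ratio=0.5):
    """Keep the keep_ratio fraction of samples the proxy model scores highest,
    using its per-sample loss as a cheap stand-in for learnability."""
    scores = per_sample_loss(
        proxy_model, batch["input_ids"], batch["attention_mask"]
    )
    k = max(1, int(keep_ratio * scores.numel()))
    top = torch.topk(scores, k).indices
    return {name: tensor[top] for name, tensor in batch.items()}
```

In an online-selection loop, a function like `select_by_learnability` would filter each oversampled candidate batch before the main model's gradient step, so the extra cost is a forward pass of the small proxy rather than of the full model.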
DOI: 10.48550/arxiv.2310.15389