Entropy Law: The Story Behind Data Compression and LLM Performance
Saved in:

Main author(s):
Format: Article
Language: English
Subjects:
Online access: Order full text
Summary: Data is the cornerstone of large language models (LLMs), but not all data is useful for model learning. Carefully selected data can better elicit the capabilities of LLMs with much less computational overhead. Most data selection methods concentrate on evaluating the quality of individual samples, while the combinatorial effects among samples are neglected. Even if each sample is of perfect quality, their combination may be suboptimal for teaching LLMs due to intrinsic homogeneity or contradiction. In this paper, we aim to uncover the underlying relationship between LLM performance and data selection. Inspired by the information-compression nature of LLMs, we uncover an "entropy law" that connects LLM performance with the data compression ratio and the first-epoch training loss, which reflect the information redundancy of a dataset and the mastery of the inherent knowledge encoded in it, respectively. Through both theoretical deduction and empirical evaluation, we find that model performance is negatively correlated with the compression ratio of the training data, which usually yields a lower training loss. Based on the entropy law, we propose an efficient and universal data selection method named **ZIP** for training LLMs, which prioritizes data subsets exhibiting a low compression ratio. Using a multi-stage algorithm that selects diverse data in a greedy manner, ZIP obtains a data subset with satisfactory diversity. Extensive experiments validate the entropy law and the superiority of ZIP across different LLM backbones and alignment stages. We also present an interesting application of the entropy law: detecting potential performance risks at the beginning of model training.
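To make the selection criterion concrete, the following is a minimal sketch of a greedy, compression-ratio-guided selection loop in the spirit of what the summary describes. It is not the authors' ZIP implementation: the use of `zlib`, the single-stage greedy loop, the `candidate_pool` cutoff, and the definition of the ratio as original bytes over compressed bytes are assumptions made purely for illustration.

```python
import zlib


def compression_ratio(texts):
    """Original bytes / compressed bytes of the concatenated texts.

    A higher value means the subset compresses better, i.e. is more
    redundant; the summary suggests preferring subsets with a LOW ratio.
    """
    raw = "\n".join(texts).encode("utf-8")
    if not raw:
        return 0.0
    return len(raw) / len(zlib.compress(raw))


def greedy_low_redundancy_selection(corpus, budget, candidate_pool=256):
    """Greedily grow a subset whose joint compression ratio stays low.

    Each step scores at most `candidate_pool` remaining samples and adds
    the one whose inclusion keeps the joint ratio smallest, so near-
    duplicates of already-selected samples tend to be rejected.
    """
    selected, remaining = [], list(corpus)
    while remaining and len(selected) < budget:
        pool = remaining[:candidate_pool]
        best = min(pool, key=lambda s: compression_ratio(selected + [s]))
        selected.append(best)
        remaining.remove(best)
    return selected


if __name__ == "__main__":
    corpus = [
        "How do I sort a list in Python?",
        "How do I sort a list in Python??",              # near-duplicate
        "Explain the difference between TCP and UDP.",
        "Explain the difference between TCP and UDP!",   # near-duplicate
        "Write a haiku about autumn leaves.",
        "Prove that the square root of 2 is irrational.",
    ]
    print(greedy_low_redundancy_selection(corpus, budget=3))
```

Scoring each candidate jointly with the already-selected data, rather than each sample in isolation, is what lets a compression-based criterion capture the combinatorial effects (homogeneity, near-duplication) that per-sample quality scores miss.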
DOI: 10.48550/arxiv.2407.06645