In-context learning and Occam's razor
Main authors: | , , , , , , , |
---|---|
Format: | Article |
Language: | English |
Keywords: | |
Abstract: | A central goal of machine learning is generalization. While the No Free Lunch
Theorem states that we cannot obtain theoretical guarantees for generalization
without further assumptions, in practice we observe that simple models which
explain the training data generalize best: a principle called Occam's razor.
Despite the need for simple models, most current approaches in machine learning
only minimize the training error, and at best indirectly promote simplicity
through regularization or architecture design. Here, we draw a connection
between Occam's razor and in-context learning: an emergent ability of certain
sequence models like Transformers to learn at inference time from past
observations in a sequence. In particular, we show that the next-token
prediction loss used to train in-context learners is directly equivalent to a
data compression technique called prequential coding, and that minimizing this
loss amounts to jointly minimizing both the training error and the complexity
of the model that was implicitly learned from context. Our theory and the
empirical experiments we use to support it not only provide a normative account
of in-context learning, but also elucidate the shortcomings of current
in-context learning methods, suggesting ways in which they can be improved. We
make our code available at https://github.com/3rdCore/PrequentialCode. |
---|---|
DOI: | 10.48550/arxiv.2410.14086 |
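
The abstract's link between next-token prediction loss and prequential coding can be made concrete with a toy example. In prequential coding, each symbol is encoded using a model fit only to the symbols before it, so the total code length is the sum of the negative log-probabilities of the successive next-symbol predictions. The sketch below is not taken from the paper or its repository: it assumes a binary sequence and swaps in a Laplace rule-of-succession estimator as a stand-in learner, and the names `laplace_predict` and `prequential_code_length` are hypothetical.

```python
import math
from typing import Sequence


def laplace_predict(past: Sequence[int]) -> float:
    """Probability that the next bit is 1, given the bits seen so far.

    Assumption for illustration: a Laplace (add-one) estimator plays the
    role of the learner; an in-context learner would take its place.
    """
    return (sum(past) + 1) / (len(past) + 2)


def prequential_code_length(bits: Sequence[int]) -> float:
    """Bits needed to encode `bits` one symbol at a time.

    Each symbol is coded with the prediction made from the preceding
    symbols only, so the total is a sum of next-symbol log losses.
    """
    total = 0.0
    for i, bit in enumerate(bits):
        p_one = laplace_predict(bits[:i])
        p = p_one if bit == 1 else 1.0 - p_one
        total += -math.log2(p)
    return total


if __name__ == "__main__":
    regular = [1] * 16        # highly regular sequence
    balanced = [1, 0] * 8     # eight ones and eight zeros
    print("regular :", prequential_code_length(regular))
    print("balanced:", prequential_code_length(balanced))
```

Under this stand-in learner the regular sequence receives a much shorter code than the balanced one, because the learner quickly fits it and its later predictions are cheap to encode. A learner that needs many observations before predicting well pays for those early errors in code length, which is the sense in which the summed next-token loss penalizes both error and the complexity of what was learned.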