The Power of Next-Frame Prediction for Learning Physical Laws
Main authors: | , , , , , , , , |
---|---|
Format: | Article |
Language: | eng |
Subjects: | |
Online access: | Order full text |
Abstract: | Next-frame prediction is a useful and powerful method for modelling and
understanding the dynamics of video data. Inspired by the empirical success of
causal language modelling and next-token prediction in language modelling, we
explore the extent to which next-frame prediction serves as a strong
foundational learning strategy (analogous to language modelling) for inducing
an understanding of the visual world. In order to quantify the specific visual
understanding induced by next-frame prediction, we introduce six diagnostic
simulation video datasets derived from fundamental physical laws created by
varying physical constants such as gravity and mass. We demonstrate that our
models trained only on next-frame prediction are capable of predicting the
value of these physical constants (e.g. gravity) without having been trained
directly to learn these constants via a regression task. We find that the
generative training phase alone induces a model state that can predict physical
constants significantly better than that of a random model, improving the loss
by a factor of between 1.28 and 6.24. We conclude that next-frame prediction
shows great promise as a general learning strategy to induce understanding of
the many 'laws' that govern the visual domain without the need for explicit
labelling. |
---|---|
DOI: | 10.48550/arxiv.2405.17450 |
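
The abstract reports that probing a next-frame-pretrained model for physical constants improves the loss over a random model by a factor of between 1.28 and 6.24. The sketch below shows one way such a factor could be computed. It assumes the factor is the ratio of the random model's probe loss to the pretrained model's probe loss, and that the loss is mean squared error; the abstract confirms neither. The function names and every number are hypothetical illustrations, not results from the paper.

```python
def mse(predictions, targets):
    """Mean squared error between two equal-length sequences of floats."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)


def improvement_factor(random_loss, pretrained_loss):
    """Ratio of baseline loss to pretrained loss (an assumed definition)."""
    if pretrained_loss <= 0:
        raise ValueError("pretrained_loss must be positive")
    return random_loss / pretrained_loss


# Illustrative values only; these are NOT results from the paper.
true_gravity = [9.0, 9.8, 10.5, 11.2]

# Hypothetical probe outputs read from a randomly initialised model.
random_probe = [10.0, 10.0, 10.0, 10.0]

# Hypothetical probe outputs read from a next-frame-pretrained model.
pretrained_probe = [9.5, 9.8, 10.5, 10.7]

random_loss = mse(random_probe, true_gravity)
pretrained_loss = mse(pretrained_probe, true_gravity)
factor = improvement_factor(random_loss, pretrained_loss)

print(f"random loss:     {random_loss:.4f}")
print(f"pretrained loss: {pretrained_loss:.4f}")
print(f"improvement:     {factor:.2f}x")
```

A factor above 1 means the pretrained representation yields a lower probe loss than the random baseline, which is the direction of the effect the abstract describes.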