Scaling Autoregressive Video Models
Main authors: | , , |
---|---|
Format: | Article |
Language: | eng |
Subjects: | |
Online access: | Order full text |
Summary: | Due to the statistical complexity of video, the high degree of inherent
stochasticity, and the sheer amount of data, generating natural video remains a
challenging task. State-of-the-art video generation models often attempt to
address these issues by combining sometimes complex, usually video-specific
neural network architectures, latent variable models, adversarial training and
a range of other methods. Despite their often high complexity, these approaches
still fall short of generating high quality video continuations outside of
narrow domains and often struggle with fidelity. In contrast, we show that
conceptually simple autoregressive video generation models based on a
three-dimensional self-attention mechanism achieve competitive results across
multiple metrics on popular benchmark datasets, for which they produce
continuations of high fidelity and realism. We also present results from
training our models on Kinetics, a large scale action recognition dataset
comprised of YouTube videos exhibiting phenomena such as camera movement,
complex object interactions and diverse human movement. While modeling these
phenomena consistently remains elusive, we hope that our results, which include
occasional realistic continuations, encourage further research on comparatively
complex, large scale datasets such as Kinetics. |
---|---|
DOI: | 10.48550/arxiv.1906.02634 |