SSGVS: Semantic Scene Graph-to-Video Synthesis
Saved in:
Main authors:
Format: Article
Language: English
Subjects:
Online access: Order full text
Abstract: As a natural extension of the image synthesis task, video synthesis has attracted a lot of interest recently. Many image synthesis works utilize class labels or text as guidance. However, neither labels nor text can provide explicit temporal guidance, such as when an action starts or ends. To overcome this limitation, we introduce semantic video scene graphs as input for video synthesis, as they represent the spatial and temporal relationships between objects in the scene. Since video scene graphs are usually temporally discrete annotations, we propose a video scene graph (VSG) encoder that not only encodes the existing video scene graphs but also predicts the graph representations for unlabeled frames. The VSG encoder is pre-trained with different contrastive multi-modal losses. A semantic scene graph-to-video synthesis framework (SSGVS), based on the pre-trained VSG encoder, VQ-VAE, and auto-regressive Transformer, is proposed to synthesize a video given an initial scene image and a variable number of semantic scene graphs. We evaluate SSGVS and other state-of-the-art video synthesis models on the Action Genome dataset and demonstrate the benefit of video scene graphs for video synthesis. The source code will be released.
DOI: 10.48550/arxiv.2211.06119
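
To make the pipeline in the abstract concrete, the following is a minimal, runnable PyTorch sketch of how per-frame scene-graph embeddings and the VQ tokens of an initial image could condition an auto-regressive prior that emits tokens for subsequent frames. It is not the authors' released code: the class names (ToyVSGEncoder, ToyPriorTransformer, generate_video_tokens), the flat subject-predicate-object embedding, all dimensions, greedy decoding, and the omission of the VQ-VAE decoder and the contrastive pre-training are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Assumed sizes; the abstract does not specify them.
VOCAB = 512            # VQ-VAE codebook size
TOKENS_PER_FRAME = 64  # VQ tokens per frame (e.g. an 8x8 latent grid)
D = 256                # hidden width


class ToyVSGEncoder(nn.Module):
    """Placeholder for the contrastively pre-trained VSG encoder: maps the
    (subject, predicate, object) id triplets of each frame's scene graph to
    one conditioning vector per frame."""

    def __init__(self, num_entities=100, num_predicates=30):
        super().__init__()
        self.ent = nn.Embedding(num_entities, D)
        self.pred = nn.Embedding(num_predicates, D)

    def forward(self, triplets):                   # (frames, relations, 3)
        s, p, o = triplets.unbind(dim=-1)
        rel = self.ent(s) + self.pred(p) + self.ent(o)
        return rel.mean(dim=1)                     # (frames, D)


class ToyPriorTransformer(nn.Module):
    """Auto-regressive prior over VQ token indices, conditioned on the
    scene-graph embedding of the frame currently being synthesized."""

    def __init__(self):
        super().__init__()
        self.tok = nn.Embedding(VOCAB, D)
        self.cond = nn.Linear(D, D)
        layer = nn.TransformerEncoderLayer(D, nhead=4, batch_first=True)
        self.body = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(D, VOCAB)

    def forward(self, token_ids, frame_cond):      # (1, T), (1, D)
        x = self.tok(token_ids) + self.cond(frame_cond).unsqueeze(1)
        t = x.size(1)                              # causal (look-back only) mask
        mask = torch.full((t, t), float("-inf"), device=x.device).triu(1)
        h = self.body(x, mask=mask)
        return self.head(h[:, -1])                 # logits for the next token


@torch.no_grad()
def generate_video_tokens(prior, vsg_embs, first_frame_tokens):
    """Greedy roll-out: start from the VQ tokens of the given initial image
    and append TOKENS_PER_FRAME tokens per scene-graph-described frame; a
    VQ-VAE decoder (omitted here) would turn the tokens back into frames."""
    seq = first_frame_tokens.clone()               # (1, TOKENS_PER_FRAME)
    for cond in vsg_embs:                          # one embedding per frame
        for _ in range(TOKENS_PER_FRAME):
            logits = prior(seq, cond.unsqueeze(0))
            nxt = logits.argmax(dim=-1, keepdim=True)
            seq = torch.cat([seq, nxt], dim=1)
    return seq


if __name__ == "__main__":
    vsg, prior = ToyVSGEncoder(), ToyPriorTransformer()
    graphs = torch.zeros(3, 5, 3, dtype=torch.long)    # 3 frames, 5 relations
    graphs[..., 0] = torch.randint(0, 100, (3, 5))     # subject ids
    graphs[..., 1] = torch.randint(0, 30, (3, 5))      # predicate ids
    graphs[..., 2] = torch.randint(0, 100, (3, 5))     # object ids
    init_tokens = torch.randint(0, VOCAB, (1, TOKENS_PER_FRAME))
    tokens = generate_video_tokens(prior, vsg(graphs), init_tokens)
    print(tokens.shape)                                # (1, 4 * TOKENS_PER_FRAME)
```

Because the scene graphs supply one conditioning vector per frame, the number of generated frames simply follows the number of input graphs, which mirrors the "variable number of semantic scene graphs" interface described in the abstract.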