LLM-grounded Video Diffusion Models
Main authors: , , , ,
Format: Article
Language: English
Subjects:
Online access: Order full text
Abstract: Text-conditioned diffusion models have emerged as a promising tool for neural video generation. However, current models still struggle with intricate spatiotemporal prompts and often generate restricted or incorrect motion. To address these limitations, we introduce LLM-grounded Video Diffusion (LVD). Instead of directly generating videos from the text inputs, LVD first leverages a large language model (LLM) to generate dynamic scene layouts based on the text inputs and subsequently uses the generated layouts to guide a diffusion model for video generation. We show that LLMs are able to understand complex spatiotemporal dynamics from text alone and generate layouts that align closely with both the prompts and the object motion patterns typically observed in the real world. We then propose to guide video diffusion models with these layouts by adjusting the attention maps. Our approach is training-free and can be integrated into any video diffusion model that admits classifier guidance. Our results demonstrate that LVD significantly outperforms its base video diffusion model and several strong baseline methods in faithfully generating videos with the desired attributes and motion patterns.
DOI: 10.48550/arxiv.2309.17444
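
The abstract describes a two-stage pipeline: an LLM first converts the text prompt into dynamic scene layouts (per-frame bounding boxes), and those layouts then steer the video diffusion model by adjusting its attention maps in a classifier-guidance style. The sketch below is only an illustration of that idea, not the authors' implementation: `query_llm_for_layouts`, the attention-map shapes, and the toy energy function are assumptions made for this example.

```python
# Minimal sketch (assumed, not from the paper's code) of the LVD idea:
# LLM-produced per-frame layouts define where attention mass should fall,
# and a gradient of a layout energy provides the guidance signal.
import torch

def query_llm_for_layouts(prompt: str, num_frames: int):
    """Hypothetical stand-in for the LLM call: returns, for each frame, a dict
    mapping a prompt phrase to a normalized bounding box (x0, y0, x1, y1)."""
    # A real implementation would prompt an LLM to emit dynamic scene layouts;
    # here we fabricate a box that moves left to right over the clip.
    return [
        {"a cat": (0.1 + 0.5 * t / max(num_frames - 1, 1), 0.4,
                   0.3 + 0.5 * t / max(num_frames - 1, 1), 0.7)}
        for t in range(num_frames)
    ]

def box_to_mask(box, h: int, w: int) -> torch.Tensor:
    """Rasterize a normalized box into a binary spatial mask of shape (h, w)."""
    x0, y0, x1, y1 = box
    mask = torch.zeros(h, w)
    mask[int(y0 * h):int(y1 * h), int(x0 * w):int(x1 * w)] = 1.0
    return mask

def layout_energy(attn_maps, layouts, h: int, w: int) -> torch.Tensor:
    """Energy that is small when each phrase's cross-attention mass lies inside
    its box in every frame. attn_maps[phrase] is assumed to have shape
    (num_frames, h * w): one spatial attention map per frame."""
    energy = torch.zeros(())
    for t, frame_layout in enumerate(layouts):
        for phrase, box in frame_layout.items():
            mask = box_to_mask(box, h, w).flatten()
            attn = attn_maps[phrase][t]
            attn = attn / (attn.sum() + 1e-8)              # normalize to a distribution
            energy = energy + (1.0 - (attn * mask).sum())  # attention mass outside the box
    return energy

# Toy usage: one guidance step on fabricated attention maps.
frames, h, w = 8, 16, 16
layouts = query_llm_for_layouts("a cat running to the right", frames)
attn_maps = {"a cat": torch.rand(frames, h * w, requires_grad=True)}

energy = layout_energy(attn_maps, layouts, h, w)
energy.backward()
# In an actual sampler, the gradient would be taken w.r.t. the noisy latent and
# folded into each denoising step (classifier guidance) so that the generated
# object follows the LLM-produced layout; here we only inspect the toy gradient.
print(energy.item(), attn_maps["a cat"].grad.shape)
```

Because the guidance acts only through gradients of an energy on attention maps, no retraining of the diffusion model is needed, which matches the paper's claim that the approach is training-free and applicable to any video diffusion model that admits classifier guidance.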