PEEKABOO: Interactive Video Generation via Masked-Diffusion
Saved in:
Main authors:
Format: Article
Language: eng
Subjects:
Online access: Order full text
Abstract: Modern video generation models like Sora have achieved remarkable
success in producing high-quality videos. However, a significant limitation is
their inability to offer interactive control to users, a feature that promises
to open up unprecedented applications and creativity. In this work, we
introduce the first solution to equip diffusion-based video generation models
with spatio-temporal control. We present Peekaboo, a novel masked attention
module, which seamlessly integrates with current video generation models,
offering control without additional training or inference overhead. To
facilitate future research, we also introduce a comprehensive benchmark for
interactive video generation. This benchmark offers a standardized framework
for the community to assess the efficacy of emerging interactive video
generation models. Our extensive qualitative and quantitative assessments
reveal that Peekaboo achieves up to a 3.8x improvement in mIoU over baseline
models, all while maintaining the same latency. Code and benchmark are
available on the webpage.
DOI: 10.48550/arxiv.2312.07509