Language-driven Scene Synthesis using Multi-conditional Diffusion Model
| Main authors: | , , , , , , |
|---|---|
| Format: | Article |
| Language: | eng |
| Online access: | Order full text |
Abstract: Scene synthesis is a challenging problem with several industrial applications. Recently, substantial efforts have been directed toward synthesizing scenes from human motions, room layouts, or spatial graphs as input. However, few studies have addressed this problem from multiple modalities, especially by combining text prompts. In this paper, we propose language-driven scene synthesis, a new task that integrates text prompts, human motion, and existing objects for scene synthesis. Unlike single-condition synthesis tasks, our problem involves multiple conditions and requires a strategy for processing and encoding them into a unified space. To address this challenge, we present a multi-conditional diffusion model, which differs from the implicit unification approach of other diffusion literature by explicitly predicting guiding points for the original data distribution. We demonstrate that our approach is theoretically supported. Extensive experimental results show that our method outperforms state-of-the-art benchmarks and enables natural scene editing applications. The source code and dataset can be accessed at https://lang-scene-synth.github.io/.
DOI: 10.48550/arxiv.2310.15948
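The abstract describes encoding text prompts, human motion, and existing objects into a unified conditioning space, with the network explicitly predicting guiding points in addition to the usual diffusion output. The following is a minimal PyTorch sketch of what such a multi-conditional denoiser could look like; all module choices, dimensions, and the fusion strategy are illustrative assumptions, not the authors' implementation (see the paper and the source code at https://lang-scene-synth.github.io/ for the actual method).

```python
# A minimal sketch, assuming PyTorch, of how text, motion, and object
# conditions might be encoded into a unified space for a multi-conditional
# diffusion denoiser that also predicts guiding points. All module names,
# dimensions, and the fusion strategy are illustrative assumptions and not
# the authors' implementation.
import torch
import torch.nn as nn


class MultiConditionalDenoiser(nn.Module):
    def __init__(self, d_model=256, text_dim=512, motion_dim=72, obj_dim=9):
        super().__init__()
        # Per-modality projections into a shared d_model-dimensional space.
        self.text_proj = nn.Linear(text_dim, d_model)      # e.g. pooled text features
        self.motion_proj = nn.Linear(motion_dim, d_model)  # per-frame pose features
        self.obj_proj = nn.Linear(obj_dim, d_model)        # per-object parameters
        self.time_embed = nn.Sequential(
            nn.Linear(1, d_model), nn.SiLU(), nn.Linear(d_model, d_model)
        )
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.fusion = nn.TransformerEncoder(layer, num_layers=2)
        # Two heads: standard noise prediction plus explicit 3D guiding points.
        self.noise_head = nn.Linear(d_model, obj_dim)
        self.guide_head = nn.Linear(d_model, 3)

    def forward(self, x_t, t, text, motion, objects):
        # x_t: (B, N, obj_dim) noisy target-object parameters at step t.
        # text: (B, text_dim); motion: (B, T, motion_dim); objects: (B, M, obj_dim).
        step = self.time_embed(t[:, None, None].float())  # (B, 1, d_model)
        tokens = torch.cat(
            [
                self.obj_proj(x_t) + step,          # noisy targets
                self.text_proj(text).unsqueeze(1),  # text-prompt token
                self.motion_proj(motion),           # human-motion tokens
                self.obj_proj(objects),             # existing-object tokens
            ],
            dim=1,
        )
        h = self.fusion(tokens)
        n = x_t.shape[1]
        # Read predictions off the tokens that correspond to the noisy targets.
        return self.noise_head(h[:, :n]), self.guide_head(h[:, :n])


# Shape check with random inputs.
model = MultiConditionalDenoiser()
eps, guides = model(
    torch.randn(2, 5, 9),          # 5 objects being synthesized
    torch.randint(0, 1000, (2,)),  # diffusion timesteps
    torch.randn(2, 512),           # text embedding
    torch.randn(2, 30, 72),        # 30 motion frames
    torch.randn(2, 4, 9),          # 4 existing objects
)
print(eps.shape, guides.shape)  # torch.Size([2, 5, 9]) torch.Size([2, 5, 3])
```

Concatenating all per-modality tokens and running a shared transformer is one simple way to fuse multiple conditions into a unified space; the guiding-point head mirrors the abstract's idea of explicitly predicting guidance for the data distribution rather than relying on implicit unification alone.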