Latent Space Disentanglement in Diffusion Transformers Enables Precise Zero-shot Semantic Editing
Saved in:
Main authors: | , , , , |
---|---|
Format: | Article |
Language: | eng |
Subjects: | |
Online access: | Order full text |
Summary: | Diffusion Transformers (DiTs) have recently achieved remarkable success in
text-guided image generation. In image editing, DiTs project text and image
inputs to a joint latent space, from which they decode and synthesize new
images. However, it remains largely unexplored how multimodal information
collectively forms this joint space and how it guides the semantics of the
synthesized images. In this paper, we investigate the latent space of DiT
models and uncover two key properties: First, DiT's latent space is inherently
semantically disentangled, where different semantic attributes can be
controlled by specific editing directions. Second, consistent semantic editing
requires utilizing the entire joint latent space, as neither encoded image nor
text alone contains enough semantic information. We show that these editing
directions can be obtained directly from text prompts, enabling precise
semantic control without additional training or mask annotations. Based on
these insights, we propose a simple yet effective Encode-Identify-Manipulate
(EIM) framework for zero-shot fine-grained image editing. Specifically, we
first encode both the given source image and the text prompt that describes the
image, to obtain the joint latent embedding. Then, using our proposed Hessian
Score Distillation Sampling (HSDS) method, we identify editing directions that
control specific target attributes while preserving other image features. These
directions are guided by text prompts and used to manipulate the latent
embeddings. Moreover, we propose a new metric to quantify the disentanglement
degree of the latent space of diffusion models. Extensive experimental results
and analysis on our newly curated benchmark dataset demonstrate DiT's
disentanglement properties and the effectiveness of the EIM framework. |
---|---|
DOI: | 10.48550/arxiv.2411.08196 |
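The Encode-Identify-Manipulate pipeline in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: the Hessian Score Distillation Sampling step is replaced here by a simple normalized text-embedding difference as a stand-in editing direction, and all function and variable names (`edit_latent`, `z_joint`, `e_src`, `e_tgt`) are hypothetical.

```python
import numpy as np

def edit_latent(z_joint, e_src, e_tgt, alpha=1.5):
    """Move a joint image-text latent along a text-derived direction.

    z_joint : joint latent embedding of the source image + prompt, shape (d,)
    e_src   : text embedding of the source prompt, shape (d,)
    e_tgt   : text embedding of the target (edited) prompt, shape (d,)
    alpha   : editing strength

    The direction is the normalized difference of the two text embeddings,
    a common heuristic stand-in for the text-guided directions the paper
    identifies with HSDS.
    """
    direction = e_tgt - e_src
    direction = direction / (np.linalg.norm(direction) + 1e-8)
    return z_joint + alpha * direction

# Toy example with random embeddings standing in for a DiT encoder's output.
rng = np.random.default_rng(0)
d = 16
z = rng.standard_normal(d)        # "encoded" joint latent
e_src = rng.standard_normal(d)    # "a photo of a cat"
e_tgt = rng.standard_normal(d)    # "a photo of a smiling cat"

z_edited = edit_latent(z, e_src, e_tgt, alpha=2.0)
# The edit moves the latent by exactly alpha along a unit direction:
print(np.linalg.norm(z_edited - z))  # → 2.0 (up to floating point)
```

Because the direction is unit-normalized, `alpha` directly controls how far the latent moves, which mirrors the abstract's claim that specific semantic attributes can be steered by scaling an editing direction while the rest of the embedding is left untouched.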