Harmonizing Visual and Textual Embeddings for Zero-Shot Text-to-Image Customization
Format: Article
Language: English
Online access: Order full text
Abstract: Amid a surge of text-to-image (T2I) models and customization methods that generate new images of a user-provided subject, current works focus on alleviating the cost of lengthy per-subject optimization. These zero-shot customization methods encode the image of a specified subject into a visual embedding, which is then used alongside the textual embedding for diffusion guidance. The visual embedding incorporates intrinsic information about the subject, while the textual embedding provides a new, transient context. However, existing methods often 1) are strongly affected by the input image, e.g., generating images with the same pose, and 2) exhibit deterioration of the subject's identity. We first pin down the problem and show that redundant pose information in the visual embedding interferes with the textual embedding that carries the desired pose information. To address this issue, we propose an orthogonal visual embedding that effectively harmonizes with the given textual embedding. We also adopt a visual-only embedding and inject the subject's distinct features using a self-attention swap. Our results demonstrate the effectiveness and robustness of our method, which offers highly flexible zero-shot generation while effectively maintaining the subject's identity.
DOI: 10.48550/arxiv.2403.14155
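
To make the orthogonality idea in the abstract concrete, below is a minimal PyTorch sketch of one way a visual embedding could be projected onto the orthogonal complement of the textual embedding, so that pose cues specified by the text are not overridden by redundant cues in the image. The function name, tensor shapes, and the Gram-Schmidt-style projection are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def orthogonalize_visual_embedding(visual_emb: torch.Tensor,
                                   text_emb: torch.Tensor) -> torch.Tensor:
    """Project the visual embedding onto the orthogonal complement of the
    textual embedding, so the text-specified context (e.g., pose) is not
    overridden by redundant cues carried in the visual embedding.

    visual_emb: (batch, dim) subject embedding from an image encoder.
    text_emb:   (batch, dim) embedding of the target prompt.

    NOTE: hypothetical sketch; the paper's exact formulation may differ.
    """
    # Unit direction of the textual embedding.
    text_dir = F.normalize(text_emb, dim=-1)
    # Component of the visual embedding along the text direction.
    parallel = (visual_emb * text_dir).sum(dim=-1, keepdim=True) * text_dir
    # Keep only the part orthogonal to the text embedding.
    return visual_emb - parallel


# Usage (shapes are illustrative):
visual = torch.randn(1, 768)   # subject embedding from an image encoder
text = torch.randn(1, 768)     # prompt embedding, e.g. "a dog jumping"
ortho_visual = orthogonalize_visual_embedding(visual, text)
# The projected embedding has (near-)zero dot product with the text direction.
print(torch.abs((ortho_visual * F.normalize(text, dim=-1)).sum()).item())
```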
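Similarly, the self-attention swap mentioned in the abstract can be sketched as follows, assuming it means replacing the keys and values of a self-attention layer during generation with those computed from a reference pass over the subject image, so appearance features are injected while the generated layout still follows the prompt. All names, shapes, and the replace-versus-concatenate choice are assumptions rather than the authors' implementation.

```python
import torch

def self_attention_swap(q_gen: torch.Tensor,
                        k_ref: torch.Tensor,
                        v_ref: torch.Tensor) -> torch.Tensor:
    """One self-attention block of the generation pass, with keys/values
    swapped in from a reference pass over the subject image.

    q_gen:        (batch, tokens, dim) query projection from the generated latent.
    k_ref, v_ref: (batch, tokens, dim) key/value projections from the reference latent.

    NOTE: hypothetical sketch; which layers/timesteps are swapped, and whether
    reference keys/values replace or are concatenated with the generated ones,
    is a design choice the abstract does not spell out.
    """
    scale = q_gen.shape[-1] ** -0.5
    # Queries come from the image being generated (layout/pose follows the text),
    # while keys/values come from the subject image (appearance is injected).
    attn = torch.softmax(q_gen @ k_ref.transpose(-2, -1) * scale, dim=-1)
    return attn @ v_ref


# Usage (illustrative shapes): 4096 latent tokens, 320-dim features.
q = torch.randn(1, 4096, 320)
k_ref = torch.randn(1, 4096, 320)
v_ref = torch.randn(1, 4096, 320)
out = self_attention_swap(q, k_ref, v_ref)
```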