Locate, Assign, Refine: Taming Customized Image Inpainting with Text-Subject Guidance
Format: Article
Language: English
Summary: Prior studies have made significant progress in image inpainting guided by either text or a subject image. However, research on editing with their combined guidance is still in its early stages. To tackle this challenge, we present LAR-Gen, a novel approach to image inpainting that enables seamless inpainting of masked scene images, incorporating both textual prompts and specified subjects. Our approach adopts a coarse-to-fine strategy to ensure subject identity preservation and local semantic coherence. The process involves (i) Locate: concatenating the noise with the masked scene image to achieve precise regional editing, (ii) Assign: employing a decoupled cross-attention mechanism to accommodate multi-modal guidance, and (iii) Refine: using a novel RefineNet to supplement subject details. Additionally, to address the issue of scarce training data, we introduce a novel data construction pipeline that extracts a large number of paired local text prompts and corresponding visual instances from a vast image dataset, leveraging publicly available large models. Extensive experiments and varied application scenarios demonstrate the superiority of LAR-Gen in terms of both identity preservation and text semantic consistency. The project page can be found at https://ali-vilab.github.io/largen-page/.
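To make the coarse-to-fine design more concrete, below is a minimal PyTorch sketch of the Locate and Assign steps as the summary describes them: channel-wise concatenation of the noisy latent with the masked scene latent, and a decoupled cross-attention block that attends to text and subject-image embeddings through separate key/value projections. All module names, dimensions, and wiring here are illustrative assumptions, not the authors' released implementation.

```python
# Hypothetical sketch of the Locate and Assign mechanisms (not the official
# LAR-Gen code). Assumes PyTorch >= 2.0 for scaled_dot_product_attention.
import torch
import torch.nn as nn
import torch.nn.functional as F


def locate_input(noisy_latent, masked_scene_latent, mask):
    """Locate: concatenate the noisy latent with the masked scene latent and
    the (downsampled) mask along the channel axis, so the denoiser is
    conditioned to edit only the masked region."""
    return torch.cat([noisy_latent, masked_scene_latent, mask], dim=1)


class DecoupledCrossAttention(nn.Module):
    """Assign: separate key/value projections for the text and subject-image
    embeddings; the two attention outputs are summed (a common decoupling
    scheme, used here as an assumed stand-in for the paper's mechanism)."""

    def __init__(self, dim, ctx_dim, num_heads=8):
        super().__init__()
        self.num_heads = num_heads
        self.to_q = nn.Linear(dim, dim, bias=False)
        self.to_k_text = nn.Linear(ctx_dim, dim, bias=False)
        self.to_v_text = nn.Linear(ctx_dim, dim, bias=False)
        self.to_k_img = nn.Linear(ctx_dim, dim, bias=False)
        self.to_v_img = nn.Linear(ctx_dim, dim, bias=False)
        self.to_out = nn.Linear(dim, dim)

    def _attend(self, q, k, v):
        b, n, d = q.shape
        h = self.num_heads
        # Reshape to (batch, heads, tokens, head_dim) for attention.
        q, k, v = (t.view(b, -1, h, d // h).transpose(1, 2) for t in (q, k, v))
        out = F.scaled_dot_product_attention(q, k, v)
        return out.transpose(1, 2).reshape(b, n, d)

    def forward(self, x, text_ctx, img_ctx, img_scale=1.0):
        q = self.to_q(x)
        text_out = self._attend(q, self.to_k_text(text_ctx), self.to_v_text(text_ctx))
        img_out = self._attend(q, self.to_k_img(img_ctx), self.to_v_img(img_ctx))
        return self.to_out(text_out + img_scale * img_out)


# Toy usage with made-up shapes: 64 spatial tokens, 77 text tokens, 4 subject tokens.
x = torch.randn(1, 64, 320)
attn = DecoupledCrossAttention(dim=320, ctx_dim=768)
out = attn(x, torch.randn(1, 77, 768), torch.randn(1, 4, 768))
latents = locate_input(torch.randn(1, 4, 32, 32),
                       torch.randn(1, 4, 32, 32),
                       torch.randn(1, 1, 32, 32))
```

The Refine step (RefineNet) and the data construction pipeline are omitted here, since the summary does not specify their internals.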
DOI: 10.48550/arxiv.2403.19534