Hierarchical Vision-Language Alignment for Text-to-Image Generation via Diffusion Models
Format: Article
Language: English
Online Access: Order full text
Abstract: Text-to-image generation has witnessed significant advancements with the integration of Large Vision-Language Models (LVLMs), yet challenges remain in aligning complex textual descriptions with high-quality, visually coherent images. This paper introduces the Vision-Language Aligned Diffusion (VLAD) model, a generative framework that addresses these challenges through a dual-stream strategy combining semantic alignment and hierarchical diffusion. VLAD utilizes a Contextual Composition Module (CCM) to decompose textual prompts into global and local representations, ensuring precise alignment with visual features. Furthermore, it incorporates a multi-stage diffusion process with hierarchical guidance to generate high-fidelity images. Experiments on the MARIO-Eval and INNOVATOR-Eval benchmarks demonstrate that VLAD significantly outperforms state-of-the-art methods in image quality, semantic alignment, and text rendering accuracy. Human evaluations further validate VLAD's superior performance, making it a promising approach for text-to-image generation in complex scenarios.
DOI: 10.48550/arxiv.2501.00917
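
The abstract describes VLAD only at a high level. The sketch below is a minimal, hypothetical PyTorch illustration of the dual-stream idea it outlines: a stand-in Contextual Composition Module splits a prompt into one global and several local representations, and a toy multi-stage sampling loop applies global guidance in early steps and local guidance in later steps. All module names, tensor shapes, stage schedules, and the Euler-style update are assumptions made for illustration, not the paper's actual implementation.

```python
# Hypothetical sketch of VLAD's dual-stream strategy as described in the
# abstract. Every name and shape here is an illustrative assumption; the
# paper's real architecture is not reproduced.
import torch
import torch.nn as nn


class ContextualCompositionModule(nn.Module):
    """Toy stand-in for the CCM: maps a prompt embedding sequence to one
    global vector plus per-token local vectors."""

    def __init__(self, dim: int = 64):
        super().__init__()
        self.global_proj = nn.Linear(dim, dim)
        self.local_proj = nn.Linear(dim, dim)

    def forward(self, token_embs: torch.Tensor):
        # token_embs: (batch, tokens, dim)
        global_rep = self.global_proj(token_embs.mean(dim=1))  # (batch, dim)
        local_reps = self.local_proj(token_embs)               # (batch, tokens, dim)
        return global_rep, local_reps


class ToyDenoiser(nn.Module):
    """Placeholder denoiser conditioned on a single guidance vector."""

    def __init__(self, dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim * 2, dim), nn.SiLU(), nn.Linear(dim, dim)
        )

    def forward(self, x: torch.Tensor, cond: torch.Tensor):
        return self.net(torch.cat([x, cond], dim=-1))


def hierarchical_sample(denoiser, ccm, token_embs,
                        stages=(("global", 10), ("local", 10))):
    """Multi-stage loop: early steps follow the global representation,
    later steps follow pooled local representations (one assumed reading
    of 'hierarchical guidance')."""
    global_rep, local_reps = ccm(token_embs)
    x = torch.randn_like(global_rep)  # latent image stand-in
    for level, steps in stages:
        cond = global_rep if level == "global" else local_reps.mean(dim=1)
        for _ in range(steps):
            x = x - 0.1 * denoiser(x, cond)  # toy Euler-style update
    return x


if __name__ == "__main__":
    dim = 64
    ccm, denoiser = ContextualCompositionModule(dim), ToyDenoiser(dim)
    prompt_tokens = torch.randn(1, 12, dim)  # pretend text-encoder output
    latent = hierarchical_sample(denoiser, ccm, prompt_tokens)
    print(latent.shape)  # torch.Size([1, 64])
```

In this reading, the coarse-to-fine schedule mirrors the abstract's claim that global semantics anchor early denoising while local detail is injected later; the real model presumably operates on image latents and learned text encoders rather than these toy tensors.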