Information Theoretic Text-to-Image Alignment
Main authors: , , , ,
Format: Article
Language: English
Online access: Order full text
Abstract: Diffusion models for Text-to-Image (T2I) conditional generation have recently achieved tremendous success. Yet, aligning these models with users' intentions still involves a laborious trial-and-error process, and this challenging alignment problem has attracted considerable attention from the research community. In this work, instead of relying on fine-grained linguistic analyses of prompts, human annotation, or auxiliary vision-language models, we use Mutual Information (MI) to guide model alignment. In brief, our method uses self-supervised fine-tuning and relies on a point-wise MI estimate between prompts and images to create a synthetic fine-tuning set that improves model alignment. Our analysis indicates that our method is superior to the state of the art, yet it requires only the pre-trained denoising network of the T2I model itself to estimate MI, together with a simple fine-tuning strategy that improves alignment while maintaining image quality.
DOI: 10.48550/arxiv.2405.20759
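The abstract does not spell out how MI is estimated, but for diffusion models a point-wise MI value i(x; c) = log p(x|c) - log p(x) can, in principle, be read off as the gap between unconditional and conditional denoising errors averaged over noise levels. Below is a minimal PyTorch sketch of that idea, not the paper's actual implementation: the `eps_model` callable, its `(x_t, t, cond)` signature, and the `alphas_cumprod` schedule are hypothetical names introduced here for illustration.

```python
import torch

@torch.no_grad()
def pointwise_mi(eps_model, x0, prompt_emb, null_emb, alphas_cumprod, n_samples=50):
    """Monte Carlo sketch of point-wise MI i(x; c) = log p(x|c) - log p(x),
    estimated as the gap between unconditional and conditional denoising
    errors. All names and signatures here are illustrative assumptions."""
    T = alphas_cumprod.shape[0]
    score = torch.zeros(x0.shape[0], device=x0.device)
    for _ in range(n_samples):
        # Sample a random diffusion time step and noise the clean image x0.
        t = torch.randint(0, T, (x0.shape[0],), device=x0.device)
        a = alphas_cumprod[t].view(-1, 1, 1, 1)
        noise = torch.randn_like(x0)
        xt = a.sqrt() * x0 + (1.0 - a).sqrt() * noise
        # Denoising error with and without the prompt conditioning.
        err_cond = (eps_model(xt, t, prompt_emb) - noise).pow(2).flatten(1).sum(-1)
        err_uncond = (eps_model(xt, t, null_emb) - noise).pow(2).flatten(1).sum(-1)
        # Conditioning on a well-aligned prompt should reduce the error.
        score += err_uncond - err_cond
    return score / n_samples  # larger value: image better explained by the prompt
```

Under these assumptions, the synthetic fine-tuning set described in the abstract could be built by generating several candidate images per prompt, keeping the highest-scoring one, and fine-tuning the T2I model on the selected prompt-image pairs.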