Diversity and Diffusion: Observations on Synthetic Image Distributions with Stable Diffusion
Saved in:
Main authors: , ,
Format: Article
Language: eng
Subjects:
Summary: Recent progress in text-to-image (TTI) systems, such as StableDiffusion, Imagen, and DALL-E 2, has made it possible to create realistic images with simple text prompts. It is tempting to use these systems to eliminate the manual task of obtaining natural images for training a new machine learning classifier. However, in all of the experiments performed to date, classifiers trained solely with synthetic images perform poorly at inference, despite the images used for training appearing realistic. Examining this apparent incongruity in detail gives insight into the limitations of the underlying image generation processes. Through the lens of diversity in image creation vs. accuracy of what is created, we dissect the differences in semantic mismatches in what is modeled in synthetic vs. natural images. This will elucidate the roles of the image-language model, CLIP, and the image generation model, diffusion. We find four issues that limit the usefulness of TTI systems for this task: ambiguity, adherence to prompt, lack of diversity, and inability to represent the underlying concept. We further present surprising insights into the geometry of CLIP embeddings.
DOI: 10.48550/arxiv.2311.00056