Espresso: Robust Concept Filtering in Text-to-Image Models
Format: Article
Language: English
Online access: Order full text
Abstract: Diffusion-based text-to-image models are trained on large datasets scraped from the Internet, potentially containing unacceptable concepts (e.g., copyright-infringing or unsafe). We need concept removal techniques (CRTs) which are i) effective in preventing the generation of images with unacceptable concepts, ii) utility-preserving on acceptable concepts, and iii) robust against evasion with adversarial prompts. No prior CRT satisfies all these requirements simultaneously. We introduce Espresso, the first robust concept filter based on Contrastive Language-Image Pre-Training (CLIP). We identify unacceptable concepts by using the distance between the embedding of a generated image and the text embeddings of both unacceptable and acceptable concepts. This lets us fine-tune for robustness by separating the text embeddings of unacceptable and acceptable concepts while preserving utility. We present a pipeline to evaluate various CRTs and show that Espresso is more effective and robust than prior CRTs, while retaining utility.
DOI: 10.48550/arxiv.2404.19227
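The filtering idea described in the abstract, comparing a generated image's CLIP embedding against text embeddings of an unacceptable and an acceptable concept, can be sketched as follows. This is a minimal illustration assuming the Hugging Face transformers CLIP implementation; the model checkpoint, concept prompts, and simple nearest-concept decision rule are illustrative assumptions, not the authors' exact configuration or fine-tuning procedure.

```python
# Sketch of a CLIP-based concept filter in the spirit of Espresso:
# flag a generated image when its embedding is closer to the text
# embedding of an unacceptable concept than to an acceptable one.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Illustrative checkpoint; Espresso's actual (fine-tuned) weights differ.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def is_unacceptable(image: Image.Image,
                    unacceptable: str,
                    acceptable: str) -> bool:
    """Return True if the image embedding is closer to the unacceptable concept."""
    inputs = processor(text=[unacceptable, acceptable],
                       images=image,
                       return_tensors="pt",
                       padding=True)
    with torch.no_grad():
        out = model(**inputs)
    # Normalize so dot products are cosine similarities.
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    sims = (img @ txt.T).squeeze(0)  # [sim_to_unacceptable, sim_to_acceptable]
    return bool(sims[0] > sims[1])

# Hypothetical usage: screen an image produced by a text-to-image model.
# image = Image.open("generated.png")
# blocked = is_unacceptable(image,
#                           unacceptable="a photo of a copyrighted cartoon character",
#                           acceptable="a photo of a generic cartoon character")
```

Espresso's robustness additionally comes from fine-tuning to separate the text embeddings of unacceptable and acceptable concepts, which this untrained sketch does not include.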