Adversarial Nibbler: An Open Red-Teaming Method for Identifying Diverse Harms in Text-to-Image Generation
Format: Article
Language: English
Abstract: With the rise of text-to-image (T2I) generative AI models reaching wide
audiences, it is critical to evaluate model robustness against non-obvious
attacks to mitigate the generation of offensive images. By focusing on
"implicitly adversarial" prompts (those that trigger T2I models to generate
unsafe images for non-obvious reasons), we isolate a set of difficult safety
issues that human creativity is well-suited to uncover. To this end, we built
the Adversarial Nibbler Challenge, a red-teaming methodology for crowdsourcing
a diverse set of implicitly adversarial prompts. We have assembled a suite of
state-of-the-art T2I models, employed a simple user interface to identify and
annotate harms, and engaged diverse populations to capture long-tail safety
issues that may be overlooked in standard testing. The challenge is run in
consecutive rounds to enable a sustained discovery and analysis of safety
pitfalls in T2I models.
In this paper, we present an in-depth account of our methodology, a
systematic study of novel attack strategies, and a discussion of safety failures
revealed by challenge participants. We also release a companion visualization
tool for easy exploration and derivation of insights from the dataset. The
first challenge round resulted in over 10k prompt-image pairs with machine
annotations for safety. A subset of 1.5k samples contains rich human
annotations of harm types and attack styles. We find that 14% of images that
humans consider harmful are mislabeled as "safe" by machines. We have
identified new attack strategies that highlight the complexity of ensuring T2I
model robustness. Our findings emphasize the necessity of continual auditing
and adaptation as new vulnerabilities emerge. We are confident that this work
will enable proactive, iterative safety assessments and promote responsible
development of T2I models.
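To illustrate the kind of comparison behind the reported 14% figure, below is a minimal Python sketch of computing the fraction of human-flagged harmful images that an automated safety rater labels "safe". The Sample fields, the machine_miss_rate helper, and the toy records are assumptions for illustration only; they are not the paper's released schema or code.

```python
from dataclasses import dataclass

@dataclass
class Sample:
    # Hypothetical per-image record: one human verdict and one machine verdict.
    prompt: str
    human_harmful: bool   # at least one human annotator flagged the image as harmful
    machine_safe: bool    # the automated safety rater labeled the image "safe"

def machine_miss_rate(samples):
    """Fraction of human-flagged harmful images that the machine labeled 'safe'."""
    harmful = [s for s in samples if s.human_harmful]
    if not harmful:
        return 0.0
    missed = sum(1 for s in harmful if s.machine_safe)
    return missed / len(harmful)

# Toy data for illustration (not drawn from the Adversarial Nibbler dataset).
samples = [
    Sample("prompt A", human_harmful=True,  machine_safe=False),
    Sample("prompt B", human_harmful=True,  machine_safe=True),   # machine miss
    Sample("prompt C", human_harmful=False, machine_safe=True),
]
print(f"Machine miss rate: {machine_miss_rate(samples):.0%}")
```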
DOI: 10.48550/arxiv.2403.12075