Safety Alignment Backfires: Preventing the Re-emergence of Suppressed Concepts in Fine-tuned Text-to-Image Diffusion Models
Saved in:
Main authors: | , , , |
---|---|
Format: | Article |
Language: | English |
Subjects: | |
Online access: | Order full text |
Abstract: | Fine-tuning text-to-image diffusion models is widely used for personalization
and adaptation to new domains. In this paper, we identify a critical
vulnerability of fine-tuning: safety alignment methods designed to filter
harmful content (e.g., nudity) can break down during fine-tuning, allowing
previously suppressed content to resurface, even when using benign datasets.
While this "fine-tuning jailbreaking" issue is known in large language models,
it remains largely unexplored in text-to-image diffusion models. Our
investigation reveals that standard fine-tuning can inadvertently undo safety
measures, causing models to relearn harmful concepts that were previously
removed and even exacerbate harmful behaviors. To address this issue, we
present a novel yet straightforward solution, Modular LoRA, which involves
training Safety Low-Rank Adaptation (LoRA) modules separately from Fine-Tuning
LoRA components and merging them during inference. This method effectively
prevents the re-learning of harmful content without compromising the model's
performance on new tasks. Our experiments demonstrate that Modular LoRA
outperforms traditional fine-tuning methods in maintaining safety alignment,
offering a practical approach for enhancing the security of text-to-image
diffusion models against potential attacks. |
---|---|
DOI: | 10.48550/arxiv.2412.00357 |
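
The abstract above describes Modular LoRA only at a high level: Safety LoRA modules are trained separately from Fine-Tuning LoRA modules, and the two are merged only at inference, so fine-tuning never updates the safety weights. The sketch below is a minimal PyTorch illustration of that separation under assumptions; it is not the authors' implementation, and all class, parameter, and variable names (`LoRAAdapter`, `ModularLoRALinear`, `rank`, `alpha`) are hypothetical.

```python
# Minimal sketch of the "Modular LoRA" idea: a frozen base layer with two
# independently trained low-rank adapters (safety vs. fine-tuning) whose
# deltas are only combined at inference time. Illustrative, not the paper's code.
import torch
import torch.nn as nn


class LoRAAdapter(nn.Module):
    """Low-rank weight update: delta_W = (alpha / rank) * B @ A."""

    def __init__(self, in_features, out_features, rank=4, alpha=1.0):
        super().__init__()
        self.A = nn.Parameter(torch.randn(rank, in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(out_features, rank))  # zero-init => no-op at start
        self.scale = alpha / rank

    def delta(self):
        return self.scale * (self.B @ self.A)


class ModularLoRALinear(nn.Module):
    """Frozen base linear layer plus two separately trained LoRA branches:
    one for safety alignment, one for downstream fine-tuning."""

    def __init__(self, base: nn.Linear, rank=4):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)  # the base model stays frozen throughout
        self.safety = LoRAAdapter(base.in_features, base.out_features, rank)
        self.finetune = LoRAAdapter(base.in_features, base.out_features, rank)

    def forward(self, x, use_safety=True, use_finetune=True):
        w = self.base.weight
        if use_safety:
            w = w + self.safety.delta()    # safety delta, trained on the safety objective only
        if use_finetune:
            w = w + self.finetune.delta()  # fine-tuning delta, trained on the new-domain data only
        return nn.functional.linear(x, w, self.base.bias)


# Train each adapter in its own phase with its own optimizer, e.g.
#   opt_safety   = torch.optim.AdamW(layer.safety.parameters(),   lr=1e-4)
#   opt_finetune = torch.optim.AdamW(layer.finetune.parameters(), lr=1e-4)
# and merge both deltas only at inference:
layer = ModularLoRALinear(nn.Linear(768, 768), rank=8)
x = torch.randn(2, 768)
y = layer(x)  # forward pass with safety + fine-tuning adapters merged
```

The point this sketch illustrates is that the safety adapter's parameters never appear in the fine-tuning optimizer's parameter group, so benign fine-tuning cannot drift them back toward the suppressed concepts; both low-rank deltas are simply added to the frozen base weight at inference.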