UNDIAL: Self-Distillation with Adjusted Logits for Robust Unlearning in Large Language Models
Main authors:
Format: Article
Language: eng
Keywords:
Online access: Order full text
Abstract: Mitigating the retention of sensitive or private information in large language models is essential for enhancing privacy and safety. Existing unlearning methods, such as Gradient Ascent and Negative Preference Optimization, directly tune models to remove unwanted information. However, these methods often become unstable because they fine-tune by maximizing cross-entropy loss, the opposite of traditional loss minimization in learning. This reversal creates instability, especially on larger datasets, as the model struggles to balance unlearning with maintaining language capacity, leading to over-unlearning. In this paper, we introduce UnDIAL (Unlearning via Self-Distillation on Adjusted Logits), a novel and robust unlearning method. Our approach leverages self-distillation to adjust logits and selectively reduce the influence of targeted tokens. This technique ensures smooth convergence and avoids catastrophic forgetting, even in challenging unlearning tasks with large datasets and sequential unlearning requests. Extensive experiments show that UnDIAL achieves both robustness in unlearning and scalability while maintaining stable training dynamics and resilience to hyperparameter tuning.
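The mechanism the abstract describes can be pictured with a minimal sketch. This is not the authors' implementation: it assumes a Hugging Face-style causal LM whose forward pass returns `.logits`, and the names `forget_mask`, `gamma`, and `adjusted_logit_distillation_loss` are illustrative. A frozen copy of the model serves as the "self" teacher; its logits for tokens marked for unlearning are lowered before the softmax, and the model is then trained with an ordinary distillation cross-entropy toward that adjusted distribution.

```python
import torch
import torch.nn.functional as F

def adjusted_logit_distillation_loss(model, frozen_teacher, input_ids,
                                     forget_mask, gamma=5.0):
    """Sketch of self-distillation on adjusted logits (hypothetical API).

    frozen_teacher: frozen copy of the pre-unlearning model (the "self" teacher)
    forget_mask:    bool tensor [batch, seq] marking token positions to unlearn
    gamma:          adjustment strength (illustrative name and default)
    """
    labels = input_ids[:, 1:]  # next-token targets for each position
    with torch.no_grad():
        teacher_logits = frozen_teacher(input_ids).logits[:, :-1, :]
        # Subtract gamma from the logit of each targeted token so that the
        # softmax redistributes its probability mass over the vocabulary.
        penalty = gamma * forget_mask[:, 1:].to(teacher_logits.dtype)
        adjusted = teacher_logits.scatter_add(
            -1, labels.unsqueeze(-1), -penalty.unsqueeze(-1))
        teacher_probs = F.softmax(adjusted, dim=-1)

    student_logits = model(input_ids).logits[:, :-1, :]
    log_probs = F.log_softmax(student_logits, dim=-1)
    # Ordinary cross-entropy toward the adjusted teacher distribution:
    # a loss that is minimized, unlike gradient ascent's negated loss.
    return -(teacher_probs * log_probs).sum(dim=-1).mean()
```

Because the objective stays a standard cross-entropy minimization, training behaves like ordinary fine-tuning, which is the stability property the abstract contrasts with loss-maximizing methods such as Gradient Ascent.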
DOI: 10.48550/arxiv.2402.10052