LLM Unlearning via Loss Adjustment with Only Forget Data
Saved in:
Main authors: | , , , , , , , , |
---|---|
Format: | Article |
Language: | eng |
Subjects: | |
Online access: | Order full text |
Summary: | Unlearning in Large Language Models (LLMs) is essential for ensuring
ethical and responsible AI use, especially in addressing privacy leaks, bias,
safety, and evolving regulations. Existing approaches to LLM unlearning often
rely on retain data or a reference LLM, yet they struggle to adequately balance
unlearning performance with overall model utility. This challenge arises
because leveraging explicit retain data or implicit knowledge of retain data
from a reference LLM to fine-tune the model tends to blur the boundaries
between the forget and retain data, as different queries often elicit similar
responses. In this work, we propose eliminating the need for retain data or a
reference LLM for response calibration in LLM unlearning. Recognizing that
directly applying gradient ascent on the forget data often leads to
optimization instability and poor performance, our method guides the LLM on
what not to respond to and, importantly, how to respond, based on the forget
data. Hence, we introduce Forget data only Loss AjustmenT (FLAT), a "flat" loss
adjustment approach that addresses these issues by maximizing the f-divergence
between the available template answers and the forget answers, computed only
with respect to the forget data. The variational form of the defined
f-divergence theoretically provides a way of adjusting the loss by assigning
different importance weights to learning from the template responses and to
forgetting the responses subject to unlearning. Empirical results demonstrate
that our approach not only achieves superior unlearning performance compared to
existing methods but also minimizes the impact on the model's retained
capabilities, ensuring high utility across diverse tasks, including copyrighted
content unlearning on the Harry Potter dataset and the MUSE benchmark, and
entity unlearning on the TOFU dataset. |
---|---|
DOI: | 10.48550/arxiv.2410.11143 |
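
Note on the variational form mentioned in the summary: for a convex function f
with convex conjugate f^*, every f-divergence admits a variational lower bound
(Nguyen, Wainwright, and Jordan), and maximizing that bound between a
template-answer distribution and a forget-answer distribution naturally splits
into two weighted terms. The sketch below shows this generic decomposition, not
the paper's exact objective; the symbols g, f^*, P_tpl, P_fgt, y_tpl, and y_fgt
are illustrative placeholders.

\[
D_f\!\left(P_{\mathrm{tpl}} \,\|\, P_{\mathrm{fgt}}\right)
\;\ge\;
\mathbb{E}_{y \sim P_{\mathrm{tpl}}}\!\big[g(y)\big]
\;-\;
\mathbb{E}_{y' \sim P_{\mathrm{fgt}}}\!\big[f^{*}\!\big(g(y')\big)\big]
\]

An unlearning loss built from this bound, using only forget-data prompts x with
their template answers y_tpl and original forget answers y_fgt, could then take
the form

\[
\mathcal{L}(\theta)
\;=\;
-\,\mathbb{E}_{(x,\,y_{\mathrm{tpl}})}\!\big[g\big(p_\theta(y_{\mathrm{tpl}} \mid x)\big)\big]
\;+\;
\mathbb{E}_{(x,\,y_{\mathrm{fgt}})}\!\big[f^{*}\!\big(g\big(p_\theta(y_{\mathrm{fgt}} \mid x)\big)\big)\big],
\]

where the choice of f (e.g. KL, Jensen-Shannon, Pearson chi-squared) fixes g and
f^*, and thereby the relative importance assigned to learning the template
responses versus pushing probability mass away from the responses to be
forgotten.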