Navigating the Safety Landscape: Measuring Risks in Finetuning Large Language Models
Format: | Article |
Language: | English |
Abstract: | Safety alignment is crucial to ensure that large language models (LLMs) behave in ways that align with human preferences and prevent harmful actions during inference. However, recent studies show that this alignment can be easily compromised by finetuning on only a few adversarially designed training examples. We aim to measure the risks of finetuning LLMs by navigating the LLM safety landscape. We discover a new phenomenon, observed universally in the model parameter space of popular open-source LLMs, which we term the "safety basin": random perturbations to model weights maintain the safety level of the original aligned model within its local neighborhood. Outside this local region, however, safety is fully compromised, exhibiting a sharp, step-like drop. This safety basin contrasts sharply with the LLM capability landscape, where model performance peaks at the origin and gradually declines as the perturbation magnitude increases. This discovery motivates our new VISAGE safety metric, which measures the safety of LLM finetuning by probing the model's safety landscape. Visualizing the safety landscape of the aligned model enables us to understand how finetuning compromises safety by dragging the model away from the safety basin. The LLM safety landscape also highlights the system prompt's critical role in protecting a model, and shows that this protection transfers to the model's perturbed variants within the safety basin. These observations from our safety landscape research provide new insights for future work in the LLM safety community. Our code is publicly available at https://github.com/ShengYun-Peng/llm-landscape. |
DOI: | 10.48550/arxiv.2405.17374 |
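
The abstract describes probing the safety landscape by randomly perturbing an aligned model's weights and checking whether safety is preserved in a local neighborhood. Below is a minimal sketch of what such a 1-D probe might look like, assuming a PyTorch model and a hypothetical user-supplied `evaluate_safety` callable (e.g., one minus the attack success rate on a harmful-instruction benchmark). The actual VISAGE metric and normalization details are defined in the paper and the linked repository.

```python
import copy
import torch

def probe_safety_landscape(model, evaluate_safety, alphas):
    """Sweep a 1-D slice of an aligned LLM's safety landscape.

    The weights are perturbed along one random direction with increasing
    magnitude alpha, and a safety score is recorded at each point. A flat,
    high-safety region around alpha = 0 followed by a sharp drop would be
    the "safety basin" behavior described in the abstract.
    """
    # Draw one random direction per parameter tensor and rescale it to that
    # parameter's norm so the perturbation is comparable across layers.
    directions = {}
    for name, p in model.named_parameters():
        d = torch.randn_like(p)
        directions[name] = d * (p.norm() / (d.norm() + 1e-12))

    original = copy.deepcopy(model.state_dict())
    scores = []
    for alpha in alphas:
        # theta(alpha) = theta_0 + alpha * direction (buffers left untouched)
        perturbed = {
            name: tensor + alpha * directions[name] if name in directions else tensor
            for name, tensor in original.items()
        }
        model.load_state_dict(perturbed)
        scores.append(evaluate_safety(model))  # higher = safer (assumed)

    model.load_state_dict(original)  # restore the aligned weights
    return scores
```

Plotting `scores` against `alphas` swept in both directions (e.g., `alphas = [i / 10 - 1.0 for i in range(21)]`) gives a 1-D slice of the kind of landscape the paper visualizes; the repository linked above contains the authors' actual implementation.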