DESTEIN: Navigating Detoxification of Language Models via Universal Steering Pairs and Head-wise Activation Fusion
Main authors: | , , , |
---|---|
Format: | Article |
Language: | eng |
Subjects: | |
Online access: | Order full text |
Summary: | Despite the remarkable achievements of language models (LMs) across a broad
spectrum of tasks, their propensity for generating toxic outputs remains a
prevalent concern. Current solutions involving finetuning or auxiliary models
usually require extensive computational resources, hindering their practicality
in large language models (LLMs). In this paper, we propose DeStein, a novel
method that detoxifies LMs by applying representation engineering in activation
spaces with lower resource and time costs. Specifically, we derive
detoxification vectors from self-induced, universal steering pairs through
arithmetic operations in activation spaces. During inference, detoxification is
achieved by fusing the detoxification vectors with the original representations
in a head-wise manner. Empirical results demonstrate that our method
significantly outperforms previous state-of-the-art approaches on various
metrics, while also maintaining satisfactory generation quality and diversity.
We further validate the practicality and scalability of DeStein with a series
of white-box LLMs. The method is open-sourced at
https://github.com/LizLizLi/DeStein. Warning: Some example model outputs may
contain highly offensive or disturbing text. |
---|---|
DOI: | 10.48550/arxiv.2404.10464 |
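
The summary describes two steps: deriving detoxification vectors from steering pairs by arithmetic in activation space, and fusing those vectors with the original representations head by head at inference. The sketch below illustrates one plausible reading of that description and is not the authors' implementation. It assumes the "arithmetic operation" is a mean difference between activations of paired non-toxic and toxic prompts, and that fusion is a weighted addition per attention head. The names `mean_difference`, `fuse_headwise`, `head_weights`, and `alpha` are hypothetical and do not come from the record or the DeStein repository.

```python
from typing import List

Vector = List[float]
HeadActs = List[Vector]  # one activation vector per attention head


def mean_difference(nontoxic: List[HeadActs], toxic: List[HeadActs]) -> HeadActs:
    """Per-head steering vector: mean over pairs of (non-toxic - toxic).

    Assumption: the i-th entries of `nontoxic` and `toxic` form one steering pair.
    """
    n_pairs = len(nontoxic)
    n_heads = len(nontoxic[0])
    dim = len(nontoxic[0][0])
    return [
        [
            sum(nontoxic[p][h][d] - toxic[p][h][d] for p in range(n_pairs)) / n_pairs
            for d in range(dim)
        ]
        for h in range(n_heads)
    ]


def fuse_headwise(
    acts: HeadActs, steer: HeadActs, head_weights: Vector, alpha: float
) -> HeadActs:
    """Shift each head's activation by its steering vector.

    Assumption: each head has its own weight and `alpha` is a global strength.
    """
    return [
        [a + alpha * w * s for a, s in zip(head_act, head_steer)]
        for head_act, head_steer, w in zip(acts, steer, head_weights)
    ]


# Toy data: 2 steering pairs, 2 heads, 2 dimensions per head.
nontoxic = [[[1.0, 0.0], [0.0, 2.0]], [[3.0, 0.0], [0.0, 4.0]]]
toxic = [[[0.0, 0.0], [0.0, 0.0]], [[1.0, 0.0], [0.0, 2.0]]]

steer = mean_difference(nontoxic, toxic)
fused = fuse_headwise([[1.0, 1.0], [1.0, 1.0]], steer, head_weights=[1.0, 0.5], alpha=2.0)
print(steer)  # [[1.5, 0.0], [0.0, 2.0]]
print(fused)  # [[4.0, 1.0], [1.0, 3.0]]
```

In a real model, the activations would be read from and written back to the attention heads during the forward pass. How the per-head weights are chosen is not stated in the summary, so the values here are placeholders.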