PaCE: Parsimonious Concept Engineering for Large Language Models
Saved in:

| Main authors: | , , , , , , |
|---|---|
| Format: | Article |
| Language: | English |
| Keywords: | |
| Online access: | Order full text |
Summary: Large Language Models (LLMs) are being used for a wide variety of tasks. While they are capable of generating human-like responses, they can also produce undesirable output, including potentially harmful information, racist or sexist language, and hallucinations. Alignment methods are designed to reduce such undesirable outputs via techniques such as fine-tuning, prompt engineering, and representation engineering. However, existing methods face several challenges: some require costly fine-tuning for every alignment task; some do not adequately remove undesirable concepts, failing alignment; some remove benign concepts, lowering the linguistic capabilities of LLMs. To address these issues, we propose Parsimonious Concept Engineering (PaCE), a novel activation engineering framework for alignment. First, to sufficiently model the concepts, we construct a large-scale concept dictionary in the activation space, in which each atom corresponds to a semantic concept. Given any alignment task, we instruct a concept partitioner to efficiently annotate the concepts as benign or undesirable. Then, at inference time, we decompose the LLM activations along the concept dictionary via sparse coding, to accurately represent the activations as linear combinations of benign and undesirable components. By removing the latter from the activations, we reorient the behavior of the LLM towards the alignment goal. We conduct experiments on tasks such as response detoxification, faithfulness enhancement, and sentiment revision, and show that PaCE achieves state-of-the-art alignment performance while maintaining linguistic capabilities.
DOI: | 10.48550/arxiv.2406.04331 |
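The decompose-then-remove step described in the summary can be illustrated with a toy sparse-coding sketch. Everything below is a synthetic stand-in, not the paper's implementation: the dictionary, activation, flagged atom indices, and the use of a simple orthogonal-matching-pursuit solver are all illustrative assumptions.

```python
import numpy as np

# Illustrative stand-ins: D's columns are unit-norm "concept atoms" in a
# 16-dimensional activation space; two atoms are flagged as undesirable.
rng = np.random.default_rng(0)
d, n_atoms = 16, 8
D = rng.normal(size=(d, n_atoms))
D /= np.linalg.norm(D, axis=0)
undesirable = np.zeros(n_atoms, dtype=bool)
undesirable[[2, 5]] = True  # hypothetical "undesirable" concepts

def sparse_code(x, D, k):
    """Greedy orthogonal matching pursuit: select k atoms, refit by least squares."""
    residual, support = x.copy(), []
    coef = np.zeros(0)
    for _ in range(k):
        corr = np.abs(D.T @ residual)
        corr[support] = -np.inf                      # never reselect an atom
        support.append(int(np.argmax(corr)))
        coef, *_ = np.linalg.lstsq(D[:, support], x, rcond=None)
        residual = x - D[:, support] @ coef
    c = np.zeros(D.shape[1])
    c[support] = coef
    return c

# Synthetic "activation": a mix of one undesirable atom (2), one benign
# atom (1), and a little noise.
x = 1.5 * D[:, 2] + 0.8 * D[:, 1] + 0.01 * rng.normal(size=d)

c = sparse_code(x, D, k=3)                           # decompose along the dictionary
x_clean = x - D[:, undesirable] @ c[undesirable]     # subtract undesirable components
```

With this setup, `x_clean` retains the benign component along atom 1 while the projection onto the undesirable atom 2 shrinks; in PaCE the same subtraction would be applied to the LLM's hidden activations at inference time.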