Personas as a Way to Model Truthfulness in Language Models
Main Authors:
Format: Article
Language: English
Subjects:
Online Access: Order full text
Abstract: Large language models (LLMs) are trained on vast amounts of text from the internet, which contains both factual and misleading information about the world. While unintuitive from a classic view of LMs, recent work has shown that the truth value of a statement can be elicited from the model's representations. This paper presents an explanation for why LMs appear to know the truth despite not being trained with truth labels. We hypothesize that the pretraining data is generated by groups of (un)truthful agents whose outputs share common features, forming an (un)truthful persona. By training on this data, LMs can infer and represent the persona in their activation space. This allows the model to separate truth from falsehood and to control the truthfulness of its generation. We show evidence for the persona hypothesis via two observations: (1) we can probe whether a model's answer will be truthful before it is generated; (2) finetuning a model on a set of facts improves its truthfulness on unseen topics. Next, using arithmetic as a synthetic environment, we show that the structure of the pretraining data is crucial for the model to infer the truthful persona. Overall, our findings suggest that models can exploit hierarchical structures in the data to learn abstract concepts like truthfulness.
DOI: 10.48550/arxiv.2310.18168
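
To make the abstract's first observation concrete, here is a minimal sketch of a linear "truthfulness" probe trained on a language model's hidden activations. The model choice (gpt2), the probed layer, and the example statements are illustrative assumptions, not the paper's actual setup; the point is only that a linear classifier on activations can be fit to labeled true/false statements.

```python
# Minimal sketch: linear truthfulness probe on a causal LM's hidden states.
# gpt2, the probed layer, and the statements below are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from sklearn.linear_model import LogisticRegression

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def last_token_hidden(text: str, layer: int = -1):
    """Return the hidden state of the final token at the given layer."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    # out.hidden_states is a tuple of (num_layers + 1) tensors, each (1, seq_len, dim)
    return out.hidden_states[layer][0, -1].numpy()

# Hypothetical labeled statements (1 = true, 0 = false).
statements = [
    ("The capital of France is Paris.", 1),
    ("The capital of France is Rome.", 0),
    ("Water boils at 100 degrees Celsius at sea level.", 1),
    ("Water boils at 10 degrees Celsius at sea level.", 0),
    ("The Earth orbits the Sun.", 1),
    ("The Sun orbits the Earth.", 0),
]

X = [last_token_hidden(s) for s, _ in statements]
y = [label for _, label in statements]

# Fit a linear probe; on held-out statements, its accuracy indicates how
# linearly separable "truthful" vs. "untruthful" activations are.
probe = LogisticRegression(max_iter=1000).fit(X, y)
print("train accuracy:", probe.score(X, y))
```

In practice one would evaluate such a probe on held-out statements and topics; the abstract's claim is that this kind of separation exists in the representation before the answer is generated.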
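
The abstract also describes a synthetic arithmetic environment in which the pretraining data is produced by truthful and untruthful agents whose outputs share persona-level features. The sketch below generates toy data of that flavor; the agent names, the stylistic markers, and the corpus format are assumptions for illustration, not the paper's actual data-generation procedure.

```python
# Minimal sketch of synthetic arithmetic data with (un)truthful personas.
# Agent names, markers, and formatting are illustrative assumptions.
import random

random.seed(0)

# Each persona groups several agents that share a surface-level marker.
PERSONAS = {
    "truthful":   {"agents": ["T1", "T2", "T3"], "marker": "therefore", "correct": True},
    "untruthful": {"agents": ["U1", "U2", "U3"], "marker": "clearly",   "correct": False},
}

def make_statement(correct: bool) -> str:
    """Emit a correct equation, or one perturbed by a random offset."""
    a, b = random.randint(0, 9), random.randint(0, 9)
    result = a + b if correct else a + b + random.randint(1, 5)
    return f"{a} + {b} = {result}"

def make_corpus(n_docs: int = 20):
    """Sample documents whose correctness is tied to the emitting persona."""
    docs = []
    for _ in range(n_docs):
        persona = random.choice(list(PERSONAS.values()))
        agent = random.choice(persona["agents"])
        docs.append(f"{agent}: {persona['marker']}, {make_statement(persona['correct'])}")
    return docs

for doc in make_corpus():
    print(doc)
```

Because correctness co-occurs with the persona's shared features rather than with any single agent, a model trained on such data has an incentive to represent the persona itself, which is the structural property the abstract argues is crucial.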