Evaluating Language Model Character Traits
Main authors: | , , , , , , , , , |
---|---|
Format: | Article |
Language: | eng |
Subjects: | |
Summary: | Language models (LMs) can exhibit human-like behaviour, but it is unclear how
to describe this behaviour without undue anthropomorphism. We formalise a
behaviourist view of LM character traits: qualities such as truthfulness,
sycophancy, or coherent beliefs and intentions, which may manifest as
consistent patterns of behaviour. Our theory is grounded in empirical
demonstrations of LMs exhibiting different character traits, such as accurate
and logically coherent beliefs, and helpful and harmless intentions. We find
that the consistency with which LMs exhibit certain character traits varies
with model size, fine-tuning, and prompting. In addition to characterising LM
character traits, we evaluate how these traits develop over the course of an
interaction. We find that traits such as truthfulness and harmfulness can be
stationary, i.e., consistent over an interaction, in certain contexts, but may
be reflective in different contexts, meaning they mirror the LM's behaviour in
the preceding interaction. Our formalism enables us to describe LM behaviour
precisely in intuitive language, without undue anthropomorphism. |
---|---|
DOI: | 10.48550/arxiv.2410.04272 |