Eliciting Latent Knowledge from Quirky Language Models

Eliciting Latent Knowledge (ELK) aims to find patterns in a capable neural network's activations that robustly track the true state of the world, especially in hard-to-verify cases where the model's output is untrusted. To further ELK research, we introduce 12 datasets and a corresponding...

Ausführliche Beschreibung

Gespeichert in:

Bibliographische Detailangaben
Veröffentlicht in:	arXiv.org 2024-08
Hauptverfasser:	Mallen, Alex, Brumley, Madeline, Kharchenko, Julia, Belrose, Nora
Format:	Artikel
Sprache:	eng
Schlagworte:	Anomalies Neural networks Systematic errors
Online-Zugang:	Volltext
Tags:	Tag hinzufügen Keine Tags, Fügen Sie den ersten Tag hinzu!

Schreiben Sie den ersten Kommentar!