Discovering Language Model Behaviors with Model-Written Evaluations

As language models (LMs) scale, they develop many novel behaviors, good and bad, exacerbating the need to evaluate how they behave. Prior work creates evaluations with crowdwork (which is time-consuming and expensive) or existing data sources (which are not always available). Here, we automatically...

Ausführliche Beschreibung

Gespeichert in:
Bibliographische Detailangaben
Veröffentlicht in:arXiv.org 2022-12
Hauptverfasser: Perez, Ethan, Ringer, Sam, Lukošiūtė, Kamilė, Nguyen, Karina, Chen, Edwin, Scott, Heiner, Pettit, Craig, Olsson, Catherine, Kundu, Sandipan, Kadavath, Saurav, Jones, Andy, Chen, Anna, Mann, Ben, Israel, Brian, Seethor, Bryan, McKinnon, Cameron, Olah, Christopher, Yan, Da, Amodei, Daniela, Amodei, Dario, Drain, Dawn, Li, Dustin, Tran-Johnson, Eli, Khundadze, Guro, Jackson Kernion, Landis, James, Kerr, Jamie, Mueller, Jared, Jeeyoon Hyun, Landau, Joshua, Ndousse, Kamal, Goldberg, Landon, Lovitt, Liane, Lucas, Martin, Sellitto, Michael, Zhang, Miranda, Kingsland, Neerav, Nelson Elhage, Nicholas, Joseph, Mercado, Noemí, DasSarma, Nova, Rausch, Oliver, Larson, Robin, McCandlish, Sam, Johnston, Scott, Kravec, Shauna, Sheer El Showk, Lanham, Tamera, Telleen-Lawton, Timothy, Brown, Tom, Henighan, Tom, Hume, Tristan, Bai, Yuntao, Hatfield-Dodds, Zac, Clark, Jack, Bowman, Samuel R, Askell, Amanda, Grosse, Roger, Hernandez, Danny, Ganguli, Deep, Hubinger, Evan, Schiefer, Nicholas, Kaplan, Jared
Format: Artikel
Sprache:eng
Schlagworte:
Online-Zugang:Volltext
Tags: Tag hinzufügen
Keine Tags, Fügen Sie den ersten Tag hinzu!