Discovering Language Model Behaviors with Model-Written Evaluations
As language models (LMs) scale, they develop many novel behaviors, good and bad, exacerbating the need to evaluate how they behave. Prior work creates evaluations with crowdwork (which is time-consuming and expensive) or existing data sources (which are not always available). Here, we automatically...
Gespeichert in:
Veröffentlicht in: | arXiv.org 2022-12 |
---|---|
Hauptverfasser: | , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , |
Format: | Artikel |
Sprache: | eng |
Schlagworte: | |
Online-Zugang: | Volltext |
Tags: |
Tag hinzufügen
Keine Tags, Fügen Sie den ersten Tag hinzu!
|
container_end_page | |
---|---|
container_issue | |
container_start_page | |
container_title | arXiv.org |
container_volume | |
creator | Perez, Ethan Ringer, Sam Lukošiūtė, Kamilė Nguyen, Karina Chen, Edwin Scott, Heiner Pettit, Craig Olsson, Catherine Kundu, Sandipan Kadavath, Saurav Jones, Andy Chen, Anna Mann, Ben Israel, Brian Seethor, Bryan McKinnon, Cameron Olah, Christopher Yan, Da Amodei, Daniela Amodei, Dario Drain, Dawn Li, Dustin Tran-Johnson, Eli Khundadze, Guro Jackson Kernion Landis, James Kerr, Jamie Mueller, Jared Jeeyoon Hyun Landau, Joshua Ndousse, Kamal Goldberg, Landon Lovitt, Liane Lucas, Martin Sellitto, Michael Zhang, Miranda Kingsland, Neerav Nelson Elhage Nicholas, Joseph Mercado, Noemí DasSarma, Nova Rausch, Oliver Larson, Robin McCandlish, Sam Johnston, Scott Kravec, Shauna Sheer El Showk Lanham, Tamera Telleen-Lawton, Timothy Brown, Tom Henighan, Tom Hume, Tristan Bai, Yuntao Hatfield-Dodds, Zac Clark, Jack Bowman, Samuel R Askell, Amanda Grosse, Roger Hernandez, Danny Ganguli, Deep Hubinger, Evan Schiefer, Nicholas Kaplan, Jared |
description | As language models (LMs) scale, they develop many novel behaviors, good and bad, exacerbating the need to evaluate how they behave. Prior work creates evaluations with crowdwork (which is time-consuming and expensive) or existing data sources (which are not always available). Here, we automatically generate evaluations with LMs. We explore approaches with varying amounts of human effort, from instructing LMs to write yes/no questions to making complex Winogender schemas with multiple stages of LM-based generation and filtering. Crowdworkers rate the examples as highly relevant and agree with 90-100% of labels, sometimes more so than corresponding human-written datasets. We generate 154 datasets and discover new cases of inverse scaling where LMs get worse with size. Larger LMs repeat back a dialog user's preferred answer ("sycophancy") and express greater desire to pursue concerning goals like resource acquisition and goal preservation. We also find some of the first examples of inverse scaling in RL from Human Feedback (RLHF), where more RLHF makes LMs worse. For example, RLHF makes LMs express stronger political views (on gun rights and immigration) and a greater desire to avoid shut down. Overall, LM-written evaluations are high-quality and let us quickly discover many novel LM behaviors. |
format | Article |
fullrecord | <record><control><sourceid>proquest</sourceid><recordid>TN_cdi_proquest_journals_2755992596</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><sourcerecordid>2755992596</sourcerecordid><originalsourceid>FETCH-proquest_journals_27559925963</originalsourceid><addsrcrecordid>eNqNikEKwjAQAIMgWLR_CHgO1I1p7dVa8aA3wWMJurYpJdFsUr-voA_wNDAzE5aAlCuxWQPMWErUZ1kGeQFKyYRVO0NXN6I3tuVHbduoW-Qnd8OBb7HTo3Ge-MuE7ivFxZsQ0PJ61EPUwThLCza964Ew_XHOlvv6XB3Ew7tnRApN76K3n9RAoVRZgipz-d_1BhJWOoU</addsrcrecordid><sourcetype>Aggregation Database</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>2755992596</pqid></control><display><type>article</type><title>Discovering Language Model Behaviors with Model-Written Evaluations</title><source>Free E- Journals</source><creator>Perez, Ethan ; Ringer, Sam ; Lukošiūtė, Kamilė ; Nguyen, Karina ; Chen, Edwin ; Scott, Heiner ; Pettit, Craig ; Olsson, Catherine ; Kundu, Sandipan ; Kadavath, Saurav ; Jones, Andy ; Chen, Anna ; Mann, Ben ; Israel, Brian ; Seethor, Bryan ; McKinnon, Cameron ; Olah, Christopher ; Yan, Da ; Amodei, Daniela ; Amodei, Dario ; Drain, Dawn ; Li, Dustin ; Tran-Johnson, Eli ; Khundadze, Guro ; Jackson Kernion ; Landis, James ; Kerr, Jamie ; Mueller, Jared ; Jeeyoon Hyun ; Landau, Joshua ; Ndousse, Kamal ; Goldberg, Landon ; Lovitt, Liane ; Lucas, Martin ; Sellitto, Michael ; Zhang, Miranda ; Kingsland, Neerav ; Nelson Elhage ; Nicholas, Joseph ; Mercado, Noemí ; DasSarma, Nova ; Rausch, Oliver ; Larson, Robin ; McCandlish, Sam ; Johnston, Scott ; Kravec, Shauna ; Sheer El Showk ; Lanham, Tamera ; Telleen-Lawton, Timothy ; Brown, Tom ; Henighan, Tom ; Hume, Tristan ; Bai, Yuntao ; Hatfield-Dodds, Zac ; Clark, Jack ; Bowman, Samuel R ; Askell, Amanda ; Grosse, Roger ; Hernandez, Danny ; Ganguli, Deep ; Hubinger, Evan ; Schiefer, Nicholas ; Kaplan, Jared</creator><creatorcontrib>Perez, Ethan ; Ringer, Sam ; Lukošiūtė, Kamilė ; Nguyen, Karina ; Chen, Edwin ; Scott, Heiner ; Pettit, Craig ; Olsson, Catherine ; Kundu, Sandipan ; Kadavath, Saurav ; Jones, Andy ; Chen, Anna ; Mann, Ben ; Israel, Brian ; Seethor, Bryan ; McKinnon, Cameron ; Olah, Christopher ; Yan, Da ; Amodei, Daniela ; Amodei, Dario ; Drain, Dawn ; Li, Dustin ; Tran-Johnson, Eli ; Khundadze, Guro ; Jackson Kernion ; Landis, James ; Kerr, Jamie ; Mueller, Jared ; Jeeyoon Hyun ; Landau, Joshua ; Ndousse, Kamal ; Goldberg, Landon ; Lovitt, Liane ; Lucas, Martin ; Sellitto, Michael ; Zhang, Miranda ; Kingsland, Neerav ; Nelson Elhage ; Nicholas, Joseph ; Mercado, Noemí ; DasSarma, Nova ; Rausch, Oliver ; Larson, Robin ; McCandlish, Sam ; Johnston, Scott ; Kravec, Shauna ; Sheer El Showk ; Lanham, Tamera ; Telleen-Lawton, Timothy ; Brown, Tom ; Henighan, Tom ; Hume, Tristan ; Bai, Yuntao ; Hatfield-Dodds, Zac ; Clark, Jack ; Bowman, Samuel R ; Askell, Amanda ; Grosse, Roger ; Hernandez, Danny ; Ganguli, Deep ; Hubinger, Evan ; Schiefer, Nicholas ; Kaplan, Jared</creatorcontrib><description>As language models (LMs) scale, they develop many novel behaviors, good and bad, exacerbating the need to evaluate how they behave. Prior work creates evaluations with crowdwork (which is time-consuming and expensive) or existing data sources (which are not always available). Here, we automatically generate evaluations with LMs. We explore approaches with varying amounts of human effort, from instructing LMs to write yes/no questions to making complex Winogender schemas with multiple stages of LM-based generation and filtering. Crowdworkers rate the examples as highly relevant and agree with 90-100% of labels, sometimes more so than corresponding human-written datasets. We generate 154 datasets and discover new cases of inverse scaling where LMs get worse with size. Larger LMs repeat back a dialog user's preferred answer ("sycophancy") and express greater desire to pursue concerning goals like resource acquisition and goal preservation. We also find some of the first examples of inverse scaling in RL from Human Feedback (RLHF), where more RLHF makes LMs worse. For example, RLHF makes LMs express stronger political views (on gun rights and immigration) and a greater desire to avoid shut down. Overall, LM-written evaluations are high-quality and let us quickly discover many novel LM behaviors.</description><identifier>EISSN: 2331-8422</identifier><language>eng</language><publisher>Ithaca: Cornell University Library, arXiv.org</publisher><subject>Datasets ; Shutdowns</subject><ispartof>arXiv.org, 2022-12</ispartof><rights>2022. This work is published under http://arxiv.org/licenses/nonexclusive-distrib/1.0/ (the “License”). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.</rights><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><link.rule.ids>776,780</link.rule.ids></links><search><creatorcontrib>Perez, Ethan</creatorcontrib><creatorcontrib>Ringer, Sam</creatorcontrib><creatorcontrib>Lukošiūtė, Kamilė</creatorcontrib><creatorcontrib>Nguyen, Karina</creatorcontrib><creatorcontrib>Chen, Edwin</creatorcontrib><creatorcontrib>Scott, Heiner</creatorcontrib><creatorcontrib>Pettit, Craig</creatorcontrib><creatorcontrib>Olsson, Catherine</creatorcontrib><creatorcontrib>Kundu, Sandipan</creatorcontrib><creatorcontrib>Kadavath, Saurav</creatorcontrib><creatorcontrib>Jones, Andy</creatorcontrib><creatorcontrib>Chen, Anna</creatorcontrib><creatorcontrib>Mann, Ben</creatorcontrib><creatorcontrib>Israel, Brian</creatorcontrib><creatorcontrib>Seethor, Bryan</creatorcontrib><creatorcontrib>McKinnon, Cameron</creatorcontrib><creatorcontrib>Olah, Christopher</creatorcontrib><creatorcontrib>Yan, Da</creatorcontrib><creatorcontrib>Amodei, Daniela</creatorcontrib><creatorcontrib>Amodei, Dario</creatorcontrib><creatorcontrib>Drain, Dawn</creatorcontrib><creatorcontrib>Li, Dustin</creatorcontrib><creatorcontrib>Tran-Johnson, Eli</creatorcontrib><creatorcontrib>Khundadze, Guro</creatorcontrib><creatorcontrib>Jackson Kernion</creatorcontrib><creatorcontrib>Landis, James</creatorcontrib><creatorcontrib>Kerr, Jamie</creatorcontrib><creatorcontrib>Mueller, Jared</creatorcontrib><creatorcontrib>Jeeyoon Hyun</creatorcontrib><creatorcontrib>Landau, Joshua</creatorcontrib><creatorcontrib>Ndousse, Kamal</creatorcontrib><creatorcontrib>Goldberg, Landon</creatorcontrib><creatorcontrib>Lovitt, Liane</creatorcontrib><creatorcontrib>Lucas, Martin</creatorcontrib><creatorcontrib>Sellitto, Michael</creatorcontrib><creatorcontrib>Zhang, Miranda</creatorcontrib><creatorcontrib>Kingsland, Neerav</creatorcontrib><creatorcontrib>Nelson Elhage</creatorcontrib><creatorcontrib>Nicholas, Joseph</creatorcontrib><creatorcontrib>Mercado, Noemí</creatorcontrib><creatorcontrib>DasSarma, Nova</creatorcontrib><creatorcontrib>Rausch, Oliver</creatorcontrib><creatorcontrib>Larson, Robin</creatorcontrib><creatorcontrib>McCandlish, Sam</creatorcontrib><creatorcontrib>Johnston, Scott</creatorcontrib><creatorcontrib>Kravec, Shauna</creatorcontrib><creatorcontrib>Sheer El Showk</creatorcontrib><creatorcontrib>Lanham, Tamera</creatorcontrib><creatorcontrib>Telleen-Lawton, Timothy</creatorcontrib><creatorcontrib>Brown, Tom</creatorcontrib><creatorcontrib>Henighan, Tom</creatorcontrib><creatorcontrib>Hume, Tristan</creatorcontrib><creatorcontrib>Bai, Yuntao</creatorcontrib><creatorcontrib>Hatfield-Dodds, Zac</creatorcontrib><creatorcontrib>Clark, Jack</creatorcontrib><creatorcontrib>Bowman, Samuel R</creatorcontrib><creatorcontrib>Askell, Amanda</creatorcontrib><creatorcontrib>Grosse, Roger</creatorcontrib><creatorcontrib>Hernandez, Danny</creatorcontrib><creatorcontrib>Ganguli, Deep</creatorcontrib><creatorcontrib>Hubinger, Evan</creatorcontrib><creatorcontrib>Schiefer, Nicholas</creatorcontrib><creatorcontrib>Kaplan, Jared</creatorcontrib><title>Discovering Language Model Behaviors with Model-Written Evaluations</title><title>arXiv.org</title><description>As language models (LMs) scale, they develop many novel behaviors, good and bad, exacerbating the need to evaluate how they behave. Prior work creates evaluations with crowdwork (which is time-consuming and expensive) or existing data sources (which are not always available). Here, we automatically generate evaluations with LMs. We explore approaches with varying amounts of human effort, from instructing LMs to write yes/no questions to making complex Winogender schemas with multiple stages of LM-based generation and filtering. Crowdworkers rate the examples as highly relevant and agree with 90-100% of labels, sometimes more so than corresponding human-written datasets. We generate 154 datasets and discover new cases of inverse scaling where LMs get worse with size. Larger LMs repeat back a dialog user's preferred answer ("sycophancy") and express greater desire to pursue concerning goals like resource acquisition and goal preservation. We also find some of the first examples of inverse scaling in RL from Human Feedback (RLHF), where more RLHF makes LMs worse. For example, RLHF makes LMs express stronger political views (on gun rights and immigration) and a greater desire to avoid shut down. Overall, LM-written evaluations are high-quality and let us quickly discover many novel LM behaviors.</description><subject>Datasets</subject><subject>Shutdowns</subject><issn>2331-8422</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2022</creationdate><recordtype>article</recordtype><sourceid>BENPR</sourceid><recordid>eNqNikEKwjAQAIMgWLR_CHgO1I1p7dVa8aA3wWMJurYpJdFsUr-voA_wNDAzE5aAlCuxWQPMWErUZ1kGeQFKyYRVO0NXN6I3tuVHbduoW-Qnd8OBb7HTo3Ge-MuE7ivFxZsQ0PJ61EPUwThLCza964Ew_XHOlvv6XB3Ew7tnRApN76K3n9RAoVRZgipz-d_1BhJWOoU</recordid><startdate>20221219</startdate><enddate>20221219</enddate><creator>Perez, Ethan</creator><creator>Ringer, Sam</creator><creator>Lukošiūtė, Kamilė</creator><creator>Nguyen, Karina</creator><creator>Chen, Edwin</creator><creator>Scott, Heiner</creator><creator>Pettit, Craig</creator><creator>Olsson, Catherine</creator><creator>Kundu, Sandipan</creator><creator>Kadavath, Saurav</creator><creator>Jones, Andy</creator><creator>Chen, Anna</creator><creator>Mann, Ben</creator><creator>Israel, Brian</creator><creator>Seethor, Bryan</creator><creator>McKinnon, Cameron</creator><creator>Olah, Christopher</creator><creator>Yan, Da</creator><creator>Amodei, Daniela</creator><creator>Amodei, Dario</creator><creator>Drain, Dawn</creator><creator>Li, Dustin</creator><creator>Tran-Johnson, Eli</creator><creator>Khundadze, Guro</creator><creator>Jackson Kernion</creator><creator>Landis, James</creator><creator>Kerr, Jamie</creator><creator>Mueller, Jared</creator><creator>Jeeyoon Hyun</creator><creator>Landau, Joshua</creator><creator>Ndousse, Kamal</creator><creator>Goldberg, Landon</creator><creator>Lovitt, Liane</creator><creator>Lucas, Martin</creator><creator>Sellitto, Michael</creator><creator>Zhang, Miranda</creator><creator>Kingsland, Neerav</creator><creator>Nelson Elhage</creator><creator>Nicholas, Joseph</creator><creator>Mercado, Noemí</creator><creator>DasSarma, Nova</creator><creator>Rausch, Oliver</creator><creator>Larson, Robin</creator><creator>McCandlish, Sam</creator><creator>Johnston, Scott</creator><creator>Kravec, Shauna</creator><creator>Sheer El Showk</creator><creator>Lanham, Tamera</creator><creator>Telleen-Lawton, Timothy</creator><creator>Brown, Tom</creator><creator>Henighan, Tom</creator><creator>Hume, Tristan</creator><creator>Bai, Yuntao</creator><creator>Hatfield-Dodds, Zac</creator><creator>Clark, Jack</creator><creator>Bowman, Samuel R</creator><creator>Askell, Amanda</creator><creator>Grosse, Roger</creator><creator>Hernandez, Danny</creator><creator>Ganguli, Deep</creator><creator>Hubinger, Evan</creator><creator>Schiefer, Nicholas</creator><creator>Kaplan, Jared</creator><general>Cornell University Library, arXiv.org</general><scope>8FE</scope><scope>8FG</scope><scope>ABJCF</scope><scope>ABUWG</scope><scope>AFKRA</scope><scope>AZQEC</scope><scope>BENPR</scope><scope>BGLVJ</scope><scope>CCPQU</scope><scope>DWQXO</scope><scope>HCIFZ</scope><scope>L6V</scope><scope>M7S</scope><scope>PIMPY</scope><scope>PQEST</scope><scope>PQQKQ</scope><scope>PQUKI</scope><scope>PRINS</scope><scope>PTHSS</scope></search><sort><creationdate>20221219</creationdate><title>Discovering Language Model Behaviors with Model-Written Evaluations</title><author>Perez, Ethan ; Ringer, Sam ; Lukošiūtė, Kamilė ; Nguyen, Karina ; Chen, Edwin ; Scott, Heiner ; Pettit, Craig ; Olsson, Catherine ; Kundu, Sandipan ; Kadavath, Saurav ; Jones, Andy ; Chen, Anna ; Mann, Ben ; Israel, Brian ; Seethor, Bryan ; McKinnon, Cameron ; Olah, Christopher ; Yan, Da ; Amodei, Daniela ; Amodei, Dario ; Drain, Dawn ; Li, Dustin ; Tran-Johnson, Eli ; Khundadze, Guro ; Jackson Kernion ; Landis, James ; Kerr, Jamie ; Mueller, Jared ; Jeeyoon Hyun ; Landau, Joshua ; Ndousse, Kamal ; Goldberg, Landon ; Lovitt, Liane ; Lucas, Martin ; Sellitto, Michael ; Zhang, Miranda ; Kingsland, Neerav ; Nelson Elhage ; Nicholas, Joseph ; Mercado, Noemí ; DasSarma, Nova ; Rausch, Oliver ; Larson, Robin ; McCandlish, Sam ; Johnston, Scott ; Kravec, Shauna ; Sheer El Showk ; Lanham, Tamera ; Telleen-Lawton, Timothy ; Brown, Tom ; Henighan, Tom ; Hume, Tristan ; Bai, Yuntao ; Hatfield-Dodds, Zac ; Clark, Jack ; Bowman, Samuel R ; Askell, Amanda ; Grosse, Roger ; Hernandez, Danny ; Ganguli, Deep ; Hubinger, Evan ; Schiefer, Nicholas ; Kaplan, Jared</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-proquest_journals_27559925963</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2022</creationdate><topic>Datasets</topic><topic>Shutdowns</topic><toplevel>online_resources</toplevel><creatorcontrib>Perez, Ethan</creatorcontrib><creatorcontrib>Ringer, Sam</creatorcontrib><creatorcontrib>Lukošiūtė, Kamilė</creatorcontrib><creatorcontrib>Nguyen, Karina</creatorcontrib><creatorcontrib>Chen, Edwin</creatorcontrib><creatorcontrib>Scott, Heiner</creatorcontrib><creatorcontrib>Pettit, Craig</creatorcontrib><creatorcontrib>Olsson, Catherine</creatorcontrib><creatorcontrib>Kundu, Sandipan</creatorcontrib><creatorcontrib>Kadavath, Saurav</creatorcontrib><creatorcontrib>Jones, Andy</creatorcontrib><creatorcontrib>Chen, Anna</creatorcontrib><creatorcontrib>Mann, Ben</creatorcontrib><creatorcontrib>Israel, Brian</creatorcontrib><creatorcontrib>Seethor, Bryan</creatorcontrib><creatorcontrib>McKinnon, Cameron</creatorcontrib><creatorcontrib>Olah, Christopher</creatorcontrib><creatorcontrib>Yan, Da</creatorcontrib><creatorcontrib>Amodei, Daniela</creatorcontrib><creatorcontrib>Amodei, Dario</creatorcontrib><creatorcontrib>Drain, Dawn</creatorcontrib><creatorcontrib>Li, Dustin</creatorcontrib><creatorcontrib>Tran-Johnson, Eli</creatorcontrib><creatorcontrib>Khundadze, Guro</creatorcontrib><creatorcontrib>Jackson Kernion</creatorcontrib><creatorcontrib>Landis, James</creatorcontrib><creatorcontrib>Kerr, Jamie</creatorcontrib><creatorcontrib>Mueller, Jared</creatorcontrib><creatorcontrib>Jeeyoon Hyun</creatorcontrib><creatorcontrib>Landau, Joshua</creatorcontrib><creatorcontrib>Ndousse, Kamal</creatorcontrib><creatorcontrib>Goldberg, Landon</creatorcontrib><creatorcontrib>Lovitt, Liane</creatorcontrib><creatorcontrib>Lucas, Martin</creatorcontrib><creatorcontrib>Sellitto, Michael</creatorcontrib><creatorcontrib>Zhang, Miranda</creatorcontrib><creatorcontrib>Kingsland, Neerav</creatorcontrib><creatorcontrib>Nelson Elhage</creatorcontrib><creatorcontrib>Nicholas, Joseph</creatorcontrib><creatorcontrib>Mercado, Noemí</creatorcontrib><creatorcontrib>DasSarma, Nova</creatorcontrib><creatorcontrib>Rausch, Oliver</creatorcontrib><creatorcontrib>Larson, Robin</creatorcontrib><creatorcontrib>McCandlish, Sam</creatorcontrib><creatorcontrib>Johnston, Scott</creatorcontrib><creatorcontrib>Kravec, Shauna</creatorcontrib><creatorcontrib>Sheer El Showk</creatorcontrib><creatorcontrib>Lanham, Tamera</creatorcontrib><creatorcontrib>Telleen-Lawton, Timothy</creatorcontrib><creatorcontrib>Brown, Tom</creatorcontrib><creatorcontrib>Henighan, Tom</creatorcontrib><creatorcontrib>Hume, Tristan</creatorcontrib><creatorcontrib>Bai, Yuntao</creatorcontrib><creatorcontrib>Hatfield-Dodds, Zac</creatorcontrib><creatorcontrib>Clark, Jack</creatorcontrib><creatorcontrib>Bowman, Samuel R</creatorcontrib><creatorcontrib>Askell, Amanda</creatorcontrib><creatorcontrib>Grosse, Roger</creatorcontrib><creatorcontrib>Hernandez, Danny</creatorcontrib><creatorcontrib>Ganguli, Deep</creatorcontrib><creatorcontrib>Hubinger, Evan</creatorcontrib><creatorcontrib>Schiefer, Nicholas</creatorcontrib><creatorcontrib>Kaplan, Jared</creatorcontrib><collection>ProQuest SciTech Collection</collection><collection>ProQuest Technology Collection</collection><collection>Materials Science & Engineering Collection</collection><collection>ProQuest Central (Alumni Edition)</collection><collection>ProQuest Central UK/Ireland</collection><collection>ProQuest Central Essentials</collection><collection>ProQuest Central</collection><collection>Technology Collection (ProQuest)</collection><collection>ProQuest One Community College</collection><collection>ProQuest Central Korea</collection><collection>SciTech Premium Collection</collection><collection>ProQuest Engineering Collection</collection><collection>Engineering Database</collection><collection>Publicly Available Content Database</collection><collection>ProQuest One Academic Eastern Edition (DO NOT USE)</collection><collection>ProQuest One Academic</collection><collection>ProQuest One Academic UKI Edition</collection><collection>ProQuest Central China</collection><collection>Engineering Collection</collection></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>Perez, Ethan</au><au>Ringer, Sam</au><au>Lukošiūtė, Kamilė</au><au>Nguyen, Karina</au><au>Chen, Edwin</au><au>Scott, Heiner</au><au>Pettit, Craig</au><au>Olsson, Catherine</au><au>Kundu, Sandipan</au><au>Kadavath, Saurav</au><au>Jones, Andy</au><au>Chen, Anna</au><au>Mann, Ben</au><au>Israel, Brian</au><au>Seethor, Bryan</au><au>McKinnon, Cameron</au><au>Olah, Christopher</au><au>Yan, Da</au><au>Amodei, Daniela</au><au>Amodei, Dario</au><au>Drain, Dawn</au><au>Li, Dustin</au><au>Tran-Johnson, Eli</au><au>Khundadze, Guro</au><au>Jackson Kernion</au><au>Landis, James</au><au>Kerr, Jamie</au><au>Mueller, Jared</au><au>Jeeyoon Hyun</au><au>Landau, Joshua</au><au>Ndousse, Kamal</au><au>Goldberg, Landon</au><au>Lovitt, Liane</au><au>Lucas, Martin</au><au>Sellitto, Michael</au><au>Zhang, Miranda</au><au>Kingsland, Neerav</au><au>Nelson Elhage</au><au>Nicholas, Joseph</au><au>Mercado, Noemí</au><au>DasSarma, Nova</au><au>Rausch, Oliver</au><au>Larson, Robin</au><au>McCandlish, Sam</au><au>Johnston, Scott</au><au>Kravec, Shauna</au><au>Sheer El Showk</au><au>Lanham, Tamera</au><au>Telleen-Lawton, Timothy</au><au>Brown, Tom</au><au>Henighan, Tom</au><au>Hume, Tristan</au><au>Bai, Yuntao</au><au>Hatfield-Dodds, Zac</au><au>Clark, Jack</au><au>Bowman, Samuel R</au><au>Askell, Amanda</au><au>Grosse, Roger</au><au>Hernandez, Danny</au><au>Ganguli, Deep</au><au>Hubinger, Evan</au><au>Schiefer, Nicholas</au><au>Kaplan, Jared</au><format>book</format><genre>document</genre><ristype>GEN</ristype><atitle>Discovering Language Model Behaviors with Model-Written Evaluations</atitle><jtitle>arXiv.org</jtitle><date>2022-12-19</date><risdate>2022</risdate><eissn>2331-8422</eissn><abstract>As language models (LMs) scale, they develop many novel behaviors, good and bad, exacerbating the need to evaluate how they behave. Prior work creates evaluations with crowdwork (which is time-consuming and expensive) or existing data sources (which are not always available). Here, we automatically generate evaluations with LMs. We explore approaches with varying amounts of human effort, from instructing LMs to write yes/no questions to making complex Winogender schemas with multiple stages of LM-based generation and filtering. Crowdworkers rate the examples as highly relevant and agree with 90-100% of labels, sometimes more so than corresponding human-written datasets. We generate 154 datasets and discover new cases of inverse scaling where LMs get worse with size. Larger LMs repeat back a dialog user's preferred answer ("sycophancy") and express greater desire to pursue concerning goals like resource acquisition and goal preservation. We also find some of the first examples of inverse scaling in RL from Human Feedback (RLHF), where more RLHF makes LMs worse. For example, RLHF makes LMs express stronger political views (on gun rights and immigration) and a greater desire to avoid shut down. Overall, LM-written evaluations are high-quality and let us quickly discover many novel LM behaviors.</abstract><cop>Ithaca</cop><pub>Cornell University Library, arXiv.org</pub><oa>free_for_read</oa></addata></record> |
fulltext | fulltext |
identifier | EISSN: 2331-8422 |
ispartof | arXiv.org, 2022-12 |
issn | 2331-8422 |
language | eng |
recordid | cdi_proquest_journals_2755992596 |
source | Free E- Journals |
subjects | Datasets Shutdowns |
title | Discovering Language Model Behaviors with Model-Written Evaluations |
url | https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-02-06T16%3A37%3A59IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest&rft_val_fmt=info:ofi/fmt:kev:mtx:book&rft.genre=document&rft.atitle=Discovering%20Language%20Model%20Behaviors%20with%20Model-Written%20Evaluations&rft.jtitle=arXiv.org&rft.au=Perez,%20Ethan&rft.date=2022-12-19&rft.eissn=2331-8422&rft_id=info:doi/&rft_dat=%3Cproquest%3E2755992596%3C/proquest%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=2755992596&rft_id=info:pmid/&rfr_iscdi=true |