Fine-tuning language models to find agreement among humans with diverse preferences

Recent work in large language models (LLMs) has used fine-tuning to align outputs with the preferences of a prototypical user. This work assumes that human preferences are static and homogeneous across individuals, so that aligning to a single "generic" user will confer more general alignment. Here, we embrace the heterogeneity of human preferences to consider a different challenge: how might a machine help people with diverse views find agreement? We fine-tune a 70 billion parameter LLM to generate statements that maximize the expected approval for a group of people with potentially diverse opinions. Human participants provide written opinions on thousands of questions touching on moral and political issues (e.g., "should we raise taxes on the rich?"), and rate the LLM's generated candidate consensus statements for agreement and quality. A reward model is then trained to predict individual preferences, enabling it to quantify and rank consensus statements in terms of their appeal to the overall group, defined according to different aggregation (social welfare) functions. The model produces consensus statements that are preferred by human users over those from prompted LLMs (>70%) and significantly outperforms a tight fine-tuned baseline that lacks the final ranking step. Further, our best model's consensus statements are preferred over the best human-generated opinions (>65%). We find that when we silently constructed consensus statements from the opinions of only a subset of group members, those who were excluded were more likely to dissent, revealing the sensitivity of the consensus to individual contributions. These results highlight the potential to use LLMs to help groups of humans align their values with one another.
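
The ranking step described in the abstract — a reward model predicts each group member's approval of a candidate consensus statement, and the per-member predictions are aggregated with a social welfare function to score and rank the candidates — can be sketched in a few lines. The snippet below is a minimal illustration, not the paper's implementation: the reward model is a hypothetical callable, and the utilitarian (mean) and egalitarian (minimum) aggregations are only two examples of the welfare functions the abstract alludes to.

```python
def welfare(member_rewards, kind="utilitarian"):
    """Collapse one candidate's predicted per-member approval into a group score.
    Both aggregations are illustrative social welfare functions."""
    if kind == "utilitarian":    # maximize the average predicted approval
        return sum(member_rewards) / len(member_rewards)
    if kind == "egalitarian":    # maximize the worst-off member's predicted approval
        return min(member_rewards)
    raise ValueError(f"unknown welfare function: {kind}")


def rank_candidates(candidates, reward_model, members, kind="utilitarian"):
    """Score every candidate statement for every member with a (hypothetical)
    reward_model(member, statement) callable, aggregate per-member scores with
    the chosen welfare function, and return candidates from most to least
    group-preferred."""
    scored = [(welfare([reward_model(m, c) for m in members], kind), c)
              for c in candidates]
    return [c for _, c in sorted(scored, key=lambda pair: pair[0], reverse=True)]


# Toy usage with a stand-in reward model that counts word overlap with each
# member's written opinion; a real system would use a learned preference model.
opinions = {
    "alice": "raise taxes on the rich to fund public services",
    "bob": "keep taxes low but close tax loopholes",
}

def toy_reward_model(member, statement):
    return len(set(opinions[member].lower().split()) & set(statement.lower().split()))

candidates = [
    "Raise taxes on the rich while closing loopholes that benefit them.",
    "Keep taxes low for everyone and close unfair loopholes.",
]
print(rank_candidates(candidates, toy_reward_model, list(opinions), kind="egalitarian"))
```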

Bibliographic details
Main authors: Bakker, Michiel A; Chadwick, Martin J; Sheahan, Hannah R; Tessler, Michael Henry; Campbell-Gillingham, Lucy; Balaguer, Jan; McAleese, Nat; Glaese, Amelia; Aslanides, John; Botvinick, Matthew M; Summerfield, Christopher
Format: Article
Language: English
Subjects: Computer Science - Computation and Language; Computer Science - Learning
Online access: https://arxiv.org/abs/2211.15006
DOI: 10.48550/arxiv.2211.15006
Source: arXiv.org
Date: 2022-11-27