Helping or Herding? Reward Model Ensembles Mitigate but do not Eliminate Reward Hacking

Reward models play a key role in aligning language model applications towards human preferences. However, this setup creates an incentive for the language model to exploit errors in the reward model to achieve high estimated reward, a phenomenon often termed \emph{reward hacking}. A natural mitigation is to train an ensemble of reward models, aggregating over model outputs to obtain a more robust reward estimate. We explore the application of reward ensembles to alignment at both training time (through reinforcement learning) and inference time (through reranking). First, we show that reward models are \emph{underspecified}: reward models that perform similarly in-distribution can yield very different rewards when used in alignment, due to distribution shift. Second, underspecification results in overoptimization, where alignment to one reward model does not improve reward as measured by another reward model trained on the same data. Third, overoptimization is mitigated by the use of reward ensembles, and ensembles that vary by their \emph{pretraining} seeds lead to better generalization than ensembles that differ only by their \emph{fine-tuning} seeds, with both outperforming individual reward models. However, even pretrain reward ensembles do not eliminate reward hacking: we show several qualitative reward hacking phenomena that are not mitigated by ensembling because all reward models in the ensemble exhibit similar error patterns.
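
The abstract's central mechanism, aggregating over an ensemble of reward models to obtain a more robust reward estimate and using that estimate for inference-time alignment via reranking, can be made concrete with a short sketch. The snippet below is a minimal illustration, not the paper's implementation: the toy reward models, the mean/min aggregation options, and the names ensemble_reward and best_of_n are assumptions introduced here for clarity.

```python
# Minimal sketch of inference-time alignment (best-of-n reranking) with a
# reward-model ensemble. The "reward models" here are hypothetical toy scoring
# functions standing in for separately trained models; the paper's actual
# models, prompts, and aggregation details may differ.
import statistics
from typing import Callable, Sequence

# A reward model maps (prompt, response) to a scalar reward.
RewardModel = Callable[[str, str], float]


def ensemble_reward(
    prompt: str,
    response: str,
    reward_models: Sequence[RewardModel],
    aggregation: str = "mean",
) -> float:
    """Aggregate per-model rewards into a single, more robust estimate."""
    scores = [rm(prompt, response) for rm in reward_models]
    if aggregation == "mean":
        return statistics.mean(scores)
    if aggregation == "min":  # conservative: limited by the least favorable model
        return min(scores)
    raise ValueError(f"unknown aggregation: {aggregation}")


def best_of_n(
    prompt: str,
    candidates: Sequence[str],
    reward_models: Sequence[RewardModel],
    aggregation: str = "mean",
) -> str:
    """Rerank n sampled candidates; return the one with the highest ensemble reward."""
    return max(
        candidates,
        key=lambda c: ensemble_reward(prompt, c, reward_models, aggregation),
    )


if __name__ == "__main__":
    # Toy reward models that disagree out of distribution: one rewards length,
    # one rewards brevity, one rewards keyword overlap with the prompt.
    rms: Sequence[RewardModel] = [
        lambda p, r: len(r) / 100.0,
        lambda p, r: 1.0 - len(r) / 100.0,
        lambda p, r: sum(w in r for w in p.split()) / max(len(p.split()), 1),
    ]
    prompt = "Explain reward hacking in one sentence."
    candidates = [
        "Reward hacking is when a policy exploits errors in the reward model.",
        "Great question! " * 5,  # verbose filler a single length-based RM would favor
        "It is bad.",
    ]
    print(best_of_n(prompt, candidates, rms, aggregation="min"))
```

Any such aggregation only helps to the extent that the ensemble members make different errors; as the abstract notes, when every reward model in the ensemble shares the same error pattern, the policy retains an incentive to exploit it, and ensembling does not eliminate the hack.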


Bibliographic Details
Main Authors: Eisenstein, Jacob; Nagpal, Chirag; Agarwal, Alekh; Beirami, Ahmad; D'Amour, Alex; Dvijotham, DJ; Fisch, Adam; Heller, Katherine; Pfohl, Stephen; Ramachandran, Deepak; Shaw, Peter; Berant, Jonathan
Format: Article
Language: English
Subjects: Computer Science - Learning
Online Access: https://arxiv.org/abs/2312.09244
DOI: 10.48550/arxiv.2312.09244
Published: 2023-12-14
Source: arXiv.org