Helping or Herding? Reward Model Ensembles Mitigate but do not Eliminate Reward Hacking
Reward models play a key role in aligning language model applications towards human preferences. However, this setup creates an incentive for the language model to exploit errors in the reward model to achieve high estimated reward, a phenomenon often termed \emph{reward hacking}. A natural mitigati...
Saved in:
Main authors: | Eisenstein, Jacob; Nagpal, Chirag; Agarwal, Alekh; Beirami, Ahmad; D'Amour, Alex; Dvijotham, DJ; Fisch, Adam; Heller, Katherine; Pfohl, Stephen; Ramachandran, Deepak; Shaw, Peter; Berant, Jonathan |
---|---|
Format: | Article |
Language: | eng |
Subject terms: | Computer Science - Learning |
Online access: | Order full text |
container_end_page | |
---|---|
container_issue | |
container_start_page | |
container_title | |
container_volume | |
creator | Eisenstein, Jacob; Nagpal, Chirag; Agarwal, Alekh; Beirami, Ahmad; D'Amour, Alex; Dvijotham, DJ; Fisch, Adam; Heller, Katherine; Pfohl, Stephen; Ramachandran, Deepak; Shaw, Peter; Berant, Jonathan |
description | Reward models play a key role in aligning language model applications towards
human preferences. However, this setup creates an incentive for the language
model to exploit errors in the reward model to achieve high estimated reward, a
phenomenon often termed \emph{reward hacking}. A natural mitigation is to train
an ensemble of reward models, aggregating over model outputs to obtain a more
robust reward estimate. We explore the application of reward ensembles to
alignment at both training time (through reinforcement learning) and inference
time (through reranking). First, we show that reward models are
\emph{underspecified}: reward models that perform similarly in-distribution can
yield very different rewards when used in alignment, due to distribution shift.
Second, underspecification results in overoptimization, where alignment to one
reward model does not improve reward as measured by another reward model
trained on the same data. Third, overoptimization is mitigated by the use of
reward ensembles, and ensembles that vary by their \emph{pretraining} seeds
lead to better generalization than ensembles that differ only by their
\emph{fine-tuning} seeds, with both outperforming individual reward models.
However, even pretrain reward ensembles do not eliminate reward hacking: we
show several qualitative reward hacking phenomena that are not mitigated by
ensembling because all reward models in the ensemble exhibit similar error
patterns. |
doi_str_mv | 10.48550/arxiv.2312.09244 |
format | Article |
fulltext | fulltext_linktorsrc |
identifier | DOI: 10.48550/arxiv.2312.09244 |
ispartof | |
issn | |
language | eng |
recordid | cdi_arxiv_primary_2312_09244 |
source | arXiv.org |
subjects | Computer Science - Learning |
title | Helping or Herding? Reward Model Ensembles Mitigate but do not Eliminate Reward Hacking |
url | https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-02-04T22%3A09%3A08IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-arxiv_GOX&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Helping%20or%20Herding?%20Reward%20Model%20Ensembles%20Mitigate%20but%20do%20not%20Eliminate%20Reward%20Hacking&rft.au=Eisenstein,%20Jacob&rft.date=2023-12-14&rft_id=info:doi/10.48550/arxiv.2312.09244&rft_dat=%3Carxiv_GOX%3E2312_09244%3C/arxiv_GOX%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_id=info:pmid/&rfr_iscdi=true |
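
The abstract above describes two places where an ensemble of reward models can be used: aggregating member scores into a more robust reward estimate, and reranking candidate responses at inference time. The sketch below is a minimal illustration of that idea, not the authors' implementation; the `RewardModel` callables, the mean/min aggregation options, and the `best_of_n` helper are assumptions introduced for this example.

```python
# Minimal sketch (not the paper's code) of ensemble reward aggregation and
# inference-time best-of-n reranking, as described in the abstract.
from typing import Callable, List, Sequence

# A reward model is treated here as any callable (prompt, response) -> scalar.
RewardModel = Callable[[str, str], float]


def ensemble_reward(
    members: Sequence[RewardModel],
    prompt: str,
    response: str,
    aggregate: str = "mean",
) -> float:
    """Aggregate member rewards; 'min' is the more conservative choice."""
    scores = [rm(prompt, response) for rm in members]
    if aggregate == "min":
        return min(scores)
    return sum(scores) / len(scores)


def best_of_n(
    members: Sequence[RewardModel],
    prompt: str,
    candidates: List[str],
    aggregate: str = "mean",
) -> str:
    """Inference-time reranking: return the candidate with the highest ensemble reward."""
    return max(
        candidates,
        key=lambda c: ensemble_reward(members, prompt, c, aggregate),
    )


if __name__ == "__main__":
    # Toy stand-in scorers: one rewards length, the other penalizes it.
    rm_a = lambda p, r: float(len(r))
    rm_b = lambda p, r: -float(len(r))
    # Under the conservative 'min' aggregate, the shorter response wins,
    # since rm_b penalizes length and the minimum tracks the harshest member.
    print(best_of_n([rm_a, rm_b], "prompt", ["short", "a much longer response"], aggregate="min"))
```

As the abstract notes, this kind of aggregation helps only to the extent that members make different errors; when all members share an error pattern, the aggregate inherits it.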