Reward Generalization in RLHF: A Topological Perspective
Existing alignment methods share a common topology of information flow, where reward information is collected from humans, modeled with preference learning, and used to tune language models. However, this shared topology has not been systematically characterized, nor have its alternatives been thoroughly explored, leaving the problems of low data efficiency and unreliable generalization unaddressed. As a solution, we introduce a theoretical framework for investigating reward generalization in reinforcement learning from human feedback (RLHF), focusing on the topology of information flow at both macro and micro levels. At the macro level, we portray the RLHF information flow as an autoencoding process over behavior distributions, formalizing the RLHF objective of distributional consistency between human preference and model behavior. At the micro level, we present induced Bayesian networks as a theory of reward generalization in RLHF, introducing fine-grained dataset topologies into generalization bounds. Combining analysis on both levels, we propose reward modeling from tree-structured preference information. It is shown to reduce reward uncertainty by up to $\Theta(\log n/\log\log n)$ times compared to baselines, where $n$ is the dataset size. Validation on three NLP tasks shows that our tree-based reward model achieves an average win rate of 65% against baseline methods, thus improving reward generalization for free via topology design.
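The following is a minimal, hypothetical sketch (not the authors' implementation) of the dataset-topology idea in the abstract: with the same annotation budget of n-1 pairwise comparisons, preferences can be collected along a chain or along a balanced tree, and a plain Bradley-Terry reward model can be fit on either comparison graph. All names (`chain_pairs`, `tree_pairs`, `bradley_terry_mle`) and the toy data are invented for illustration.

```python
# Hypothetical illustration (not the paper's code): two dataset topologies for
# pairwise preference collection, each using n - 1 comparisons, and a simple
# Bradley-Terry reward model fit by gradient ascent on either graph.
import math
import random

random.seed(0)


def chain_pairs(n):
    """Chain topology: response i is compared with response i + 1."""
    return [(i, i + 1) for i in range(n - 1)]


def tree_pairs(n, branching=2):
    """Tree topology: each response is compared with its parent in a roughly
    balanced tree, so comparison chains have depth O(log n)."""
    return [((i - 1) // branching, i) for i in range(1, n)]


def bradley_terry_mle(n, pairs, labels, lr=0.1, steps=2000):
    """Fit scalar rewards r by gradient ascent on the Bradley-Terry
    log-likelihood, where P(i preferred over j) = sigmoid(r_i - r_j)."""
    r = [0.0] * n
    for _ in range(steps):
        grad = [0.0] * n
        for (i, j), y in zip(pairs, labels):
            p = 1.0 / (1.0 + math.exp(-(r[i] - r[j])))  # model P(i beats j)
            grad[i] += y - p
            grad[j] -= y - p
        r = [ri + lr * g for ri, g in zip(r, grad)]
    return r


# Toy experiment: latent "true" rewards generate noisy preference labels.
n = 16
true_r = [random.gauss(0.0, 1.0) for _ in range(n)]


def sample_label(i, j):
    """Label is 1 if i is (noisily) preferred over j, else 0."""
    p = 1.0 / (1.0 + math.exp(-(true_r[i] - true_r[j])))
    return 1 if random.random() < p else 0


for name, pairs in [("chain", chain_pairs(n)), ("tree", tree_pairs(n))]:
    labels = [sample_label(i, j) for i, j in pairs]
    est = bradley_terry_mle(n, pairs, labels)
    # Count item pairs whose estimated ordering matches the true ordering.
    agree = sum(
        (true_r[i] - true_r[j]) * (est[i] - est[j]) > 0
        for i in range(n)
        for j in range(i + 1, n)
    )
    print(f"{name}: ranking agreement {agree}/{n * (n - 1) // 2}")
```

This toy only shows that the same annotation budget can be spent on differently shaped comparison graphs; in the paper's framework it is the topology of this graph, analyzed via induced Bayesian networks, that governs the reward-generalization bound.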
Saved in:
Main Authors: | Qiu, Tianyi; Zeng, Fanzhi; Ji, Jiaming; Yan, Dong; Wang, Kaile; Zhou, Jiayi; Han, Yang; Dai, Josef; Pan, Xuehai; Yang, Yaodong |
---|---|
Format: | Article |
Language: | eng |
Subjects: | Computer Science - Artificial Intelligence; Computer Science - Computation and Language; Computer Science - Discrete Mathematics; Computer Science - Learning |
Online access: | Order full text |
creator | Qiu, Tianyi Zeng, Fanzhi Ji, Jiaming Yan, Dong Wang, Kaile Zhou, Jiayi Han, Yang Dai, Josef Pan, Xuehai Yang, Yaodong |
description | Existing alignment methods share a common topology of information flow, where
reward information is collected from humans, modeled with preference learning,
and used to tune language models. However, this shared topology has not been
systematically characterized, nor have its alternatives been thoroughly
explored, leaving the problems of low data efficiency and unreliable
generalization unaddressed. As a solution, we introduce a theoretical framework
for investigating reward generalization in reinforcement learning from human
feedback (RLHF), focusing on the topology of information flow at both macro and
micro levels. At the macro level, we portray the RLHF information flow as an
autoencoding process over behavior distributions, formalizing the RLHF
objective of distributional consistency between human preference and model
behavior. At the micro level, we present induced Bayesian networks as a theory
of reward generalization in RLHF, introducing fine-grained dataset topologies
into generalization bounds. Combining analysis on both levels, we propose
reward modeling from tree-structured preference information. It is shown to
reduce reward uncertainty by up to $\Theta(\log n/\log\log n)$ times compared
to baselines, where $n$ is the dataset size. Validation on three NLP tasks
shows that our tree-based reward model achieves an average win rate of 65%
against baseline methods, thus improving reward generalization for free via
topology design. |
doi_str_mv | 10.48550/arxiv.2402.10184 |
format | Article |
creationdate | 2024-02-15 |
rights | http://arxiv.org/licenses/nonexclusive-distrib/1.0 |
linktorsrc | https://arxiv.org/abs/2402.10184 |
fulltext | fulltext_linktorsrc |
identifier | DOI: 10.48550/arxiv.2402.10184 |
language | eng |
recordid | cdi_arxiv_primary_2402_10184 |
source | arXiv.org |
subjects | Computer Science - Artificial Intelligence; Computer Science - Computation and Language; Computer Science - Discrete Mathematics; Computer Science - Learning |
title | Reward Generalization in RLHF: A Topological Perspective |