Reward Generalization in RLHF: A Topological Perspective

Existing alignment methods share a common topology of information flow, where reward information is collected from humans, modeled with preference learning, and used to tune language models. However, this shared topology has not been systematically characterized, nor have its alternatives been thoroughly explored, leaving the problems of low data efficiency and unreliable generalization unaddressed. As a solution, we introduce a theoretical framework for investigating reward generalization in reinforcement learning from human feedback (RLHF), focusing on the topology of information flow at both macro and micro levels. At the macro level, we portray the RLHF information flow as an autoencoding process over behavior distributions, formalizing the RLHF objective of distributional consistency between human preference and model behavior. At the micro level, we present induced Bayesian networks as a theory of reward generalization in RLHF, introducing fine-grained dataset topologies into generalization bounds. Combining analysis on both levels, we propose reward modeling from tree-structured preference information. It is shown to reduce reward uncertainty by up to $\Theta(\log n/\log\log n)$ times compared to baselines, where $n$ is the dataset size. Validation on three NLP tasks shows that our tree-based reward model achieves an average win rate of 65% against baseline methods, thus improving reward generalization for free via topology design.

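To make the topology idea concrete, below is a minimal toy sketch contrasting a chain-shaped comparison graph with a tree-shaped (sibling-grouped) one over the same set of sampled responses. The function names, branching factor, and construction are illustrative assumptions, not the paper's actual dataset-building procedure; the point is only that the shape of the comparison graph changes while the labeling budget stays comparable.

```python
import itertools

# Toy illustration (not the paper's exact construction): two ways to
# arrange pairwise preference comparisons over the same responses.
# In the chain topology each response is compared only to its successor;
# in the tree topology responses are grouped into sibling sets under a
# shared parent and compared within each group, so local comparisons
# are denser at a similar annotation cost.

def chain_comparisons(responses):
    """Compare each response to the next one: a path graph with n-1 edges."""
    return [(responses[i], responses[i + 1]) for i in range(len(responses) - 1)]

def tree_comparisons(responses, branching=3):
    """Group responses into sibling sets of size `branching` and compare all
    pairs within each set: a forest of small cliques."""
    pairs = []
    for start in range(0, len(responses), branching):
        siblings = responses[start:start + branching]
        pairs.extend(itertools.combinations(siblings, 2))
    return pairs

if __name__ == "__main__":
    responses = [f"response_{i}" for i in range(9)]
    print(len(chain_comparisons(responses)))  # 8 pairs along a single path
    print(len(tree_comparisons(responses)))   # 9 pairs, clustered by parent
```

Under this kind of reshaping, the abstract's claim is that the tree-structured topology reduces reward uncertainty by up to a $\Theta(\log n/\log\log n)$ factor over baseline topologies, i.e. the gain comes from topology design rather than additional labels.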

Bibliographic Details
Main authors: Qiu, Tianyi; Zeng, Fanzhi; Ji, Jiaming; Yan, Dong; Wang, Kaile; Zhou, Jiayi; Han, Yang; Dai, Josef; Pan, Xuehai; Yang, Yaodong
Format: Article
Language: English
Published: 2024-02-15
Source: arXiv.org
DOI: 10.48550/arxiv.2402.10184
Subjects: Computer Science - Artificial Intelligence; Computer Science - Computation and Language; Computer Science - Discrete Mathematics; Computer Science - Learning
Online access: https://arxiv.org/abs/2402.10184