Reward Generalization in RLHF: A Topological Perspective

Existing alignment methods share a common topology of information flow, where reward information is collected from humans, modeled with preference learning, and used to tune language models. However, this shared topology has not been systematically characterized, nor have its alternatives been thoroughly explored, leaving the problems of low data efficiency and unreliable generalization unaddressed. As a solution, we introduce a theoretical framework for investigating reward generalization in reinforcement learning from human feedback (RLHF), focusing on the topology of information flow at both macro and micro levels. At the macro level, we portray the RLHF information flow as an autoencoding process over behavior distributions, formalizing the RLHF objective of distributional consistency between human preference and model behavior. At the micro level, we present induced Bayesian networks as a theory of reward generalization in RLHF, introducing fine-grained dataset topologies into generalization bounds. Combining analysis on both levels, we propose reward modeling from tree-structured preference information. It is shown to reduce reward uncertainty by up to $\Theta(\log n/\log\log n)$ times compared to baselines, where $n$ is the dataset size. Validation on three NLP tasks shows that our tree-based reward model achieves an average win rate of 65% against baseline methods, thus improving reward generalization for free via topology design.

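To make the topology idea concrete, below is a minimal toy sketch contrasting a chain-shaped comparison graph with a tree-shaped (sibling-grouped) one over the same set of sampled responses. The function names, branching factor, and construction are illustrative assumptions, not the paper's actual dataset-building procedure; the point is only that the shape of the comparison graph changes while the labeling budget stays comparable.

```python
import itertools

# Toy illustration (not the paper's exact construction): two ways to
# arrange pairwise preference comparisons over the same responses.
# In the chain topology each response is compared only to its successor;
# in the tree topology responses are grouped into sibling sets under a
# shared parent and compared within each group, so local comparisons
# are denser at a similar annotation cost.

def chain_comparisons(responses):
    """Compare each response to the next one: a path graph with n-1 edges."""
    return [(responses[i], responses[i + 1]) for i in range(len(responses) - 1)]

def tree_comparisons(responses, branching=3):
    """Group responses into sibling sets of size `branching` and compare all
    pairs within each set: a forest of small cliques."""
    pairs = []
    for start in range(0, len(responses), branching):
        siblings = responses[start:start + branching]
        pairs.extend(itertools.combinations(siblings, 2))
    return pairs

if __name__ == "__main__":
    responses = [f"response_{i}" for i in range(9)]
    print(len(chain_comparisons(responses)))  # 8 pairs along a single path
    print(len(tree_comparisons(responses)))   # 9 pairs, clustered by parent
```

Under this kind of reshaping, the abstract's claim is that the tree-structured topology reduces reward uncertainty by up to a $\Theta(\log n/\log\log n)$ factor over baseline topologies, i.e. the gain comes from topology design rather than additional labels.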

Bibliographic Details
Main authors: Qiu, Tianyi; Zeng, Fanzhi; Ji, Jiaming; Yan, Dong; Wang, Kaile; Zhou, Jiayi; Han, Yang; Dai, Josef; Pan, Xuehai; Yang, Yaodong
Format: Article
Language: English
Published: 2024-02-15
Source: arXiv.org
DOI: 10.48550/arxiv.2402.10184
Subjects: Computer Science - Artificial Intelligence; Computer Science - Computation and Language; Computer Science - Discrete Mathematics; Computer Science - Learning
Online access: https://arxiv.org/abs/2402.10184