Online Self-Preferring Language Models
Aligning with human preference datasets has been critical to the success of large language models (LLMs). Reinforcement learning from human feedback (RLHF) employs a costly reward model to provide feedback for on-policy sampling responses. Recently, offline methods that directly fit responses with b...
Saved in:
Main authors: | Zhai, Yuanzhao; Zhang, Zhuo; Xu, Kele; Peng, Hanyang; Yu, Yue; Feng, Dawei; Yang, Cheng; Ding, Bo; Wang, Huaimin |
---|---|
Format: | Article |
Language: | eng |
Subjects: | Computer Science - Learning |
Online access: | Order full text |
container_end_page | |
---|---|
container_issue | |
container_start_page | |
container_title | |
container_volume | |
creator | Zhai, Yuanzhao; Zhang, Zhuo; Xu, Kele; Peng, Hanyang; Yu, Yue; Feng, Dawei; Yang, Cheng; Ding, Bo; Wang, Huaimin |
description | Aligning with human preference datasets has been critical to the success of
large language models (LLMs). Reinforcement learning from human feedback (RLHF)
employs a costly reward model to provide feedback for on-policy sampling
responses. Recently, offline methods that directly fit responses with binary
preferences in the dataset have emerged as alternatives. However, existing
methods do not explicitly model preference strength information, which is
crucial for distinguishing different response pairs. To overcome this
limitation, we propose Online Self-Preferring (OSP) language models to learn
from self-generated response pairs and self-judged preference strengths. For
each prompt and corresponding self-generated responses, we introduce a ranked
pairing method to construct multiple response pairs with preference strength
information. We then propose the soft-preference cross-entropy loss to leverage
such information. Empirically, we demonstrate that leveraging preference
strength is crucial for avoiding overfitting and enhancing alignment
performance. OSP achieves state-of-the-art alignment performance across various
metrics on two widely used human preference datasets. OSP is
parameter-efficient and more robust than the dominant online method, RLHF,
when limited offline data are available and when generalizing to
out-of-domain tasks. Moreover, OSP language models built on LLMs proficient in
self-preferring can efficiently self-improve without external supervision. |
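The two ingredients named in the abstract, the ranked pairing construction and the soft-preference cross-entropy loss, can be sketched roughly as follows. This is a minimal PyTorch-style illustration, not the paper's implementation: the scalar self-judged scores, the sigmoid mapping from score gaps to soft preference strengths, and the DPO-style implicit preference probability are assumptions made here for concreteness.

```python
import torch
from itertools import combinations


def ranked_pairing(responses, scores):
    """Build ordered response pairs for one prompt from K self-generated
    responses and the model's own scalar judgments of them.

    Hypothetical helper: the paper's ranked pairing may derive preference
    strengths differently.
    """
    pairs = []
    for i, j in combinations(range(len(responses)), 2):
        # Put the higher-scored response first (the "chosen" one).
        if scores[i] < scores[j]:
            i, j = j, i
        # Map the score gap to a soft preference strength in (0.5, 1.0).
        strength = torch.sigmoid(torch.tensor(float(scores[i] - scores[j]))).item()
        pairs.append((responses[i], responses[j], strength))
    return pairs


def soft_preference_ce(logp_chosen, logp_rejected, strength, beta=1.0):
    """Cross-entropy between a soft preference label and the policy's implicit
    preference probability (a DPO-style sigmoid of the log-probability margin;
    this parameterization is an assumption, not the paper's exact loss).
    """
    margin = torch.as_tensor(beta * (logp_chosen - logp_rejected), dtype=torch.float32)
    pred = torch.sigmoid(margin)
    target = torch.as_tensor(strength, dtype=torch.float32)
    return -(target * torch.log(pred) + (1.0 - target) * torch.log(1.0 - pred))


# Example usage with made-up numbers:
pairs = ranked_pairing(["resp_a", "resp_b", "resp_c"], scores=[2.0, 0.5, 1.0])
loss = soft_preference_ce(logp_chosen=-12.3, logp_rejected=-15.8, strength=pairs[0][2])
```

The point of the soft label is that a pair with a small self-judged score gap contributes a weaker training signal than a pair with a large gap, which is the preference-strength information that hard binary labels discard.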
doi_str_mv | 10.48550/arxiv.2405.14103 |
format | Article |
fulltext | fulltext_linktorsrc |
identifier | DOI: 10.48550/arxiv.2405.14103 |
ispartof | |
issn | |
language | eng |
recordid | cdi_arxiv_primary_2405_14103 |
source | arXiv.org |
subjects | Computer Science - Learning |
title | Online Self-Preferring Language Models |
url | https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-08T21%3A52%3A10IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-arxiv_GOX&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Online%20Self-Preferring%20Language%20Models&rft.au=Zhai,%20Yuanzhao&rft.date=2024-05-22&rft_id=info:doi/10.48550/arxiv.2405.14103&rft_dat=%3Carxiv_GOX%3E2405_14103%3C/arxiv_GOX%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_id=info:pmid/&rfr_iscdi=true |