Online Self-Preferring Language Models

Aligning with human preference datasets has been critical to the success of large language models (LLMs). Reinforcement learning from human feedback (RLHF) employs a costly reward model to provide feedback for on-policy sampled responses. Recently, offline methods that directly fit responses with binary preferences in the dataset have emerged as alternatives. However, existing methods do not explicitly model preference strength information, which is crucial for distinguishing different response pairs. To overcome this limitation, we propose Online Self-Preferring (OSP) language models to learn from self-generated response pairs and self-judged preference strengths. For each prompt and its corresponding self-generated responses, we introduce a ranked pairing method to construct multiple response pairs with preference strength information. We then propose the soft-preference cross-entropy loss to leverage such information. Empirically, we demonstrate that leveraging preference strength is crucial for avoiding overfitting and enhancing alignment performance. OSP achieves state-of-the-art alignment performance across various metrics on two widely used human preference datasets. OSP is parameter-efficient and more robust than the dominant online method, RLHF, when limited offline data are available and when generalizing to out-of-domain tasks. Moreover, OSP language models built from LLMs proficient in self-preferring can efficiently self-improve without external supervision.
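
The abstract mentions a soft-preference cross-entropy loss over self-generated response pairs with self-judged preference strengths, but the record does not give its exact form. The sketch below is one plausible reading, assuming a Bradley-Terry-style pairwise preference probability with the preference strength used as a soft label; the function name, the beta temperature, and the toy numbers are illustrative assumptions, not the paper's implementation.

```python
# Illustrative sketch only: assumes a Bradley-Terry-style pairwise probability
# and the self-judged preference strength as a soft target label.
import torch
import torch.nn.functional as F


def soft_preference_cross_entropy(logp_a, logp_b, strength, beta=1.0):
    """Cross-entropy between a soft preference label and the model's implied
    preference probability for response A over response B.

    logp_a, logp_b : summed log-probabilities of two self-generated responses
                     (optionally relative to a reference model).
    strength       : self-judged probability in [0, 1] that A is preferred.
    beta           : temperature on the log-probability margin (assumed).
    """
    # Implied probability that A beats B, Bradley-Terry style.
    pref_prob_a = torch.sigmoid(beta * (logp_a - logp_b))
    # Binary cross-entropy against the soft preference-strength label.
    return F.binary_cross_entropy(pref_prob_a, strength)


# Toy usage with made-up numbers: strength 0.8 means A is judged clearly, but
# not certainly, better than B; 0.5 would mean no preference either way.
logp_a = torch.tensor([-12.3, -20.1])
logp_b = torch.tensor([-14.0, -19.5])
strength = torch.tensor([0.8, 0.4])
print(soft_preference_cross_entropy(logp_a, logp_b, strength).item())
```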

Bibliographic details
Main authors: Zhai, Yuanzhao; Zhang, Zhuo; Xu, Kele; Peng, Hanyang; Yu, Yue; Feng, Dawei; Yang, Cheng; Ding, Bo; Wang, Huaimin
Format: Article
Language: English
Subjects: Computer Science - Learning
Online access: Order full text
creator Zhai, Yuanzhao; Zhang, Zhuo; Xu, Kele; Peng, Hanyang; Yu, Yue; Feng, Dawei; Yang, Cheng; Ding, Bo; Wang, Huaimin
description Aligning with human preference datasets has been critical to the success of large language models (LLMs). Reinforcement learning from human feedback (RLHF) employs a costly reward model to provide feedback for on-policy sampled responses. Recently, offline methods that directly fit responses with binary preferences in the dataset have emerged as alternatives. However, existing methods do not explicitly model preference strength information, which is crucial for distinguishing different response pairs. To overcome this limitation, we propose Online Self-Preferring (OSP) language models to learn from self-generated response pairs and self-judged preference strengths. For each prompt and its corresponding self-generated responses, we introduce a ranked pairing method to construct multiple response pairs with preference strength information. We then propose the soft-preference cross-entropy loss to leverage such information. Empirically, we demonstrate that leveraging preference strength is crucial for avoiding overfitting and enhancing alignment performance. OSP achieves state-of-the-art alignment performance across various metrics on two widely used human preference datasets. OSP is parameter-efficient and more robust than the dominant online method, RLHF, when limited offline data are available and when generalizing to out-of-domain tasks. Moreover, OSP language models built from LLMs proficient in self-preferring can efficiently self-improve without external supervision.
doi_str_mv 10.48550/arxiv.2405.14103
format Article
fulltext fulltext_linktorsrc
identifier DOI: 10.48550/arxiv.2405.14103
language eng
recordid cdi_arxiv_primary_2405_14103
source arXiv.org
subjects Computer Science - Learning
title Online Self-Preferring Language Models
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-08T21%3A52%3A10IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-arxiv_GOX&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Online%20Self-Preferring%20Language%20Models&rft.au=Zhai,%20Yuanzhao&rft.date=2024-05-22&rft_id=info:doi/10.48550/arxiv.2405.14103&rft_dat=%3Carxiv_GOX%3E2405_14103%3C/arxiv_GOX%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_id=info:pmid/&rfr_iscdi=true