Online Self-Preferring Language Models
Aligning with human preference datasets has been critical to the success of large language models (LLMs). Reinforcement learning from human feedback (RLHF) employs a costly reward model to provide feedback for on-policy sampling responses. Recently, offline methods that directly fit responses with b...
Saved in:
Main authors: | Zhai, Yuanzhao; Zhang, Zhuo; Xu, Kele; Peng, Hanyang; Yu, Yue; Feng, Dawei; Yang, Cheng; Ding, Bo; Wang, Huaimin |
---|---|
Format: | Article |
Language: | eng |
Subjects: | Computer Science - Learning |
Online access: | Order full text |
container_end_page | |
---|---|
container_issue | |
container_start_page | |
container_title | |
container_volume | |
creator | Zhai, Yuanzhao; Zhang, Zhuo; Xu, Kele; Peng, Hanyang; Yu, Yue; Feng, Dawei; Yang, Cheng; Ding, Bo; Wang, Huaimin |
description | Aligning with human preference datasets has been critical to the success of
large language models (LLMs). Reinforcement learning from human feedback (RLHF)
employs a costly reward model to provide feedback for on-policy sampling
responses. Recently, offline methods that directly fit responses with binary
preferences in the dataset have emerged as alternatives. However, existing
methods do not explicitly model preference strength information, which is
crucial for distinguishing different response pairs. To overcome this
limitation, we propose Online Self-Preferring (OSP) language models to learn
from self-generated response pairs and self-judged preference strengths. For
each prompt and corresponding self-generated responses, we introduce a ranked
pairing method to construct multiple response pairs with preference strength
information. We then propose the soft-preference cross-entropy loss to leverage
such information. Empirically, we demonstrate that leveraging preference
strength is crucial for avoiding overfitting and enhancing alignment
performance. OSP achieves state-of-the-art alignment performance across various
metrics on two widely used human preference datasets. OSP is
parameter-efficient and more robust than the dominant online method, RLHF,
when limited offline data are available and when generalizing to
out-of-domain tasks. Moreover, OSP language models built on LLMs proficient in
self-preferring can efficiently self-improve without external supervision. |
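The two ingredients named in the abstract, the ranked pairing construction and the soft-preference cross-entropy loss, can be sketched roughly as follows. This is a minimal PyTorch-style illustration, not the paper's implementation: the scalar self-judged scores, the sigmoid mapping from score gaps to soft preference strengths, and the DPO-style implicit preference probability are assumptions made here for concreteness.

```python
import torch
from itertools import combinations


def ranked_pairing(responses, scores):
    """Build ordered response pairs for one prompt from K self-generated
    responses and the model's own scalar judgments of them.

    Hypothetical helper: the paper's ranked pairing may derive preference
    strengths differently.
    """
    pairs = []
    for i, j in combinations(range(len(responses)), 2):
        # Put the higher-scored response first (the "chosen" one).
        if scores[i] < scores[j]:
            i, j = j, i
        # Map the score gap to a soft preference strength in (0.5, 1.0).
        strength = torch.sigmoid(torch.tensor(float(scores[i] - scores[j]))).item()
        pairs.append((responses[i], responses[j], strength))
    return pairs


def soft_preference_ce(logp_chosen, logp_rejected, strength, beta=1.0):
    """Cross-entropy between a soft preference label and the policy's implicit
    preference probability (a DPO-style sigmoid of the log-probability margin;
    this parameterization is an assumption, not the paper's exact loss).
    """
    margin = torch.as_tensor(beta * (logp_chosen - logp_rejected), dtype=torch.float32)
    pred = torch.sigmoid(margin)
    target = torch.as_tensor(strength, dtype=torch.float32)
    return -(target * torch.log(pred) + (1.0 - target) * torch.log(1.0 - pred))


# Example usage with made-up numbers:
pairs = ranked_pairing(["resp_a", "resp_b", "resp_c"], scores=[2.0, 0.5, 1.0])
loss = soft_preference_ce(logp_chosen=-12.3, logp_rejected=-15.8, strength=pairs[0][2])
```

The point of the soft label is that a pair with a small self-judged score gap contributes a weaker training signal than a pair with a large gap, which is the preference-strength information that hard binary labels discard.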
doi_str_mv | 10.48550/arxiv.2405.14103 |
format | Article |
fulltext | fulltext_linktorsrc |
identifier | DOI: 10.48550/arxiv.2405.14103 |
ispartof | |
issn | |
language | eng |
recordid | cdi_arxiv_primary_2405_14103 |
source | arXiv.org |
subjects | Computer Science - Learning |
title | Online Self-Preferring Language Models |
url | https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-08T21%3A52%3A10IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-arxiv_GOX&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Online%20Self-Preferring%20Language%20Models&rft.au=Zhai,%20Yuanzhao&rft.date=2024-05-22&rft_id=info:doi/10.48550/arxiv.2405.14103&rft_dat=%3Carxiv_GOX%3E2405_14103%3C/arxiv_GOX%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_id=info:pmid/&rfr_iscdi=true |