Leveraging Domain Knowledge for Efficient Reward Modelling in RLHF: A Case-Study in E-Commerce Opinion Summarization
Published in: arXiv.org, 2024-04
Format: Article
Language: English
Online access: Full text
Abstract: Reinforcement Learning from Human Feedback (RLHF) has become the dominant strategy for aligning Language Models (LMs) with human values and goals. The key to the strategy is learning a reward model (\(\varphi\)) that reflects the latent reward function of humans. While this strategy has proven effective, the training methodology requires a large amount of human preference annotation (usually on the order of tens of thousands of examples) to train \(\varphi\). Such large-scale annotation is justifiable when it is a one-time effort and the resulting reward model is universally applicable. However, human goals are subjective and task-dependent, requiring task-specific preference annotations, which can be impractical to collect. To address this challenge, we propose a novel approach that infuses domain knowledge into \(\varphi\), which reduces the amount of preference annotation required (by \(21\times\)), avoids the Alignment Tax, and provides some interpretability. We validate our approach on E-Commerce Opinion Summarization, with a significant reduction in dataset size (to just \(940\) samples) while advancing the SOTA (\(\sim4\)-point ROUGE-L improvement; summaries preferred by humans over the SOTA \(68\%\) of the time). Our contributions include a novel Reward Modeling technique and two new datasets: PromptOpinSumm (supervised data for Opinion Summarization) and OpinPref (a gold-standard human preference dataset). The proposed methodology opens up avenues for efficient RLHF, making it more adaptable to applications with varying human values. We release the artifacts (Code: github.com/efficient-rlhf. PromptOpinSumm: hf.co/prompt-opin-summ. OpinPref: hf.co/opin-pref) for usage under the MIT License.
ISSN: 2331-8422
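
For context, reward models in RLHF are typically trained on pairwise human preference annotations with a Bradley-Terry style loss; this is the annotation cost the paper aims to reduce. The sketch below illustrates only that standard pairwise objective, not the paper's domain-knowledge-infused reward model; the function and tensor names are hypothetical.

```python
import torch
import torch.nn.functional as F

def pairwise_reward_loss(reward_chosen: torch.Tensor,
                         reward_rejected: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry style loss for a reward model phi:
    push the reward of the human-preferred summary above
    the reward of the rejected one."""
    # -log sigmoid(phi(preferred) - phi(rejected)), averaged over the batch
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# Toy usage with hypothetical scalar rewards produced by a reward model phi
r_chosen = torch.tensor([1.2, 0.4, 0.9])
r_rejected = torch.tensor([0.3, 0.5, -0.1])
print(pairwise_reward_loss(r_chosen, r_rejected))  # prints a scalar loss tensor
```

In a standard RLHF pipeline, tens of thousands of such (preferred, rejected) pairs are needed to fit \(\varphi\); the paper's contribution is to cut that requirement substantially by injecting domain knowledge into the reward model.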