Online Bandit Learning with Offline Preference Data
Saved in:
Main Authors: , , ,
Format: Article
Language: eng
Subjects:
Online Access: Order full text
Abstract: Reinforcement Learning with Human Feedback (RLHF) is at the core of
fine-tuning methods for generative AI models for language and images. Such
feedback is often sought as rank or preference feedback from human raters, as
opposed to eliciting scores, since the latter tends to be noisy. On the other
hand, RL theory and algorithms predominantly assume that reward feedback is
available. In particular, approaches for online learning that can be helpful in
adaptive data collection via active learning cannot incorporate offline
preference data. In this paper, we adopt a finite-armed linear bandit model as
a prototypical model of online learning. We assume that an offline preference
dataset, generated by an expert of unknown 'competence', is available. We
propose $\texttt{warmPref-PS}$, a posterior sampling algorithm for online
learning that can be warm-started with an offline dataset with noisy preference
feedback. We show that by modeling the 'competence' of the expert who generated
the dataset, we can use it most effectively. We support our claims with a novel
theoretical analysis of its Bayesian regret, as well as an extensive empirical
evaluation of an approximate loss function that optimizes for infinitely many
arms and performs substantially better ($25$ to $50\%$ regret reduction) than
baselines.
DOI: 10.48550/arxiv.2406.09574
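The abstract describes warm-starting posterior sampling for a finite-armed linear bandit with offline preference data from an expert of unknown 'competence'. The sketch below is a minimal illustration of that general idea under simplifying assumptions, not the paper's $\texttt{warmPref-PS}$ algorithm or its posterior: it models the expert with a Bradley-Terry likelihood whose rationality parameter `beta_expert` (an illustrative stand-in for 'competence', assumed known here, whereas the paper treats competence as unknown) shapes a Laplace-approximate posterior fit to the offline preferences, and then runs standard Thompson sampling seeded with that approximate posterior. All names and modeling choices are assumptions made for the example.

```python
# Hedged sketch: warm-started Thompson sampling for a linear bandit,
# NOT the paper's warmPref-PS algorithm or its exact posterior.
import numpy as np

rng = np.random.default_rng(0)

# Problem setup: K arms with d-dimensional features and linear rewards.
K, d = 20, 5
X = rng.normal(size=(K, d))          # arm feature vectors
theta_true = rng.normal(size=d)      # unknown reward parameter
noise_sd = 0.5

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Offline preference data: the expert prefers arm i over arm j with
# probability sigmoid(beta_expert * (x_i - x_j) @ theta). A larger
# beta_expert models a more "competent" (less noisy) expert.
beta_expert = 2.0
pairs = rng.integers(0, K, size=(200, 2))
pairs = pairs[pairs[:, 0] != pairs[:, 1]]
diffs = X[pairs[:, 0]] - X[pairs[:, 1]]
prefs = (rng.random(len(diffs)) < sigmoid(beta_expert * diffs @ theta_true)).astype(float)

# Warm start: Laplace approximation to the posterior over theta under a
# standard normal prior and a Bradley-Terry likelihood, via Newton's method.
theta = np.zeros(d)
for _ in range(25):
    p = sigmoid(beta_expert * diffs @ theta)
    grad = beta_expert * diffs.T @ (prefs - p) - theta
    hess = np.eye(d) + beta_expert**2 * diffs.T @ (diffs * (p * (1 - p))[:, None])
    theta += np.linalg.solve(hess, grad)
p = sigmoid(beta_expert * diffs @ theta)
precision = np.eye(d) + beta_expert**2 * diffs.T @ (diffs * (p * (1 - p))[:, None])
b = precision @ theta                # precision-weighted mean of the warm-start prior

# Online phase: Thompson sampling with conjugate Gaussian updates,
# starting from the warm-start (mean, precision) instead of a flat prior.
T = 500
cum_regret = 0.0
best = (X @ theta_true).max()
for t in range(T):
    cov = np.linalg.inv(precision)
    theta_sample = rng.multivariate_normal(cov @ b, cov)   # sample from posterior
    arm = int(np.argmax(X @ theta_sample))                 # act greedily on the sample
    reward = X[arm] @ theta_true + noise_sd * rng.normal()
    cum_regret += best - X[arm] @ theta_true
    precision += np.outer(X[arm], X[arm]) / noise_sd**2    # Bayesian linear-regression
    b += X[arm] * reward / noise_sd**2                     # update with known noise sd

print(f"cumulative regret after {T} rounds: {cum_regret:.2f}")
```

Seeding the online posterior with the offline Laplace approximation is what makes the warm start matter in this sketch: the larger the assumed `beta_expert`, the more the initial posterior concentrates around the parameter implied by the expert's preferences, and the less exploration the online phase spends rediscovering it.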