Training Language Models to Self-Correct via Reinforcement Learning

Saved in:

| Main authors: | , , , , , , , , , , , , , , , , , |
|---|---|
| Format: | Article |
| Language: | eng |
| Keywords: | |
| Online access: | Order full text |

Abstract: Self-correction is a highly desirable capability of large language models (LLMs), yet it has consistently been found to be largely ineffective in modern LLMs. Current methods for training self-correction typically depend on either multiple models, a more advanced model, or additional forms of supervision. To address these shortcomings, we develop a multi-turn online reinforcement learning (RL) approach, SCoRe, that significantly improves an LLM's self-correction ability using entirely self-generated data. To build SCoRe, we first show that variants of supervised fine-tuning (SFT) on offline model-generated correction traces are often insufficient for instilling self-correction behavior. In particular, we observe that training via SFT falls prey to either a distribution mismatch between the mistakes made by the data-collection policy and the model's own responses, or to behavior collapse, where learning implicitly prefers only a certain mode of correction behavior that is often not effective at self-correction on test problems. SCoRe addresses these challenges by training under the model's own distribution of self-generated correction traces and using appropriate regularization to steer the learning process toward a self-correction behavior that is effective at test time, rather than simply fitting high-reward responses for a given prompt. This regularization consists of an initial phase of multi-turn RL on a base model to produce a policy initialization that is less susceptible to collapse, followed by a reward bonus that amplifies self-correction during training. With the Gemini 1.0 Pro and 1.5 Flash models, we find that SCoRe achieves state-of-the-art self-correction performance, improving the base models' self-correction by 15.6% and 9.1% respectively on MATH and HumanEval.

DOI: 10.48550/arxiv.2409.12917
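
The abstract mentions a reward bonus used to amplify self-correction during multi-turn RL. Below is a minimal, illustrative sketch of that kind of reward shaping for a two-attempt trace; the function name, the 0/1 correctness rewards, and the bonus weight `alpha` are assumptions made for illustration, not the paper's exact formulation.

```python
# Hedged sketch: shaped reward for the second attempt of a two-turn
# self-correction trace. The bonus term rewards flipping a wrong first
# attempt into a correct second attempt and penalizes breaking an already
# correct answer, pushing the policy toward genuine self-correction rather
# than repeating a high-reward first response. The coefficient `alpha` and
# the 0/1 task rewards are illustrative assumptions.

def shaped_second_turn_reward(r_first: float, r_second: float, alpha: float = 1.0) -> float:
    """Return the training reward assigned to the second attempt.

    r_first, r_second: base task rewards for the two attempts
        (e.g. 1.0 if the attempt passes a correctness check, 0.0 otherwise).
    alpha: weight of the self-correction (progress) bonus.
    """
    progress_bonus = alpha * (r_second - r_first)
    return r_second + progress_bonus


if __name__ == "__main__":
    print(shaped_second_turn_reward(0.0, 1.0))  # wrong -> right: 2.0 (amplified)
    print(shaped_second_turn_reward(1.0, 0.0))  # right -> wrong: -1.0 (penalized)
    print(shaped_second_turn_reward(1.0, 1.0))  # right -> right: 1.0 (plain task reward)
```

Per the abstract, this bonus is applied after an initial multi-turn RL phase that produces a collapse-resistant policy initialization; the sketch above covers only the bonus term, not that first stage or the underlying RL optimizer.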