RL on Incorrect Synthetic Data Scales the Efficiency of LLM Math Reasoning by Eight-Fold
Format: Article
Language: English
Abstract: Training on model-generated synthetic data is a promising approach for fine-tuning LLMs, but it remains unclear when it helps or hurts. In this paper, we investigate this question for math reasoning via an empirical study, followed by building a conceptual understanding of our observations. First, we find that while the typical approach of fine-tuning a model on synthetic correct or positive problem-solution pairs generated by capable models offers modest performance gains, sampling more correct solutions from the fine-tuned learner itself and then fine-tuning on this self-generated data $\textbf{doubles}$ the efficiency of the same synthetic problems. At the same time, training on model-generated positives can amplify various spurious correlations, resulting in flat or even inverse scaling trends as the amount of data increases. Surprisingly, we find that several of these issues can be addressed if we also utilize negative responses, i.e., model-generated responses that are deemed incorrect by a final-answer verifier. Crucially, these negatives must be constructed such that training can appropriately recover the utility or advantage of each intermediate step in the negative response. With this per-step scheme, we attain consistent gains over training on positive data alone, with performance similar to amplifying the amount of synthetic data by $\mathbf{8\times}$. We show that training on per-step negatives can help unlearn spurious correlations in the positive data and is equivalent to advantage-weighted reinforcement learning (RL), implying that it inherits the robustness benefits of RL over imitating positive data alone.
DOI: 10.48550/arxiv.2406.14532
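As an illustration of the per-step idea described in the abstract, the sketch below is a minimal, hypothetical example (not the authors' implementation; all names and numbers are placeholders). It assumes each reasoning step's advantage has already been estimated, e.g., as the change in the empirical rate of reaching the correct final answer when rolling out from the prefix that includes the step versus the prefix without it, and shows how exponentiated per-step advantages can weight a step-level log-likelihood loss. Weighting steps this way, rather than imitating only verified-correct responses, is the sense in which such training coincides with advantage-weighted RL.

```python
# Minimal sketch of per-step advantage-weighted training (illustration only,
# not the paper's code). Hypothetical inputs: `step_logprobs` are the summed
# token log-probabilities of each reasoning step under the current model, and
# the advantages come from Monte-Carlo rollouts as described below.

import torch


def per_step_advantage(success_rate_with_step: float,
                       success_rate_without_step: float) -> float:
    """Toy stand-in for a per-step advantage estimate: the change in the
    empirical rate of reaching the correct final answer when rollouts are
    continued from the prefix that includes this step versus the prefix
    that stops just before it."""
    return success_rate_with_step - success_rate_without_step


def advantage_weighted_loss(step_logprobs: torch.Tensor,
                            advantages: torch.Tensor,
                            beta: float = 1.0) -> torch.Tensor:
    """AWR-style objective: each step's negative log-likelihood is weighted by
    exp(advantage / beta), so helpful steps are imitated strongly and steps
    with low or negative advantage contribute little. Replacing the
    exponentiated weights with the raw signed advantages gives a
    policy-gradient-style update that actively pushes down harmful steps."""
    weights = torch.exp(advantages / beta)
    weights = weights / weights.sum()            # normalize for stability
    return -(weights.detach() * step_logprobs).sum()


# Toy usage: a 3-step solution sampled from the model; the middle step is a
# spurious shortcut (its rollouts succeed less often than rollouts without it).
step_logprobs = torch.tensor([-2.1, -3.4, -1.7], requires_grad=True)
advantages = torch.tensor([per_step_advantage(0.6, 0.3),   # helpful step
                           per_step_advantage(0.1, 0.4),   # harmful step
                           per_step_advantage(0.7, 0.5)])  # mildly helpful

loss = advantage_weighted_loss(step_logprobs, advantages)
loss.backward()
print(loss.item(), step_logprobs.grad)
```

In this toy run, the middle step receives the smallest weight, so its likelihood is reinforced far less than that of the helpful steps; with signed advantage weights it would be explicitly pushed down, which reflects the abstract's point about unlearning spurious correlations that positive-only training would imitate.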