Self-Training with Direct Preference Optimization Improves Chain-of-Thought Reasoning
Format: Article
Language: English
Abstract: Effective training of language models (LMs) for mathematical reasoning tasks
demands high-quality supervised fine-tuning data. Besides obtaining annotations
from human experts, a common alternative is sampling from larger and more
powerful LMs. However, this knowledge distillation approach can be costly and
unstable, particularly when relying on closed-source, proprietary LMs like
GPT-4, whose behaviors are often unpredictable. In this work, we demonstrate
that the reasoning abilities of small-scale LMs can be enhanced through
self-training, a process where models learn from their own outputs. We also
show that conventional self-training can be further augmented by a
preference learning algorithm called Direct Preference Optimization (DPO). By
integrating DPO into self-training, we leverage preference data to guide LMs
towards more accurate and diverse chain-of-thought reasoning. We evaluate our
method across various mathematical reasoning tasks using different base models.
Our experiments show that this approach not only improves LMs' reasoning
performance but also offers a more cost-effective and scalable solution
compared to relying on large proprietary LMs.
DOI: 10.48550/arxiv.2407.18248
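The abstract describes augmenting self-training with Direct Preference Optimization (DPO) so that preference data steers the model toward more accurate chain-of-thought reasoning. As background, below is a minimal sketch of the standard DPO objective (Rafailov et al., 2023), assuming preference pairs are built from the model's own sampled solutions, for example by whether the final answer is correct; the names `dpo_loss`, `beta`, and the `*_logps` tensors are illustrative, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Standard DPO loss over a batch of preference pairs.

    Each argument is a 1-D tensor holding the summed token log-probabilities
    of a chosen (preferred) or rejected chain-of-thought solution under the
    trainable policy or the frozen reference model.
    """
    # Implicit reward of each response: log-ratio of policy to reference.
    chosen_rewards = policy_chosen_logps - ref_chosen_logps
    rejected_rewards = policy_rejected_logps - ref_rejected_logps
    # Push the chosen response's implicit reward above the rejected one's.
    margin = beta * (chosen_rewards - rejected_rewards)
    return -F.logsigmoid(margin).mean()
```

In the self-training setting the abstract describes, such pairs would come from sampling multiple reasoning chains per problem from the model itself and labelling them by answer correctness, so no external teacher model is required.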