Balancing Cost and Effectiveness of Synthetic Data Generation Strategies for LLMs
Saved in:
Main authors: | , , , , , , |
---|---|
Format: | Article |
Language: | eng |
Keywords: | |
Online access: | Order full text |
Abstract: | As large language models (LLMs) are applied to more use cases, creating high
quality, task-specific datasets for fine-tuning becomes a bottleneck for model
improvement. Using high quality human data has been the most common approach to
unlock model performance, but is prohibitively expensive in many scenarios.
Several alternative methods have also emerged, such as generating synthetic or
hybrid data, but the effectiveness of these approaches remains unclear,
especially in resource-constrained scenarios and tasks that are not easily
verified. To investigate this, we group various synthetic data generation
strategies into three representative categories -- Answer Augmentation,
Question Rephrase and New Question -- and study the performance of student LLMs
trained under various constraints, namely seed instruction set size and query
budget. We demonstrate that these strategies are not equally effective across
settings. Notably, the optimal data generation strategy depends strongly on the
ratio between the available teacher query budget and the size of the seed
instruction set. When this ratio is low, generating new answers to existing
questions proves most effective, but as this ratio increases, generating new
questions becomes optimal. Across all tasks, we find that the choice of
augmentation method and other design choices matter substantially more in low
to mid data regimes than in high data regimes. We provide a practical framework
for selecting the appropriate augmentation method across settings, taking into
account additional factors such as the scalability of each method, the
importance of verifying synthetic data, and the use of different LLMs for
synthetic data generation. |
---|---|
DOI: | 10.48550/arxiv.2409.19759 |
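The abstract's central finding is a decision rule: pick the data generation strategy based on the ratio of teacher query budget to seed instruction set size. A minimal sketch of that rule is below; the `ratio_threshold` value is a hypothetical placeholder, since the abstract reports only that the optimum shifts as the ratio grows, not a specific crossover point.

```python
def choose_strategy(query_budget: int, seed_set_size: int,
                    ratio_threshold: float = 4.0) -> str:
    """Select one of the paper's synthetic data generation categories.

    ratio_threshold is an illustrative assumption, not a value from the
    paper. The third category, Question Rephrase, is not placed in this
    rule because the abstract does not position it on the ratio axis.
    """
    ratio = query_budget / seed_set_size
    if ratio < ratio_threshold:
        # Low ratio: generate new answers to existing questions.
        return "Answer Augmentation"
    # High ratio: generate entirely new questions.
    return "New Question"
```

For example, with a budget of 500 teacher queries over 1,000 seed instructions (ratio 0.5), the rule favors Answer Augmentation; with 50,000 queries over the same seed set (ratio 50), it favors New Question.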