Balancing Cost and Effectiveness of Synthetic Data Generation Strategies for LLMs

Bibliographic details
Main authors: Chan, Yung-Chieh; Pu, George; Shanker, Apaar; Suresh, Parth; Jenks, Penn; Heyer, John; Denton, Sam
Format: Article
Language: English (eng)
Subjects: Computer Science - Computation and Language; Computer Science - Learning
Online access: Order full text
creator Chan, Yung-Chieh
Pu, George
Shanker, Apaar
Suresh, Parth
Jenks, Penn
Heyer, John
Denton, Sam
description As large language models (LLMs) are applied to more use cases, creating high-quality, task-specific datasets for fine-tuning becomes a bottleneck for model improvement. Using high-quality human data has been the most common approach to unlock model performance, but is prohibitively expensive in many scenarios. Several alternative methods have also emerged, such as generating synthetic or hybrid data, but the effectiveness of these approaches remains unclear, especially in resource-constrained scenarios and tasks that are not easily verified. To investigate this, we group various synthetic data generation strategies into three representative categories -- Answer Augmentation, Question Rephrase and New Question -- and study the performance of student LLMs trained under various constraints, namely seed instruction set size and query budget. We demonstrate that these strategies are not equally effective across settings. Notably, the optimal data generation strategy depends strongly on the ratio between the available teacher query budget and the size of the seed instruction set. When this ratio is low, generating new answers to existing questions proves most effective, but as this ratio increases, generating new questions becomes optimal. Across all tasks, we find that the choice of augmentation method and other design choices matter substantially more in low- to mid-data regimes than in high-data regimes. We provide a practical framework for selecting the appropriate augmentation method across settings, taking into account additional factors such as the scalability of each method, the importance of verifying synthetic data, and the use of different LLMs for synthetic data generation.
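
The abstract's central heuristic -- choose the augmentation strategy from the ratio of teacher query budget to seed instruction set size -- can be illustrated with a short sketch. The paper does not publish code; the Python below is a minimal illustration under stated assumptions: the ratio thresholds, the prompt templates, the placement of Question Rephrase in the middle regime, and the query_teacher callable are all hypothetical, not taken from the paper.

# Minimal sketch, not the authors' implementation. The paper only reports that
# low budget-to-seed ratios favor Answer Augmentation and high ratios favor
# New Question generation; the cutoffs and prompts below are assumptions.

def choose_strategy(query_budget: int, seed_size: int) -> str:
    """Pick a strategy category from the budget-to-seed ratio."""
    ratio = query_budget / max(seed_size, 1)
    if ratio <= 1:   # roughly one teacher call per seed question (assumed cutoff)
        return "answer_augmentation"
    if ratio <= 4:   # assumed mid regime; the paper does not give this threshold
        return "question_rephrase"
    return "new_question"

# Hypothetical prompt templates for the three strategy categories.
PROMPTS = {
    "answer_augmentation": "Write a new, detailed answer to this question:\n{question}",
    "question_rephrase": "Rephrase this question without changing its meaning:\n{question}",
    "new_question": "Write a new question in the same style and domain as:\n{question}",
}

def generate_synthetic_data(seed_questions, query_budget, query_teacher):
    """Spend the teacher query budget using the selected strategy.

    query_teacher is a placeholder for whatever teacher-LLM API is used;
    it takes a prompt string and returns the teacher's completion.
    """
    strategy = choose_strategy(query_budget, len(seed_questions))
    outputs = []
    for i in range(query_budget):
        seed = seed_questions[i % len(seed_questions)]
        outputs.append(query_teacher(PROMPTS[strategy].format(question=seed)))
    return strategy, outputs
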
doi_str_mv 10.48550/arxiv.2409.19759
format Article
fulltext fulltext_linktorsrc
identifier DOI: 10.48550/arxiv.2409.19759
language eng
recordid cdi_arxiv_primary_2409_19759
source arXiv.org
subjects Computer Science - Computation and Language
Computer Science - Learning
title Balancing Cost and Effectiveness of Synthetic Data Generation Strategies for LLMs
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-31T17%3A00%3A32IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-arxiv_GOX&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Balancing%20Cost%20and%20Effectiveness%20of%20Synthetic%20Data%20Generation%20Strategies%20for%20LLMs&rft.au=Chan,%20Yung-Chieh&rft.date=2024-09-29&rft_id=info:doi/10.48550/arxiv.2409.19759&rft_dat=%3Carxiv_GOX%3E2409_19759%3C/arxiv_GOX%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_id=info:pmid/&rfr_iscdi=true