Balancing Cost and Effectiveness of Synthetic Data Generation Strategies for LLMs

Bibliographic details
Main authors: Chan, Yung-Chieh; Pu, George; Shanker, Apaar; Suresh, Parth; Jenks, Penn; Heyer, John; Denton, Sam
Format: Article
Language: English (eng)
Subjects: Computer Science - Computation and Language; Computer Science - Learning
Online access: Order full text
creator Chan, Yung-Chieh
Pu, George
Shanker, Apaar
Suresh, Parth
Jenks, Penn
Heyer, John
Denton, Sam
description As large language models (LLMs) are applied to more use cases, creating high-quality, task-specific datasets for fine-tuning becomes a bottleneck for model improvement. Using high-quality human data has been the most common approach to unlock model performance, but is prohibitively expensive in many scenarios. Several alternative methods have also emerged, such as generating synthetic or hybrid data, but the effectiveness of these approaches remains unclear, especially in resource-constrained scenarios and tasks that are not easily verified. To investigate this, we group various synthetic data generation strategies into three representative categories -- Answer Augmentation, Question Rephrase and New Question -- and study the performance of student LLMs trained under various constraints, namely seed instruction set size and query budget. We demonstrate that these strategies are not equally effective across settings. Notably, the optimal data generation strategy depends strongly on the ratio between the available teacher query budget and the size of the seed instruction set. When this ratio is low, generating new answers to existing questions proves most effective, but as this ratio increases, generating new questions becomes optimal. Across all tasks, we find that the choice of augmentation method and other design choices matter substantially more in low- to mid-data regimes than in high-data regimes. We provide a practical framework for selecting the appropriate augmentation method across settings, taking into account additional factors such as the scalability of each method, the importance of verifying synthetic data, and the use of different LLMs for synthetic data generation.
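
The abstract's central heuristic -- choose the augmentation strategy from the ratio of teacher query budget to seed instruction set size -- can be illustrated with a short sketch. The paper does not publish code; the Python below is a minimal illustration under stated assumptions: the ratio thresholds, the prompt templates, the placement of Question Rephrase in the middle regime, and the query_teacher callable are all hypothetical, not taken from the paper.

# Minimal sketch, not the authors' implementation. The paper only reports that
# low budget-to-seed ratios favor Answer Augmentation and high ratios favor
# New Question generation; the cutoffs and prompts below are assumptions.

def choose_strategy(query_budget: int, seed_size: int) -> str:
    """Pick a strategy category from the budget-to-seed ratio."""
    ratio = query_budget / max(seed_size, 1)
    if ratio <= 1:   # roughly one teacher call per seed question (assumed cutoff)
        return "answer_augmentation"
    if ratio <= 4:   # assumed mid regime; the paper does not give this threshold
        return "question_rephrase"
    return "new_question"

# Hypothetical prompt templates for the three strategy categories.
PROMPTS = {
    "answer_augmentation": "Write a new, detailed answer to this question:\n{question}",
    "question_rephrase": "Rephrase this question without changing its meaning:\n{question}",
    "new_question": "Write a new question in the same style and domain as:\n{question}",
}

def generate_synthetic_data(seed_questions, query_budget, query_teacher):
    """Spend the teacher query budget using the selected strategy.

    query_teacher is a placeholder for whatever teacher-LLM API is used;
    it takes a prompt string and returns the teacher's completion.
    """
    strategy = choose_strategy(query_budget, len(seed_questions))
    outputs = []
    for i in range(query_budget):
        seed = seed_questions[i % len(seed_questions)]
        outputs.append(query_teacher(PROMPTS[strategy].format(question=seed)))
    return strategy, outputs
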
doi_str_mv 10.48550/arxiv.2409.19759
format Article
fulltext fulltext_linktorsrc
identifier DOI: 10.48550/arxiv.2409.19759
language eng
recordid cdi_arxiv_primary_2409_19759
source arXiv.org
subjects Computer Science - Computation and Language
Computer Science - Learning
title Balancing Cost and Effectiveness of Synthetic Data Generation Strategies for LLMs
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-31T17%3A00%3A32IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-arxiv_GOX&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Balancing%20Cost%20and%20Effectiveness%20of%20Synthetic%20Data%20Generation%20Strategies%20for%20LLMs&rft.au=Chan,%20Yung-Chieh&rft.date=2024-09-29&rft_id=info:doi/10.48550/arxiv.2409.19759&rft_dat=%3Carxiv_GOX%3E2409_19759%3C/arxiv_GOX%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_id=info:pmid/&rfr_iscdi=true