Benchmarking Large Language Model Capabilities for Conditional Generation

Pre-trained large language models (PLMs) underlie most new developments in natural language processing. They have shifted the field from application-specific model pipelines to a single model that is adapted to a wide range of tasks. Autoregressive PLMs like GPT-3 or PaLM, alongside techniques like few-shot learning, have additionally shifted the output modality to generation instead of classification or regression. Despite their ubiquitous use, the generation quality of language models is rarely evaluated when these models are introduced. Additionally, it is unclear how existing generation tasks, while they can be used to compare systems at a high level, relate to the real-world use cases for which people have been adopting them. In this work, we discuss how to adapt existing application-specific generation benchmarks to PLMs and provide an in-depth, empirical study of the limitations and capabilities of PLMs in natural language generation tasks along dimensions such as scale, architecture, and input and output language. Our results show that PLMs differ in their applicability to different data regimes and in their generalization to multiple languages, and they inform which PLMs to use for a given generation task setup. We share best practices to be taken into consideration when benchmarking generation capabilities during the development of upcoming PLMs.
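The abstract's point that few-shot learning casts tasks as generation can be illustrated with a minimal sketch, not taken from the paper: a handful of labelled input-output pairs are concatenated into a prompt, and an autoregressive PLM completes the final, unlabelled input. The example documents, summaries, and the commented-out `plm.generate` call below are hypothetical placeholders.

```python
# Minimal sketch (illustrative only): framing a conditional generation task
# (here, single-document summarization) as few-shot text completion.

FEW_SHOT_EXAMPLES = [
    ("The council met on Tuesday and approved the new cycling lanes after a two-hour debate.",
     "Council approves new cycling lanes."),
    ("Researchers report that the vaccine showed 90% efficacy in a late-stage trial.",
     "Vaccine shows 90% efficacy in late-stage trial."),
]

def build_prompt(document: str) -> str:
    """Concatenate the labelled examples, then append the unlabelled input."""
    parts = []
    for doc, summary in FEW_SHOT_EXAMPLES:
        parts.append(f"Document: {doc}\nSummary: {summary}\n")
    parts.append(f"Document: {document}\nSummary:")
    return "\n".join(parts)

if __name__ == "__main__":
    prompt = build_prompt("The museum reopened on Friday with a new exhibition on local history.")
    print(prompt)
    # output = plm.generate(prompt)  # hypothetical PLM call; the continuation is read as the summary
```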

Bibliographic Details
Main authors: Maynez, Joshua; Agrawal, Priyanka; Gehrmann, Sebastian
Format: Article
Language: English
Subjects: Computer Science - Computation and Language
DOI: 10.48550/arxiv.2306.16793
Date: 2023-06-29
Source: arXiv.org
Online access: https://arxiv.org/abs/2306.16793