AutoBencher: Creating Salient, Novel, Difficult Datasets for Language Models


Detailed Description

Bibliographic Details
Main Authors: Li, Xiang Lisa, Liu, Evan Zheran, Liang, Percy, Hashimoto, Tatsunori
Format: Article
Language: English
Subjects: Computer Science - Computation and Language; Computer Science - Learning
Online Access: Order full text
creator Li, Xiang Lisa
Liu, Evan Zheran
Liang, Percy
Hashimoto, Tatsunori
description Evaluation is critical for assessing capabilities, tracking scientific progress, and informing model selection. In this paper, we present three desiderata for a good benchmark for language models: (i) salience (e.g., knowledge about World War II is more salient than a random day in history), (ii) novelty (i.e., the benchmark reveals new trends in model rankings not shown by previous benchmarks), and (iii) difficulty (i.e., the benchmark should be difficult for existing models, leaving headroom for future improvement). We operationalize these three desiderata and cast benchmark creation as a search problem: that of finding benchmarks that satisfy all three desiderata. To tackle this search problem, we present AutoBencher, which uses a language model to automatically search for datasets that meet the three desiderata. AutoBencher uses privileged information (e.g., relevant documents) to construct reliable datasets, and adaptivity with reranking to optimize for the search objective. We use AutoBencher to create datasets for math, multilingual, and knowledge-intensive question answering. The scalability of AutoBencher allows it to test fine-grained categories and tail knowledge, creating datasets that are on average 27% more novel and 22% more difficult than existing benchmarks. A closer investigation of our constructed datasets shows that we can identify specific gaps in LM knowledge that are not captured by existing benchmarks, such as Gemini Pro performing much worse on question answering about the Permian Extinction and Fordism, and OpenAGI-7B performing surprisingly well on QA about COVID-19.
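The description above frames benchmark creation as a search over candidate topics scored on salience, novelty, and difficulty. As a rough illustration only, the Python sketch below shows what such a combined search objective could look like: novelty is approximated by how much the model ranking changes relative to prior benchmarks, difficulty by accuracy headroom, and the highest-scoring candidate dataset is kept. The scoring functions, equal weighting, topic names, and accuracy numbers are assumptions made for this example; they are not the paper's actual method or results.

from statistics import mean

def novelty(candidate_accs: dict, prior_accs: dict) -> float:
    """Hypothetical novelty proxy: how much the model ranking on the candidate
    dataset disagrees with the ranking on prior benchmarks (0 = identical)."""
    models = sorted(candidate_accs)
    rank = lambda accs: {m: r for r, m in enumerate(sorted(models, key=accs.get))}
    r_new, r_old = rank(candidate_accs), rank(prior_accs)
    return mean(abs(r_new[m] - r_old[m]) for m in models) / max(len(models) - 1, 1)

def difficulty(candidate_accs: dict) -> float:
    """Hypothetical difficulty proxy: 1 - mean accuracy (more headroom = harder)."""
    return 1.0 - mean(candidate_accs.values())

def objective(salience: float, candidate_accs: dict, prior_accs: dict) -> float:
    # Equal weighting of the three desiderata is an arbitrary illustrative choice.
    return salience + novelty(candidate_accs, prior_accs) + difficulty(candidate_accs)

# Toy reranking step: pick the better of two candidate topic datasets, given
# (made-up) salience scores and per-model accuracies, against prior-benchmark accuracies.
prior = {"model_a": 0.80, "model_b": 0.70, "model_c": 0.60}
candidates = {
    "Permian Extinction QA": (0.6, {"model_a": 0.35, "model_b": 0.55, "model_c": 0.40}),
    "World War II QA":       (0.9, {"model_a": 0.85, "model_b": 0.75, "model_c": 0.65}),
}
best = max(candidates, key=lambda t: objective(candidates[t][0], candidates[t][1], prior))
print(best)  # the harder, ranking-shuffling topic wins despite its lower salience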
doi_str_mv 10.48550/arxiv.2407.08351
format Article
fulltext fulltext_linktorsrc
identifier DOI: 10.48550/arxiv.2407.08351
language eng
recordid cdi_arxiv_primary_2407_08351
source arXiv.org
subjects Computer Science - Computation and Language
Computer Science - Learning
title AutoBencher: Creating Salient, Novel, Difficult Datasets for Language Models
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2024-12-30T22%3A34%3A32IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-arxiv_GOX&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=AutoBencher:%20Creating%20Salient,%20Novel,%20Difficult%20Datasets%20for%20Language%20Models&rft.au=Li,%20Xiang%20Lisa&rft.date=2024-07-11&rft_id=info:doi/10.48550/arxiv.2407.08351&rft_dat=%3Carxiv_GOX%3E2407_08351%3C/arxiv_GOX%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_id=info:pmid/&rfr_iscdi=true