State of What Art? A Call for Multi-Prompt LLM Evaluation
Published in: | Transactions of the Association for Computational Linguistics, 2024-08, Vol. 12, p. 933-949 |
Main authors: | , , , , , |
Format: | Article |
Language: | English |
Online access: | Full text |
Summary: | Recent advances in LLMs have led to an abundance of evaluation benchmarks, which typically rely on a single instruction template per task. We create a large-scale collection of instruction paraphrases and comprehensively analyze the brittleness introduced by single-prompt evaluations across 6.5M instances, involving 20 different LLMs and 39 tasks from 3 benchmarks. We find that different instruction templates lead to very different performance, both absolute and relative. Instead, we propose a set of diverse metrics on multiple instruction paraphrases, specifically tailored for different use cases (e.g., LLM vs. downstream development), ensuring a more reliable and meaningful assessment of LLM capabilities. We show that our metrics provide new insights into the strengths and limitations of current LLMs. |
ISSN: | 2307-387X |
DOI: | 10.1162/tacl_a_00681 |