How not to Lie with a Benchmark: Rearranging NLP Leaderboards
Main Authors: | , |
---|---|
Format: | Article |
Language: | English |
Subjects: | |
Online Access: | Order full text |
Abstract: | Comparison with a human is an essential requirement for a benchmark to be
a reliable measurement of model capabilities. Nevertheless, the methods used for
model comparison can have a fundamental flaw: the arithmetic mean of separate
metrics is applied uniformly across tasks that differ in complexity and in the
size of their test and training sets.
In this paper, we examine the overall scoring methods of popular NLP benchmarks
and rearrange the models by geometric and harmonic mean (appropriate for
averaging rates) according to their reported results. We analyze several popular
benchmarks, including GLUE, SuperGLUE, XGLUE, and XTREME. The analysis shows
that, for example, human level on SuperGLUE has still not been reached, and
there is still room for improvement for the current models. |
---|---|
DOI: | 10.48550/arxiv.2112.01342 |
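The abstract's core claim is that the choice of averaging can reorder a leaderboard: under the arithmetic mean, a model that excels on most tasks but fails badly on one can outrank a more uniform model, whereas the geometric and harmonic means penalize the weak task more heavily. The following is a minimal Python sketch of this re-aggregation idea, not the authors' code; the model names and per-task scores are made-up placeholders, not reported benchmark results.

```python
from statistics import mean, geometric_mean, harmonic_mean

# Hypothetical per-task scores (0-1 scale) for two models; these are
# made-up numbers, not reported GLUE/SuperGLUE results.
models = {
    "model_A": [0.95, 0.92, 0.90, 0.50],  # strong on most tasks, weak on one
    "model_B": [0.81, 0.81, 0.80, 0.80],  # uniformly moderate
}

# Re-aggregate the same per-task scores with three different means.
for name, scores in models.items():
    print(
        name,
        f"arithmetic={mean(scores):.3f}",
        f"geometric={geometric_mean(scores):.3f}",
        f"harmonic={harmonic_mean(scores):.3f}",
    )
```

With these hypothetical scores, model_A ranks first under the arithmetic mean but falls below model_B under the geometric and harmonic means, which illustrates why rearranging a leaderboard by a different mean can change the model order.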