CUGE: A Chinese Language Understanding and Generation Evaluation Benchmark
Saved in:

Main authors:
Format: Article
Language: English
Subjects:
Online access: Order full text
Abstract: Realizing general-purpose language intelligence has been a longstanding goal
for natural language processing, where standard evaluation benchmarks play a
fundamental and guiding role. We argue that for general-purpose language
intelligence evaluation, the benchmark itself needs to be comprehensive and
systematic. To this end, we propose CUGE, a Chinese Language Understanding and
Generation Evaluation benchmark with the following features: (1) Hierarchical
benchmark framework, where datasets are principally selected and organized with
a language capability-task-dataset hierarchy. (2) Multi-level scoring strategy,
where different levels of model performance are provided based on the
hierarchical framework. To facilitate CUGE, we provide a public leaderboard
that can be customized to support flexible model judging criteria. Evaluation
results on representative pre-trained language models indicate ample room for
improvement towards general-purpose language intelligence. CUGE is publicly
available at cuge.baai.ac.cn.
DOI: 10.48550/arxiv.2112.13610
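
The abstract describes a multi-level scoring strategy built on a capability-task-dataset hierarchy. As a rough illustration only, the Python sketch below averages hypothetical per-dataset scores up such a hierarchy to produce task-level, capability-level, and overall scores; the capability/task/dataset names and the plain averaging are assumptions for illustration, not CUGE's published datasets or its official scoring formula (see the paper and the leaderboard at cuge.baai.ac.cn for the authoritative definition).

```python
# Illustrative sketch only: aggregate per-dataset scores up a
# capability -> task -> dataset hierarchy by simple averaging.
# The hierarchy contents and the averaging scheme are assumptions,
# not CUGE's actual datasets or official scoring formula.

from statistics import mean

# Hypothetical hierarchy: capability -> task -> {dataset: normalized score in [0, 1]}
scores = {
    "language_understanding": {
        "reading_comprehension": {"dataset_a": 0.71, "dataset_b": 0.64},
        "text_classification": {"dataset_c": 0.83},
    },
    "language_generation": {
        "summarization": {"dataset_d": 0.39},
        "dialogue": {"dataset_e": 0.47, "dataset_f": 0.52},
    },
}


def multi_level_scores(hierarchy):
    """Return (overall, per-capability, per-task) scores by averaging each level."""
    task_scores = {}
    capability_scores = {}
    for capability, tasks in hierarchy.items():
        for task, datasets in tasks.items():
            # Task score: mean over that task's datasets.
            task_scores[(capability, task)] = mean(datasets.values())
        # Capability score: mean over that capability's tasks.
        capability_scores[capability] = mean(
            task_scores[(capability, task)] for task in tasks
        )
    # Overall score: mean over capabilities.
    overall = mean(capability_scores.values())
    return overall, capability_scores, task_scores


overall, by_capability, by_task = multi_level_scores(scores)
print(f"overall: {overall:.3f}")
for capability, score in by_capability.items():
    print(f"  {capability}: {score:.3f}")
```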