DHP Benchmark: Are LLMs Good NLG Evaluators?
Saved in:
Main authors: , , , , , , , , ,
Format: Article
Language: eng
Subjects:
Online access: Order full text
Summary: Large Language Models (LLMs) are increasingly serving as evaluators in Natural Language Generation (NLG) tasks. However, the capabilities of LLMs in scoring NLG quality remain inadequately explored. Current studies depend on human assessments and simple metrics that fail to capture the discernment of LLMs across diverse NLG tasks. To address this gap, we propose the Discernment of Hierarchical Perturbation (DHP) benchmarking framework, which provides quantitative discernment scores for LLMs using hierarchically perturbed text data and statistical tests to measure the NLG evaluation capabilities of LLMs systematically. We have re-established six evaluation datasets for this benchmark, covering four NLG tasks: Summarization, Story Completion, Question Answering, and Translation. Our comprehensive benchmarking of five major LLM series provides critical insight into their strengths and limitations as NLG evaluators.
DOI: 10.48550/arxiv.2408.13704
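The record gives no implementation details, but the abstract's core idea (have an LLM score original texts against hierarchically perturbed versions and apply a statistical test to the paired scores) can be illustrated with a minimal sketch. Everything below is an assumption for illustration: the `score_with_llm` function is a hypothetical placeholder, and the Wilcoxon signed-rank test stands in for whatever statistical tests the paper actually uses.

```python
# Minimal illustrative sketch, not the paper's implementation.
from scipy.stats import wilcoxon


def score_with_llm(text: str) -> float:
    """Hypothetical placeholder: prompt an LLM judge and parse a numeric quality score."""
    raise NotImplementedError("plug in your LLM-as-a-judge call here")


def discernment_check(originals, perturb):
    """Compare LLM scores of original texts vs. their perturbed versions.

    `perturb` applies one level of quality-degrading perturbation.
    An evaluator with good discernment should score originals higher,
    yielding a small p-value in the paired test.
    """
    original_scores = [score_with_llm(t) for t in originals]
    perturbed_scores = [score_with_llm(perturb(t)) for t in originals]
    # Paired one-sided test: are original scores greater than perturbed ones?
    stat, p_value = wilcoxon(original_scores, perturbed_scores, alternative="greater")
    return stat, p_value
```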