Large Language Models are Diverse Role-Players for Summarization Evaluation
Format: Article
Language: English
Abstract: Text summarization has a wide range of applications in many scenarios. Evaluating the quality of generated text is a complex problem, and a major challenge is the clear divergence between existing metrics and human evaluation. A document summary's quality can be assessed by human annotators on various criteria, both objective ones like grammar and correctness, and subjective ones like informativeness, succinctness, and appeal. Most automatic evaluation methods, such as BLEU/ROUGE, may not be able to adequately capture these dimensions. In this paper, we propose a new evaluation framework based on LLMs, which provides a comprehensive evaluation by comparing generated text and reference text from both objective and subjective aspects. First, we propose to model the objective and subjective dimensions of generated text with a roleplayer prompting mechanism. Furthermore, we introduce a context-based prompting mechanism that generates dynamic roleplayer profiles from the input context. Finally, we design a multi-roleplayer prompting technique based on batch prompting and integrate the multiple outputs into the final evaluation result. Experimental results on three real summarization datasets show that our model is highly competitive and achieves very high consistency with human annotators.
DOI: 10.48550/arxiv.2303.15078
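
The abstract describes scoring a candidate summary against a reference from several roleplayer perspectives (objective criteria like grammar, subjective ones like informativeness) and aggregating the votes into a final score. Below is a minimal sketch of that idea, not the paper's actual method: the `query_llm` placeholder, the profile texts, and the 1-5 scoring scale are all illustrative assumptions.

```python
# Illustrative sketch of roleplayer-based summary evaluation.
# query_llm is a hypothetical stand-in for any chat-completion API;
# the profiles and prompt wording are assumptions, not the paper's prompts.

from statistics import mean

# Static roleplayer profiles covering objective and subjective dimensions.
ROLEPLAYERS = {
    "grammarian": "You are a strict copy editor who scores grammar and correctness.",
    "casual_reader": "You are a busy reader who scores informativeness and appeal.",
    "editor": "You are a news editor who scores succinctness.",
}


def query_llm(system_prompt: str, user_prompt: str) -> str:
    """Placeholder for a real LLM call (e.g. a chat-completion endpoint)."""
    raise NotImplementedError("wire this to your LLM provider")


def build_prompt(reference: str, candidate: str) -> str:
    """Compare candidate against reference and request a single integer score."""
    return (
        "Reference summary:\n" + reference + "\n\n"
        "Candidate summary:\n" + candidate + "\n\n"
        "Compare the candidate with the reference and reply with a single "
        "integer score from 1 (poor) to 5 (excellent)."
    )


def evaluate(reference: str, candidate: str) -> float:
    """Ask each roleplayer for a score and average the votes."""
    scores = []
    for profile in ROLEPLAYERS.values():
        reply = query_llm(profile, build_prompt(reference, candidate))
        digits = [int(ch) for ch in reply if ch.isdigit()]
        if digits:
            scores.append(digits[0])  # take the first digit as the score
    return mean(scores) if scores else 0.0
```

The paper additionally derives roleplayer profiles dynamically from the input context and batches multiple roleplayers into one prompt; the static profiles and per-role calls above are a simplification of that design.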