Accurate and Nuanced Open-QA Evaluation Through Textual Entailment
Saved in:
Main authors:
Format: Article
Language: English
Subjects:
Online access: Order full text
Summary: Open-domain question answering (Open-QA) is a common task for evaluating large language models (LLMs). However, current Open-QA evaluations are criticized for the ambiguity in questions and the lack of semantic understanding in evaluators. Complex evaluators, powered by foundation models or LLMs and pertaining to semantic equivalence, still deviate from human judgments by a large margin. We propose to study the entailment relations of answers to identify more informative and more general system answers, offering a much closer evaluation to human judgment on both NaturalQuestions and TriviaQA while being learning-free. The entailment-based evaluation we propose allows the assignment of bonus or partial marks by quantifying the inference gap between answers, enabling a nuanced ranking of answer correctness that has higher AUC than current methods.
DOI: 10.48550/arxiv.2405.16702
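As a concrete illustration of the entailment-based marking described in the summary, the sketch below scores a system answer against a gold answer by checking entailment in both directions with an off-the-shelf NLI model. The model choice (roberta-large-mnli), the question-based statement template, the 0.5 threshold, and the 1.0/0.5/0.0 mark values are assumptions made for illustration only; the paper's actual evaluator and scoring scheme may differ.

```python
# Minimal sketch of entailment-based answer marking for Open-QA evaluation.
# Assumptions (not from the paper): roberta-large-mnli as the NLI model,
# the statement template, the 0.5 threshold, and the 1.0/0.5/0.0 marks.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-large-mnli")
model = AutoModelForSequenceClassification.from_pretrained("roberta-large-mnli")
model.eval()


def entailment_prob(premise: str, hypothesis: str) -> float:
    """Probability that `premise` entails `hypothesis` under the NLI model."""
    inputs = tokenizer(premise, hypothesis, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    probs = torch.softmax(logits, dim=-1)[0]
    # roberta-large-mnli label order: 0=contradiction, 1=neutral, 2=entailment
    return probs[2].item()


def mark_answer(question: str, system_answer: str, gold_answer: str,
                threshold: float = 0.5) -> float:
    """Assign full, partial, or zero credit from the direction of entailment."""
    # Wrap short answers in full statements so premise and hypothesis are
    # well-formed sentences (this templating is an assumption of the sketch).
    sys_stmt = f"The answer to '{question}' is {system_answer}."
    gold_stmt = f"The answer to '{question}' is {gold_answer}."
    sys_entails_gold = entailment_prob(sys_stmt, gold_stmt)
    gold_entails_sys = entailment_prob(gold_stmt, sys_stmt)
    if sys_entails_gold >= threshold:
        # System answer is at least as informative as the gold answer
        # (equivalent or more specific): full credit.
        return 1.0
    if gold_entails_sys >= threshold:
        # System answer is more general than the gold answer: partial credit.
        return 0.5
    return 0.0  # No entailment in either direction: incorrect.


if __name__ == "__main__":
    # A more specific system answer entails the gold answer and gets full credit.
    print(mark_answer("Where is the Eiffel Tower?", "in Paris, France", "Paris"))
```

The two directed entailment checks are what distinguish this from exact-match or equivalence-only scoring: they separate answers that are more informative than the gold answer from answers that are merely more general, which is the inference gap the summary refers to.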