RAG-QA Arena: Evaluating Domain Robustness for Long-form Retrieval Augmented Question Answering
Format: Article
Language: English
Abstract: Question answering based on retrieval augmented generation (RAG-QA) is an important research topic in NLP and has a wide range of real-world applications. However, most existing datasets for this task are either constructed using a single source corpus or consist of short extractive answers, which fall short of evaluating large language model (LLM) based RAG-QA systems on cross-domain generalization. To address these limitations, we create Long-form RobustQA (LFRQA), a new dataset comprising human-written long-form answers that integrate short extractive answers from multiple documents into a single, coherent narrative, covering 26K queries and large corpora across seven different domains. We further propose RAG-QA Arena by directly comparing model-generated answers against LFRQA's answers using LLMs as evaluators. We show via extensive experiments that RAG-QA Arena and human judgments on answer quality are highly correlated. Moreover, only 41.3% of the most competitive LLM's answers are preferred to LFRQA's answers, demonstrating RAG-QA Arena as a challenging evaluation platform for future research.
DOI: 10.48550/arxiv.2407.13998
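
The record does not include the authors' code, but the pairwise evaluation protocol the abstract describes (an LLM judge choosing between a model-generated answer and the LFRQA reference answer, with the fraction of model wins reported as the arena score) can be sketched as below. This is a minimal illustration only: the field names, the `judge` callable, and the toy example are assumptions, not the paper's actual implementation.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class ArenaExample:
    # One evaluation item (hypothetical field names, not the LFRQA schema).
    query: str
    lfrqa_answer: str   # human-written long-form reference answer
    model_answer: str   # answer produced by the RAG-QA system under test


def pairwise_win_rate(
    examples: Iterable[ArenaExample],
    judge: Callable[[str, str, str], str],
) -> float:
    """Fraction of examples where the judge prefers the model answer.

    `judge(query, answer_a, answer_b)` stands in for any LLM-backed
    comparator that returns "a" or "b"; in practice one would also swap
    the answer order to reduce position bias (not shown here).
    """
    wins, total = 0, 0
    for ex in examples:
        verdict = judge(ex.query, ex.model_answer, ex.lfrqa_answer)
        wins += verdict == "a"   # "a" = model answer preferred
        total += 1
    return wins / total if total else 0.0


if __name__ == "__main__":
    # Toy judge that always prefers the reference answer, just to show the call shape.
    toy = [ArenaExample("example query", "reference answer", "model answer")]
    print(pairwise_win_rate(toy, lambda q, a, b: "b"))  # -> 0.0
```

Under this reading, the 41.3% figure in the abstract corresponds to the win rate of the strongest evaluated LLM against the LFRQA reference answers.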