A Dataset for Evaluating LLM-based Evaluation Functions for Research Question Extraction Task
| Main Authors: | , , , |
|---|---|
| Format: | Article |
| Language: | English |
| Subjects: | |
| Online Access: | Order full text |
| Summary: | The progress in text summarization techniques has been remarkable. However, the task of accurately extracting and summarizing the necessary information from highly specialized documents such as research papers has not been sufficiently investigated. We focus on the task of extracting research questions (RQs) from research papers and construct a new dataset consisting of machine learning papers, RQs extracted from these papers by GPT-4, and human evaluations of the extracted RQs from multiple perspectives. Using this dataset, we systematically compared recently proposed LLM-based evaluation functions for summarization and found that none of them correlated sufficiently highly with human evaluations. We expect our dataset to provide a foundation for further research on developing better evaluation functions tailored to the RQ extraction task and to contribute to improving performance on the task. The dataset is available at https://github.com/auto-res/PaperRQ-HumanAnno-Dataset. |
| DOI: | 10.48550/arxiv.2409.06883 |
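
The abstract's central analysis is measuring how well LLM-based evaluation functions correlate with human evaluations of the extracted RQs. Below is a minimal Python sketch of such a comparison, not the authors' actual code: the field names (`human_score`, `llm_eval_score`) and the file name (`annotations.json`) are assumptions, since the real schema of the PaperRQ-HumanAnno-Dataset repository is not described in this record.

```python
# Sketch: correlate an LLM-based evaluation function's scores with human
# ratings of extracted research questions. Field names and file layout are
# hypothetical, not the actual PaperRQ-HumanAnno-Dataset schema.

import json
from scipy.stats import pearsonr, spearmanr

def correlate_with_humans(records):
    """Return (Pearson r, Spearman rho) between human and automatic scores."""
    human = [r["human_score"] for r in records]    # assumed field name
    auto = [r["llm_eval_score"] for r in records]  # assumed field name
    pearson, _ = pearsonr(human, auto)
    spearman, _ = spearmanr(human, auto)
    return pearson, spearman

if __name__ == "__main__":
    # Hypothetical file name; the repository's actual layout may differ.
    with open("annotations.json") as f:
        records = json.load(f)
    pearson, spearman = correlate_with_humans(records)
    print(f"Pearson r = {pearson:.3f}, Spearman rho = {spearman:.3f}")
```

Rank correlation such as Spearman's rho is often the more informative of the two here, since human ratings collected on a discrete scale are ordinal rather than interval-valued.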