A survey on evaluation of summarization methods
Published in: | Information processing & management 2019-09, Vol.56 (5), p.1794-1814 |
---|---|
Main authors: | , , |
Format: | Article |
Language: | English |
Subjects: | |
Online access: | Full text |
Abstract: | • Manual assessment is not re-usable. • Re-use of the gold standard by non-participants is often problematic. • Overlap-based metrics are not suitable for full text comparison-based evaluation. • GRAD exceeds word-based metrics to distinguish between generated and human written summaries. • Overlap metrics and GRAD can identify native abstracts among ones from different texts. • Existing metrics, except GEM, have relative values and so are not interpretable. • The majority of the metrics are normalized, but in practice, their values tend to 0.
The increasing volume of textual information on any topic requires its compression to allow humans to digest it. This implies detecting the most important information and condensing it. These challenges have led to new developments in the area of Natural Language Processing (NLP) and Information Retrieval (IR) such as narrative summarization and evaluation methodologies for narrative extraction. Despite some progress over recent years with several solutions for information extraction and text summarization, the problems of generating consistent narrative summaries and evaluating them are still unresolved. With regard to evaluation, manual assessment is expensive, subjective and not applicable in real time or to large collections. Moreover, it does not provide re-usable benchmarks. Nevertheless, commonly used metrics for summary evaluation still imply substantial human effort since they require a comparison of candidate summaries with a set of reference summaries. The contributions of this paper are three-fold. First, we provide a comprehensive overview of existing metrics for summary evaluation. We discuss several limitations of existing frameworks for summary evaluation. Second, we introduce an automatic framework for the evaluation of metrics that does not require any human annotation. Finally, we evaluate the existing assessment metrics on a Wikipedia data set and a collection of scientific articles using this framework. Our findings show that the majority of existing metrics based on vocabulary overlap are not suitable for assessment based on comparison with a full text and we discuss this outcome. |
ISSN: | 0306-4573, 1873-5371 |
DOI: | 10.1016/j.ipm.2019.04.001 |