How far we can go with extractive text summarization? Heuristic methods to obtain near upper bounds


Bibliographic details
Published in: Expert Systems with Applications 2017-12, Vol.90, p.439-463
Main authors: Wang, W.M., Li, Z., Wang, J.W., Zheng, Z.H.
Format: Article
Language: eng
Subjects:
Online access: Full text
Description
Summary:
• We propose 9 heuristics to construct near upper bounds for extractive summarization.
• The proposed methods are faster than exhaustive search and produce results close to it.
• We construct the sentence-based and word-based near upper bounds for 5 different corpora.
• We evaluate 6 general extractive summarization methods across the 5 corpora.

Extractive text summarization is an effective way to automatically reduce a text to a summary by selecting a subset of the text. The performance of a summarization system is usually evaluated by comparing its output with the human-constructed extractive summaries provided in annotated text datasets. However, for datasets where the abstract is written for readers, the system is evaluated against an abstract that a human has written in their own words. This makes it difficult to determine how far the state-of-the-art extractive methods are from the upper bound that an ideal extractive method might achieve. In addition, the performance of an extractive method differs from domain to domain, which makes it difficult to benchmark. Previous studies construct an ideal sentence-based extract of a document, the one that gives the best score on a given metric, by exhaustively searching all possible sentence combinations of a given length. They then use the performance of that extract as the sentence-based upper bound. However, this is only feasible for short texts. For long texts and multiple documents, previous studies rely on manual effort, which is expensive and time consuming. In this paper, we propose nine fast heuristic methods to generate near-ideal sentence-based extracts for long texts and multiple documents. Furthermore, we propose an n-gram construction method to construct the word-based upper bound. A percentage ranking method is used to benchmark different extractive methods across different corpora. In the experiments, five different corpora are used. The results show that the near upper bounds constructed by the proposed methods are close to those obtained by exhaustive search, while the proposed methods are much faster. Six general extractive summarization methods were also assessed to demonstrate the gap between their performance and the near upper bounds.
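
To make the contrast between exhaustive search and a fast heuristic concrete, the following is a minimal sketch under stated assumptions. The record does not specify the paper's nine heuristics or its evaluation metric, so the greedy selection and the clipped unigram recall score used here (a simplified ROUGE-1-style measure) are illustrative stand-ins, and all function names are hypothetical.

```python
from collections import Counter
from itertools import combinations


def unigram_recall(candidate_sentences, reference):
    """Clipped unigram recall of the candidate against the reference.

    Assumption: a simplified ROUGE-1-style score on whitespace tokens,
    not the metric used in the paper.
    """
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(" ".join(candidate_sentences).lower().split())
    overlap = sum(min(count, cand_counts[word]) for word, count in ref_counts.items())
    return overlap / max(1, sum(ref_counts.values()))


def exhaustive_extract(sentences, reference, k):
    """Score every k-sentence combination; feasible only for short texts."""
    return max(
        combinations(sentences, k),
        key=lambda combo: unigram_recall(combo, reference),
    )


def greedy_extract(sentences, reference, k):
    """Illustrative heuristic (an assumption, not one of the paper's nine):
    repeatedly add the sentence that yields the highest score so far."""
    chosen = []
    remaining = list(sentences)
    for _ in range(min(k, len(remaining))):
        best = max(remaining, key=lambda s: unigram_recall(chosen + [s], reference))
        chosen.append(best)
        remaining.remove(best)
    return chosen


if __name__ == "__main__":
    document = [
        "the cat sat on the mat",
        "dogs bark loudly",
        "a cat is on a mat",
        "the weather is nice",
    ]
    abstract = "the cat is on the mat"
    ideal = exhaustive_extract(document, abstract, 2)
    near_ideal = greedy_extract(document, abstract, 2)
    print(unigram_recall(ideal, abstract), unigram_recall(near_ideal, abstract))
```

The exhaustive variant scores every combination of k sentences out of n, which grows combinatorially, whereas the greedy variant performs at most k passes over the remaining sentences. A greedy extract can never score higher than the exhaustive one, which is why its score is only a near upper bound.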
ISSN: 0957-4174, 1873-6793
DOI: 10.1016/j.eswa.2017.08.040