SummCoder: An unsupervised framework for extractive text summarization based on deep auto-encoders

Bibliographic details
Published in: Expert Systems with Applications, 2019-09, Vol. 129, pp. 200-215
Main authors: Joshi, Akanksha; Fidalgo, E.; Alegre, E.; Fernández-Robles, Laura
Format: Article
Language: English
Description
Abstract:
• An unsupervised text summarization framework based on deep neural networks.
• Vector representation of sentences using recurrent neural networks.
• Summary generated using three sentence features: relevance, novelty, and position.
• Deep auto-encoders are exploited for computing sentence content relevance.
• A new text summarization dataset is introduced from darknet domains.

In this paper, we propose SummCoder, a novel methodology for generic extractive text summarization of single documents. The approach generates a summary according to three sentence selection metrics formulated by us: sentence content relevance, sentence novelty, and sentence position relevance. Sentence content relevance is measured using a deep auto-encoder network, and the novelty metric is derived from the similarity among sentences represented as embeddings in a distributed semantic space. The sentence position relevance metric is a hand-designed feature that assigns more weight to the first few sentences through a dynamic weight calculation function regulated by the document length. Furthermore, a sentence ranking and selection technique is developed to generate the document summary by ranking the sentences according to the final score obtained by fusing the three sentence selection metrics. We also introduce a new summarization benchmark, the Tor Illegal Documents Summarization (TIDSumm) dataset, intended especially to assist Law Enforcement Agencies (LEAs). It contains two sets of manually created ground truth summaries for 100 web documents extracted from onion websites in the Tor (The Onion Router) network. Empirical results show that, on the DUC 2002, Blog Summarization, and TIDSumm datasets, our text summarization approach obtains comparable or better performance than state-of-the-art methods for different ROUGE metrics.
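The abstract names the three metrics and their fusion but gives no formulas, so the following Python sketch is only an illustration of the general idea, not the authors' implementation. Every function name here is hypothetical. The relevance scores are taken as an input (in the paper they come from a deep auto-encoder), novelty is assumed to be one minus the highest cosine similarity to any earlier sentence, the position weight is an assumed exponential decay whose rate depends on document length, and fusion is assumed to be a plain product.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def novelty_scores(embeddings):
    """Assumed novelty: 1 - max similarity to any earlier sentence."""
    scores = []
    for i, emb in enumerate(embeddings):
        max_sim = max((cosine(emb, prev) for prev in embeddings[:i]), default=0.0)
        scores.append(1.0 - max(0.0, max_sim))
    return scores


def position_scores(n):
    """Assumed position weight: decays from 1.0, more slowly for longer documents."""
    scale = max(1.0, math.sqrt(n))
    return [math.exp(-i / scale) for i in range(n)]


def summarize(sentences, embeddings, relevance, k):
    """Rank sentences by the fused score and return the top k in document order."""
    novelty = novelty_scores(embeddings)
    position = position_scores(len(sentences))
    # Assumed fusion rule: product of the three per-sentence scores.
    fused = [r * n * p for r, n, p in zip(relevance, novelty, position)]
    top = sorted(range(len(sentences)), key=lambda i: fused[i], reverse=True)[:k]
    return [sentences[i] for i in sorted(top)]


if __name__ == "__main__":
    sents = ["a", "b", "c", "d"]
    embs = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]
    rel = [0.9, 0.9, 0.8, 0.1]  # placeholder for auto-encoder relevance scores
    print(summarize(sents, embs, rel, k=2))
```

In this toy input, the second sentence duplicates the first, so its assumed novelty score drops to zero and it is excluded from the summary despite its high relevance.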
ISSN: 0957-4174, 1873-6793
DOI: 10.1016/j.eswa.2019.03.045