SummCoder: An unsupervised framework for extractive text summarization based on deep auto-encoders
Published in: | Expert systems with applications 2019-09, Vol.129, p.200-215 |
---|---|
Main authors: | , , , |
Format: | Article |
Language: | eng |
Summary: | • An unsupervised text summarization framework based on deep neural networks.
• Vector representation of sentences using recurrent neural networks.
• Summary generated using three sentence features: relevance, novelty, and position.
• Deep auto-encoders are exploited for computing sentence content relevance.
• A new text summarization dataset is introduced from darknet domains.
In this paper, we propose SummCoder, a novel methodology for generic extractive text summarization of single documents. The approach generates a summary according to three sentence selection metrics that we formulate: sentence content relevance, sentence novelty, and sentence position relevance. Sentence content relevance is measured using a deep auto-encoder network, and the novelty metric is derived from the similarity among sentences represented as embeddings in a distributed semantic space. The sentence position relevance metric is a hand-designed feature that assigns more weight to the first few sentences through a dynamic weight calculation function regulated by the document length. Furthermore, a sentence ranking and selection technique is developed to generate the document summary by ranking the sentences according to the final score obtained by fusing the three sentence selection metrics. We also introduce a new summarization benchmark, the Tor Illegal Documents Summarization (TIDSumm) dataset, intended especially to assist Law Enforcement Agencies (LEAs); it contains two sets of manually created ground-truth summaries for 100 web documents extracted from onion websites in the Tor (The Onion Router) network. Empirical results show that, on the DUC 2002, Blog Summarization, and TIDSumm datasets, our text summarization approach obtains comparable or better performance than the state-of-the-art methods for different ROUGE metrics. |
---|---|
ISSN: | 0957-4174 1873-6793 |
DOI: | 10.1016/j.eswa.2019.03.045 |
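
The summary above describes scoring each sentence by three metrics and ranking by their fusion, but it gives no formulas. The following Python sketch illustrates only that general shape; every formula in it is an assumption, not the paper's method. In particular, `content_relevance` uses cosine similarity to the document centroid as a stand-in for the deep auto-encoder score, `position_relevance` uses a hypothetical exponential decay scaled by document length, and `fuse` is a hypothetical unweighted average. Sentence embeddings are assumed to be supplied as plain lists of floats.

```python
import math


def cosine(u, v):
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def content_relevance(index, embeddings):
    # Stand-in (assumption): similarity to the document centroid.
    # The paper measures this with a deep auto-encoder instead.
    dim = len(embeddings[0])
    centroid = [sum(e[d] for e in embeddings) / len(embeddings) for d in range(dim)]
    return cosine(embeddings[index], centroid)


def novelty(index, embeddings):
    # Assumption: a sentence is novel if it is dissimilar to all earlier ones.
    if index == 0:
        return 1.0
    closest = max(cosine(embeddings[index], embeddings[j]) for j in range(index))
    return 1.0 - max(0.0, closest)


def position_relevance(index, n_sentences):
    # Hypothetical decay: early sentences weigh more, and the decay
    # slows down as the document gets longer.
    return math.exp(-index / math.sqrt(n_sentences))


def fuse(relevance, nov, position):
    # Hypothetical fusion: unweighted average of the three metrics.
    return (relevance + nov + position) / 3.0


def summarize(embeddings, k):
    """Return indices of the k top-scoring sentences, in document order."""
    n = len(embeddings)
    scores = [
        fuse(content_relevance(i, embeddings), novelty(i, embeddings),
             position_relevance(i, n))
        for i in range(n)
    ]
    ranked = sorted(range(n), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


if __name__ == "__main__":
    # Toy 2-d "embeddings"; sentence 1 duplicates sentence 0.
    toy = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
    print(summarize(toy, 2))
```

Returning the selected indices in document order keeps the extracted sentences readable as a summary. Swapping in a different fusion rule or a learned relevance score only requires replacing `fuse` or `content_relevance`.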