An unsupervised method for extractive multi-document summarization based on centroid approach and sentence embeddings

Bibliographic Details
Published in: Expert systems with applications, 2021-04, Vol. 167, p. 114152, Article 114152
Main Authors: Lamsiyah, Salima; El Mahdaouy, Abdelkader; Espinasse, Bernard; El Alaoui Ouatik, Saïd
Format: Article
Language: English
Online access: Full text
Description: Extractive multi-document summarization (MDS) is the process of automatically summarizing a collection of documents by ranking sentences according to their importance and informativeness. Text representation is a fundamental process that affects the effectiveness of many text summarization methods. Word embedding representations have been shown to be effective for several Natural Language Processing (NLP) tasks, including Automatic Text Summarization (ATS). However, most of these representations do not consider the order of words or the semantic relationships between words in a sentence, and therefore do not fully capture sentence semantics or the syntactic relationships between sentence constituents. To overcome this problem, this paper proposes an unsupervised method for generic extractive multi-document summarization based on sentence embedding representations and the centroid approach. The proposed method selects relevant sentences according to a final score obtained by combining three scores: sentence content relevance, sentence novelty, and sentence position. The sentence content relevance score is computed as the cosine similarity between the centroid embedding vector of the document cluster and the sentence embedding vectors. The sentence novelty metric is explicitly adopted to deal with redundancy. The sentence position metric assumes that the first sentences of a document are more relevant to the summary and assigns them higher scores. Moreover, the paper provides a comparative analysis of nine sentence embedding models used to represent sentences as dense vectors in a low-dimensional vector space in the context of extractive multi-document summarization. Experiments are performed on the standard DUC’2002–2004 benchmark datasets and the Multi-News dataset. The overall results show that the method outperforms several state-of-the-art methods and achieves promising results compared to the best performing methods, including supervised deep-learning-based methods.

Highlights:
• An unsupervised method for extractive multi-document summarization.
• Pre-trained sentence embedding models are used for sentence representations.
• The centroid approach is applied to compute the sentence content relevance score.
• Sentence selection is based on sentence relevance, novelty, and position scores.
• The use of sentence embedding methods leads to significant improvements.
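The scoring scheme described above lends itself to a compact illustration. The Python sketch below computes the content relevance score as the cosine similarity between each sentence vector and the centroid of the document cluster, adds a position score that favours early sentences, and penalizes redundancy with a novelty term during greedy selection. It is a minimal sketch, not the authors' implementation: the weighting coefficients (alpha, beta, gamma), the 1/(position+1) decay, the greedy selection loop, and all function names are illustrative assumptions, and the sentence vectors would in practice come from one of the pre-trained sentence embedding models compared in the paper.

import numpy as np


def cosine(u, v, eps=1e-10):
    # Cosine similarity between two 1-D vectors.
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + eps))


def summarize(sent_vecs, positions, k=3, alpha=0.6, beta=0.2, gamma=0.2):
    # Greedily pick k sentence indices using a weighted combination of
    # content relevance (cosine similarity to the cluster centroid),
    # novelty (dissimilarity to already selected sentences), and position
    # (earlier sentences in a document score higher).
    #
    # sent_vecs : (n, d) array of pre-computed sentence embeddings
    # positions : 0-based position of each sentence within its document
    # alpha, beta, gamma : illustrative weights (assumed, not from the paper)
    centroid = sent_vecs.mean(axis=0)
    relevance = np.array([cosine(v, centroid) for v in sent_vecs])
    position = np.array([1.0 / (p + 1) for p in positions])

    selected = []
    for _ in range(min(k, len(sent_vecs))):
        best, best_score = None, -np.inf
        for i in range(len(sent_vecs)):
            if i in selected:
                continue
            novelty = 1.0
            if selected:
                # Penalize sentences similar to those already selected.
                novelty = 1.0 - max(cosine(sent_vecs[i], sent_vecs[j]) for j in selected)
            score = alpha * relevance[i] + beta * novelty + gamma * position[i]
            if score > best_score:
                best, best_score = i, score
        selected.append(best)
    return selected


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    vecs = rng.normal(size=(12, 768))   # stand-in for pre-trained sentence embeddings
    print(summarize(vecs, positions=list(range(12)), k=3))

The random vectors in the demo only exercise the interface; with real embeddings from a pre-trained sentence encoder, the returned indices identify the sentences to include in the summary, and the weights would be tuned on a development set.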
DOI: 10.1016/j.eswa.2020.114152
Publisher: New York: Elsevier Ltd
Rights: 2020 Elsevier Ltd; Copyright Elsevier BV, April 1, 2021; distributed under a Creative Commons Attribution 4.0 International License
Open access: View record in HAL (https://hal.science/hal-03604013)
ISSN: 0957-4174
EISSN: 1873-6793
Source: Access via ScienceDirect (Elsevier)
Subjects: Artificial Intelligence
Centroid approach
Centroids
Computation and Language
Computer Science
Datasets
Embedding
Extractive text summarization
Methods
Natural language processing
Redundancy
Representations
Semantics
Sentence embeddings
Sentences
Transfer learning
Word embeddings