Automatic keyphrase extraction using word embeddings

Bibliographic details
Published in: Soft computing (Berlin, Germany), 2020-04, Vol.24 (8), p.5593-5608
Main authors: Zhang, Yuxiang; Liu, Huan; Wang, Suge; Ip, W. H.; Fan, Wei; Xiao, Chunjing
Format: Article
Language: English
Description
Abstract: Unsupervised random-walk keyphrase extraction models rely mainly on global structural information of the word graph, with nodes representing candidate words and edges capturing the co-occurrence information between candidate words. However, using word embedding methods to integrate multiple kinds of useful information into the random-walk model to improve keyphrase extraction is relatively unexplored. In this paper, we propose a random-walk-based ranking method to extract keyphrases from text documents using word embeddings. Specifically, we first design a heterogeneous text graph embedding model to integrate local context information of the word graph (i.e., the local word collocation patterns) with some crucial features of the candidate words and edges of the word graph. Then, a novel random-walk-based ranking model is designed to score candidate words by leveraging the learned word embeddings. Finally, a new and generic similarity-based phrase scoring model using word embeddings is proposed to score phrases, and the top-scoring phrases are selected as keyphrases. Experimental results show that the proposed method consistently outperforms eight state-of-the-art unsupervised methods on three real datasets for keyphrase extraction.
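The general idea of the pipeline summarized in the abstract (word graph, embedding-informed random walk, phrase scoring) can be illustrated with a minimal sketch. This is not the authors' model: the record does not give their formulas, so the sketch assumes a plain PageRank-style walk whose edge weights are co-occurrence counts multiplied by cosine similarity, and it scores a phrase by summing its word scores instead of using the paper's similarity-based phrase model. The function names, the window size, the damping factor, and the toy two-dimensional embeddings are all hypothetical.

```python
import math
from collections import defaultdict


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def build_graph(tokens, embeddings, window=2):
    """Undirected word graph; edge weight = co-occurrence count * embedding similarity.

    The weighting scheme is an assumption made for illustration only.
    """
    counts = defaultdict(float)
    for i, word in enumerate(tokens):
        for j in range(i + 1, min(i + window + 1, len(tokens))):
            if tokens[j] != word:
                counts[frozenset((word, tokens[j]))] += 1.0
    graph = defaultdict(dict)
    for pair, count in counts.items():
        a, b = tuple(pair)
        # Clamp to a small positive value so every edge keeps a positive weight.
        weight = count * max(cosine(embeddings[a], embeddings[b]), 1e-6)
        graph[a][b] = weight
        graph[b][a] = weight
    return graph


def random_walk_scores(graph, damping=0.85, iterations=50):
    """PageRank-style power iteration over the weighted word graph."""
    nodes = list(graph)
    n = len(nodes)
    scores = {w: 1.0 / n for w in nodes}
    out_weight = {w: sum(graph[w].values()) for w in nodes}
    for _ in range(iterations):
        scores = {
            w: (1 - damping) / n
            + damping * sum(scores[v] * graph[v][w] / out_weight[v] for v in graph[w])
            for w in nodes
        }
    return scores


def score_phrase(phrase, scores):
    """Stand-in phrase score: sum of the word scores (not the paper's model)."""
    return sum(scores.get(w, 0.0) for w in phrase.split())


# Toy usage with made-up tokens and embeddings.
tokens = ["graph", "embedding", "model", "graph", "ranking", "model"]
embeddings = {
    "graph": [1.0, 0.2],
    "embedding": [0.9, 0.4],
    "model": [0.5, 0.8],
    "ranking": [0.1, 1.0],
}
word_scores = random_walk_scores(build_graph(tokens, embeddings))
ranked = sorted(word_scores, key=word_scores.get, reverse=True)
```

Because each node's outgoing weights are normalized to sum to one, the word scores stay a probability distribution across iterations, which is what lets them be read as the stationary visiting frequencies of the walk.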
ISSN: 1432-7643, 1433-7479
DOI: 10.1007/s00500-019-03963-y