General-use unsupervised keyword extraction model for keyword analysis

Bibliographic details
Published in:	Expert Systems with Applications, 2023-12, Vol. 233, p. 120889, Article 120889
Main authors:	Shin, Hunsik; Lee, Hye Jin; Cho, Sungzoon
Format:	Article
Language:	English (eng)
Subjects:	Data mining; General-use model; Keyword analysis; Keyword extraction; Text-mining
Online access:	Full text

Description
Abstract:	Keyword extraction is the foundation for solving various text mining tasks. However, the literature relies heavily on statistical, linguistic feature-based, or graph-based metrics to gauge corpus-representative keywords, a process that is sensitive to preprocessing and stopword selection. In this paper, we propose a general-use keyword extraction model designed to work with document groups of various sizes, domains, and levels of readability, with or without keyword labels. To extract a better selection of keywords, we employ a simple logistic regression model with least absolute shrinkage and selection operator (LASSO) regularization (Tibshirani, 1996). The classification-based structure of our approach ensures that the model learns words that distinctively characterize the given document group against the comparison groups, enhancing the representativeness of the extracted keywords. Furthermore, our model repeatedly modifies coefficients as it learns the document label classifiers, rather than relying directly on term frequencies, which reduces the model’s sensitivity to words of very high and very low frequency. We test our model’s performance against numerous classic keyword extraction frameworks as baseline models using online customer reviews, news articles, and patent documentation. The results indicate that our proposed method has robust performance in terms of representability and distinctiveness across document groups with varying sizes, numbers of class labels, levels of readability, and domains. Additionally, we show that our model outperforms the baseline models even when applied to documents without class labels. Given its generalizability and simplicity, we believe that our proposed model may serve as an easy-to-use yet powerful general-use tool for keyword extraction, especially when working with various groups of documents from different domains.
• We propose an all-purpose keyword extraction model that does not require keyword labels.
• Our model is scalable to corpora of any size and type, with competitive performance.
• Our method may serve as an easy-to-use yet powerful tool for keyword extraction.
ISSN:	0957-4174 1873-6793
DOI:	10.1016/j.eswa.2023.120889
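
The idea described in the abstract, training an L1-regularized (LASSO) logistic regression to separate one document group from its comparison groups and reading keywords off the learned coefficients, can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: the binary bag-of-words features, the proximal gradient solver, the hyperparameters (`lam`, `lr`, `epochs`), the toy documents, and the function names `tokenize`, `fit_l1_logistic`, and `extract_keywords`.

```python
import math

def tokenize(text):
    # Hypothetical tokenizer: lowercase, whitespace split, alphabetic tokens only.
    return [w for w in text.lower().split() if w.isalpha()]

def fit_l1_logistic(X, y, lam=0.05, lr=0.5, epochs=500):
    """L1-penalized logistic regression fitted by proximal gradient descent.

    Each epoch takes one gradient step on the mean log-loss, then applies
    soft-thresholding to the weights (the proximal step for the L1 penalty).
    The bias term is not penalized.
    """
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            z = b + sum(wj * xj for wj, xj in zip(w, xi))
            err = 1.0 / (1.0 + math.exp(-z)) - yi  # predicted prob. minus label
            grad_b += err / n
            for j, xj in enumerate(xi):
                if xj:
                    grad_w[j] += err * xj / n
        b -= lr * grad_b
        for j in range(d):
            v = w[j] - lr * grad_w[j]
            # Soft-thresholding shrinks small weights exactly to zero.
            w[j] = math.copysign(max(abs(v) - lr * lam, 0.0), v)
    return w, b

def extract_keywords(docs, labels, target, top_k=3, **fit_args):
    """Rank words by their positive coefficient for `target` vs. all other groups."""
    vocab = sorted({t for doc in docs for t in tokenize(doc)})
    index = {t: j for j, t in enumerate(vocab)}
    X = []
    for doc in docs:
        row = [0.0] * len(vocab)
        for t in set(tokenize(doc)):
            row[index[t]] = 1.0  # binary presence feature (an assumption)
        X.append(row)
    y = [1.0 if label == target else 0.0 for label in labels]
    w, _ = fit_l1_logistic(X, y, **fit_args)
    ranked = sorted(((wj, t) for t, wj in zip(vocab, w) if wj > 0), reverse=True)
    return [t for _, t in ranked[:top_k]]

# Toy corpus (invented for this sketch): two document groups.
docs = [
    "the battery life of the phone is great",
    "the phone battery drains fast",
    "battery charging is slow on the phone",
    "the pasta at the restaurant was great",
    "the restaurant serves slow but tasty pasta",
    "tasty pasta and the service is fast",
]
labels = ["phone", "phone", "phone", "food", "food", "food"]

print(extract_keywords(docs, labels, target="phone", top_k=2))
print(extract_keywords(docs, labels, target="food", top_k=1))
```

In this toy run, words that occur in every document of the target group and in none of the comparison group ("battery" and "phone", or "pasta") receive the largest coefficients, while a word shared by all documents, such as "the", is not ranked among the keywords even though no stopword list is applied. A library implementation, for example an L1-penalized logistic regression from scikit-learn, would likely be a more practical choice on a real corpus.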