Automatic keyphrase annotation of scientific documents using Wikipedia and genetic algorithms

Bibliographic details
Published in:	Journal of Information Science, 2013-06, Vol. 39 (3), p. 410-426
Main authors:	Joorabchi, Arash; Mahdi, Abdulhussain E.
Format:	Article
Language:	English
Subjects:	Annotations; Artificial intelligence; Consistency; Encyclopaedias; Exact sciences and technology; Filtering; Filtration; General aspects; Genetic algorithms; Human; Information and communication sciences; Information retrieval; Information science; Information science. Documentation; Machine learning; Readers; Sciences and techniques of general use; Semantics; Studies; Thesauri; Websites; Wikis
Online access:	Full text

Description
Abstract:	Topical annotation of documents with keyphrases is a proven method for revealing the subject of scientific and research documents to both human readers and information retrieval systems. This article describes a machine learning-based keyphrase annotation method for scientific documents that utilizes Wikipedia as a thesaurus for candidate selection from documents’ content. We have devised a set of 20 statistical, positional and semantical features for candidate phrases to capture and reflect various properties of those candidates that have the highest keyphraseness probability. We first introduce a simple unsupervised method for ranking and filtering the most probable keyphrases, and then evolve it into a novel supervised method using genetic algorithms. We have evaluated the performance of both methods on a third-party dataset of research papers. Reported experimental results show that the performance of our proposed methods, measured in terms of consistency with human annotators, is on a par with that achieved by humans and outperforms rival supervised and unsupervised methods.
ISSN:	0165-5515, 1741-6485
DOI:	10.1177/0165551512472138