Excavating the mother lode of human-generated text: A systematic review of research that uses the Wikipedia corpus
Highlights:
- Wikipedia provides rich, natural semi-structured texts for information retrieval.
- It provides semantic information for keyword extraction from varied texts.
- It facilitates clustering, text classification and semantic relatedness analyses.
- It supplies a semantically structured knowledge base for studying ontologies.
Published in: Information Processing & Management, 2017-03, Vol. 53(2), pp. 505–529
Authors:
Format: Article
Language: English
Subjects:
Online access: Full text
Abstract: Although primarily an encyclopedia, Wikipedia’s expansive content provides a knowledge base that has been continuously exploited by researchers in a wide variety of domains. This article systematically reviews the scholarly studies that have used Wikipedia as a data source, and investigates the means by which Wikipedia has been employed in three main computer science research areas: information retrieval, natural language processing, and ontology building. We report and discuss the research trends of the identified and examined studies. We further identify and classify a list of tools that can be used to extract data from Wikipedia, and compile a list of currently available data sets extracted from Wikipedia.
ISSN: 0306-4573, 1873-5371
DOI: 10.1016/j.ipm.2016.07.003