Excavating the mother lode of human-generated text: A systematic review of research that uses the Wikipedia corpus
Highlights:
- Wikipedia provides rich, natural semi-structured texts for information retrieval.
- It provides semantic information for keyword extraction from varied texts.
- It facilitates clustering, text classification and semantic relatedness analyses.
- It supplies a semantically structured knowledge base for studying ontologies.
Published in: Information Processing & Management, 2017-03, Vol. 53(2), pp. 505–529
Authors:
Format: Article
Language: English
Subjects:
Online access: Full text
Abstract: Although primarily an encyclopedia, Wikipedia’s expansive content provides a knowledge base that has been continuously exploited by researchers in a wide variety of domains. This article systematically reviews the scholarly studies that have used Wikipedia as a data source, and investigates the means by which Wikipedia has been employed in three main computer science research areas: information retrieval, natural language processing, and ontology building. We report and discuss the research trends of the identified and examined studies. We further identify and classify a list of tools that can be used to extract data from Wikipedia, and compile a list of currently available data sets extracted from Wikipedia.
ISSN: 0306-4573, 1873-5371
DOI: 10.1016/j.ipm.2016.07.003