Extracting semantic representations from word co-occurrence statistics: stop-lists, stemming, and SVD

In a previous article, we presented a systematic computational study of the extraction of semantic representations from the word–word co-occurrence statistics of large text corpora. The conclusion was that semantic vectors of pointwise mutual information values from very small co-occurrence windows,...

Ausführliche Beschreibung

Gespeichert in:
Bibliographische Detailangaben
Veröffentlicht in:Behavior Research Methods 2012-09, Vol.44 (3), p.890-907
Hauptverfasser: Bullinaria, John A., Levy, Joseph P.
Format: Artikel
Sprache:eng
Schlagworte:
Online-Zugang:Volltext
Tags: Tag hinzufügen
Keine Tags, Fügen Sie den ersten Tag hinzu!
Beschreibung
Zusammenfassung:In a previous article, we presented a systematic computational study of the extraction of semantic representations from the word–word co-occurrence statistics of large text corpora. The conclusion was that semantic vectors of pointwise mutual information values from very small co-occurrence windows, together with a cosine distance measure, consistently resulted in the best representations across a range of psychologically relevant semantic tasks. This article extends that study by investigating the use of three further factors—namely, the application of stop-lists, word stemming, and dimensionality reduction using singular value decomposition (SVD)—that have been used to provide improved performance elsewhere. It also introduces an additional semantic task and explores the advantages of using a much larger corpus. This leads to the discovery and analysis of improved SVD-based methods for generating semantic representations (that provide new state-of-the-art performance on a standard TOEFL task) and the identification and discussion of problems and misleading results that can arise without a full systematic study.
ISSN:1554-3528
1554-3528
DOI:10.3758/s13428-011-0183-8