Extracting semantic representations from word co-occurrence statistics: stop-lists, stemming, and SVD
In a previous article, we presented a systematic computational study of the extraction of semantic representations from the word–word co-occurrence statistics of large text corpora. The conclusion was that semantic vectors of pointwise mutual information values from very small co-occurrence windows,...
Gespeichert in:
Veröffentlicht in: | Behavior Research Methods 2012-09, Vol.44 (3), p.890-907 |
---|---|
Hauptverfasser: | , |
Format: | Artikel |
Sprache: | eng |
Schlagworte: | |
Online-Zugang: | Volltext |
Tags: |
Tag hinzufügen
Keine Tags, Fügen Sie den ersten Tag hinzu!
|
Zusammenfassung: | In a previous article, we presented a systematic computational study of the extraction of semantic representations from the word–word co-occurrence statistics of large text corpora. The conclusion was that semantic vectors of pointwise mutual information values from very small co-occurrence windows, together with a cosine distance measure, consistently resulted in the best representations across a range of psychologically relevant semantic tasks. This article extends that study by investigating the use of three further factors—namely, the application of stop-lists, word stemming, and dimensionality reduction using singular value decomposition (SVD)—that have been used to provide improved performance elsewhere. It also introduces an additional semantic task and explores the advantages of using a much larger corpus. This leads to the discovery and analysis of improved SVD-based methods for generating semantic representations (that provide new state-of-the-art performance on a standard TOEFL task) and the identification and discussion of problems and misleading results that can arise without a full systematic study. |
---|---|
ISSN: | 1554-3528 1554-3528 |
DOI: | 10.3758/s13428-011-0183-8 |