Word Embeddings as Statistical Estimators

Bibliographic Details
Published in: Sankhyā. Series B (2008), 2024-11, Vol. 86 (2), pp. 415–441
Main authors: Dey, Neil; Singer, Matthew; Williams, Jonathan P.; Sengupta, Srijan
Format: Article
Language: English
Online access: Full text
Description
Abstract: Word embeddings are a fundamental tool in natural language processing. Currently, word embedding methods are evaluated on the basis of empirical performance on benchmark data sets, and there is a lack of rigorous understanding of their theoretical properties. This paper studies word embeddings from a statistical theoretical perspective, which is essential for formal inference and uncertainty quantification. We propose a copula-based statistical model for text data and show that under this model, the now-classical Word2Vec method can be interpreted as a statistical estimation method for estimating the theoretical pointwise mutual information (PMI). We further illustrate the utility of this statistical model by using it to develop a missing value-based estimator as a statistically tractable and interpretable alternative to the Word2Vec approach. The estimation error of this estimator is comparable to Word2Vec and improves upon the truncation-based method proposed by Levy and Goldberg (Adv. Neural Inf. Process. Syst., 27, 2177–2185, 2014). The resulting estimator is also comparable to Word2Vec in a benchmark sentiment analysis task on the IMDb Movie Reviews data set and a part-of-speech tagging task on the OntoNotes data set.
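As background for the abstract's central quantity, a minimal sketch of empirical pointwise mutual information, PMI(w, c) = log[ p(w, c) / (p(w) p(c)) ], estimated from co-occurrence counts. The toy corpus and the sentence-level counting scheme here are illustrative assumptions, not the paper's copula-based or missing value-based estimators:

```python
from collections import Counter
from itertools import combinations
import math

# Toy corpus (hypothetical); each sentence is a list of tokens.
corpus = [
    ["the", "cat", "sat"],
    ["the", "cat", "slept"],
    ["the", "dog", "sat"],
]

# Count unigrams and unordered within-sentence co-occurrence pairs.
word_counts = Counter()
pair_counts = Counter()
for sentence in corpus:
    word_counts.update(sentence)
    for w, c in combinations(sentence, 2):
        pair_counts[frozenset((w, c))] += 1

n_words = sum(word_counts.values())
n_pairs = sum(pair_counts.values())

def pmi(w, c):
    """Empirical PMI; -inf when the pair never co-occurs."""
    p_wc = pair_counts[frozenset((w, c))] / n_pairs
    if p_wc == 0:
        return float("-inf")
    return math.log(p_wc / ((word_counts[w] / n_words) * (word_counts[c] / n_words)))

print(round(pmi("the", "cat"), 3))  # → 1.099
```

The -inf for unseen pairs illustrates the sparsity problem that motivates the paper's missing value-based alternative to simple truncation.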
ISSN: 0976-8386, 0976-8394
DOI: 10.1007/s13571-024-00331-1