An Analysis of Lemmatization on Topic Models of Morphologically Rich Language
Main authors: | , , |
---|---|
Format: | Article |
Language: | English |
Subjects: | |
Summary: | Topic models are typically represented by top-$m$ word lists for human
interpretation. The corpus is often pre-processed with lemmatization (or
stemming) so that those representations are not undermined by a proliferation
of words with similar meanings, but there is little public work on the effects
of that pre-processing. Recent work studied the effect of stemming on topic
models of English texts and found no supporting evidence for the practice. We
study the effect of lemmatization on topic models of Russian Wikipedia
articles, finding in one configuration that it significantly improves
interpretability according to a word intrusion metric. We conclude that
lemmatization may benefit topic models on morphologically rich languages, but
that further investigation is needed. |
---|---|
DOI: | 10.48550/arxiv.1608.03995 |