InfoCTM: A Mutual Information Maximization Perspective of Cross-Lingual Topic Modeling
Saved in:
| Main authors: | , , , , , |
|---|---|
| Format: | Article |
| Language: | eng |
| Subjects: | |
| Online access: | Order full text |
| Abstract: | Cross-lingual topic models have been prevalent in cross-lingual text analysis, revealing aligned latent topics. However, most existing methods suffer from producing repetitive topics, which hinder further analysis, and from performance decline caused by low-coverage dictionaries. In this paper, we propose Cross-lingual Topic Modeling with Mutual Information (InfoCTM). Instead of the direct alignment used in previous work, we propose a topic alignment method based on mutual information. It works as a regularization that properly aligns topics and prevents degenerate topic representations of words, mitigating the repetitive-topic issue. To address the low-coverage dictionary issue, we further propose a cross-lingual vocabulary linking method that finds more linked cross-lingual words for topic alignment beyond the translations of a given dictionary. Extensive experiments on English, Chinese, and Japanese datasets demonstrate that our method outperforms state-of-the-art baselines, producing more coherent, diverse, and well-aligned topics and showing better transferability on cross-lingual classification tasks. |
| DOI: | 10.48550/arxiv.2304.03544 |
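The abstract's first component, topic alignment via mutual information, can be illustrated with a short sketch. The PyTorch code below is a minimal, hypothetical rendering of the idea, assuming each word has a topic representation (a row of a topic-word matrix) and using a generic InfoNCE-style contrastive bound on mutual information; the function name, tensor shapes, and temperature are illustrative assumptions, not the paper's exact objective.

```python
import torch
import torch.nn.functional as F

def mi_alignment_loss(phi_src, phi_tgt, links, temperature=0.1):
    """InfoNCE-style lower bound on the mutual information between topic
    representations of linked cross-lingual word pairs (illustrative sketch).

    phi_src: (V_src, K) topic representations of source-language words
    phi_tgt: (V_tgt, K) topic representations of target-language words
    links:   (N, 2) long tensor of (src_index, tgt_index) linked pairs
    Returns a scalar loss to minimize (the negated bound).
    """
    src = F.normalize(phi_src[links[:, 0]], dim=-1)   # (N, K)
    tgt = F.normalize(phi_tgt[links[:, 1]], dim=-1)   # (N, K)
    logits = src @ tgt.t() / temperature              # pairwise similarities
    targets = torch.arange(links.size(0), device=logits.device)
    # Each linked pair is pulled together while all other in-batch words act
    # as negatives; these contrastive negatives are what discourage the
    # degenerate (collapsed, repetitive) topic representations the abstract
    # mentions.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```

Minimizing such a term alongside the topic model's usual reconstruction objective acts as the kind of regularization the abstract describes: linked words are pushed toward similar topic representations without forcing all words onto the same topics.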
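The second component, cross-lingual vocabulary linking, extends the set of linked pairs beyond the dictionary. The following is a hypothetical sketch under the assumption that pretrained cross-lingual word embeddings are available; the similarity threshold and the linking rule are illustrative, not the paper's actual procedure.

```python
import torch
import torch.nn.functional as F

def link_vocabularies(emb_src, emb_tgt, dict_links, threshold=0.5):
    """Hypothetical vocabulary linking: keep the dictionary's translation
    pairs and additionally link any word pair whose cross-lingual
    embeddings are sufficiently similar.

    emb_src: (V_src, D) cross-lingual embeddings of source-language words
    emb_tgt: (V_tgt, D) cross-lingual embeddings of target-language words
    dict_links: (M, 2) long tensor of dictionary translation pairs
    Returns an (N, 2) tensor of linked pairs with N >= M.
    """
    sims = F.normalize(emb_src, dim=-1) @ F.normalize(emb_tgt, dim=-1).t()
    extra = torch.nonzero(sims >= threshold)   # (num_extra, 2) index pairs
    return torch.unique(torch.cat([dict_links, extra], dim=0), dim=0)
```

The output of this step would feed the `links` argument of `mi_alignment_loss` above, so that words missing from a low-coverage dictionary can still participate in topic alignment.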