Sys-TM: A Fast and General Topic Modeling System

Bibliographic details
Published in: IEEE Transactions on Knowledge and Data Engineering, 2021-06, Vol. 33 (6), p. 2790-2802
Main authors: Shao, Yingxia; Li, Xupeng; Chen, Yiru; Yu, Lele; Cui, Bin
Format: Article
Language: English
Description
Abstract: Topic models, such as LDA and its variants, are popular probabilistic models for discovering the abstract "topics" that occur in a collection of documents. However, the performance of topic models may vary a lot for different workloads, and it is not a trivial task to achieve a well-optimized implementation. In this paper, we systematically study all recently proposed samplers over LDA: AliasLDA, F+LDA, LightLDA, and WarpLDA, and discover a novel system tradeoff by considering the diversity and skewness of workloads. Then, we propose a hybrid sampler which can cleverly choose an efficient sampler with the tradeoff, and apply the hybrid sampler to LDA and its variants, including STM, TOT and CTM. Finally, we build a fast and general topic modeling system Sys-TM, which provides a unified topic modeling framework by integrating the hybrid sampler. Based on our empirical studies, the hybrid sampler outperforms the state-of-the-art samplers by up to 2× over various topic models, and with carefully engineered implementation, Sys-TM is able to outperform the existing systems by up to 10×.
ISSN: 1041-4347, 1558-2191
DOI: 10.1109/TKDE.2019.2956518