InfoCTM: A Mutual Information Maximization Perspective of Cross-Lingual Topic Modeling

Cross-lingual topic models have been widely used for cross-lingual text analysis by revealing aligned latent topics. However, most existing methods suffer from producing repetitive topics, which hinder further analysis, and from performance decline caused by low-coverage dictionaries. In this paper, we propose Cross-lingual Topic Modeling with Mutual Information (InfoCTM). Instead of the direct alignment used in previous work, we propose a topic alignment method based on mutual information. This works as a regularizer that properly aligns topics and prevents degenerate topic representations of words, mitigating the repetitive-topic issue. To address the low-coverage dictionary issue, we further propose a cross-lingual vocabulary linking method that finds more linked cross-lingual words for topic alignment beyond the translations of a given dictionary. Extensive experiments on English, Chinese, and Japanese datasets demonstrate that our method outperforms state-of-the-art baselines, producing more coherent, diverse, and well-aligned topics and showing better transferability for cross-lingual classification tasks.
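The mutual-information view of topic alignment can be made concrete with a small sketch. The snippet below is a hypothetical illustration, not InfoCTM's actual implementation: it shows a generic InfoNCE-style lower bound on mutual information used as an alignment regularizer between paired cross-lingual topic embeddings. All names here (`topic_en`, `topic_cn`, `temperature`, the embedding shapes) are assumptions for illustration only.

```python
# Hypothetical sketch: an InfoNCE-style mutual information lower bound
# used as a topic-alignment regularizer. Not InfoCTM's actual objective.
import torch
import torch.nn.functional as F

def infonce_alignment_loss(topic_en: torch.Tensor,
                           topic_cn: torch.Tensor,
                           temperature: float = 0.07) -> torch.Tensor:
    """Treat the k-th English/Chinese topic pair as a positive pair and
    all other pairings as negatives (an InfoNCE lower bound on MI)."""
    # Normalize so dot products are cosine similarities.
    en = F.normalize(topic_en, dim=-1)   # (K, d) topic embeddings
    cn = F.normalize(topic_cn, dim=-1)   # (K, d) topic embeddings
    logits = en @ cn.t() / temperature   # (K, K) similarity matrix
    targets = torch.arange(en.size(0))   # positives lie on the diagonal
    # Symmetric cross-entropy: align en -> cn and cn -> en.
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))

# Usage: add this regularizer to the topic model's reconstruction loss.
K, d = 50, 200
loss = infonce_alignment_loss(torch.randn(K, d), torch.randn(K, d))
```

Minimizing this loss pulls each topic toward its cross-lingual counterpart while pushing apart mismatched topics, which is one standard way such a regularizer discourages repetitive, degenerate topic representations.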

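The vocabulary-linking idea can likewise be sketched. The following is a generic, hypothetical expansion of a seed bilingual dictionary via cross-lingual word-embedding similarity; InfoCTM's own linking criterion is defined in the paper, and `expand_links`, `threshold`, and the embedding inputs are illustrative assumptions.

```python
# Hypothetical sketch: expanding a seed bilingual dictionary with extra
# cross-lingual word links via embedding similarity. Illustrates the
# general idea only, not InfoCTM's actual linking method.
import numpy as np

def expand_links(emb_src, emb_tgt, dictionary, threshold=0.6):
    """emb_src/emb_tgt: {word: unit-norm vector}; dictionary: a set of
    (src_word, tgt_word) pairs from a seed bilingual dictionary."""
    src_words = list(emb_src)
    tgt_words = list(emb_tgt)
    S = np.stack([emb_src[w] for w in src_words])  # (Vs, d)
    T = np.stack([emb_tgt[w] for w in tgt_words])  # (Vt, d)
    sim = S @ T.T                                  # cosine similarities
    links = set(dictionary)
    for i, sw in enumerate(src_words):
        j = int(sim[i].argmax())                   # nearest target word
        if sim[i, j] >= threshold:
            links.add((sw, tgt_words[j]))          # add a new linked pair
    return links
```

Links found this way go beyond the translations of the given dictionary, which is how the abstract's low-coverage issue would be mitigated.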

Bibliographic Details
Main authors: Wu, Xiaobao; Dong, Xinshuai; Nguyen, Thong; Liu, Chaoqun; Pan, Liangming; Luu, Anh Tuan
Format: Article
Language: eng
Subjects: Computer Science - Computation and Language
Online access: Order full text
DOI: 10.48550/arxiv.2304.03544
Source: arXiv.org