A Gamma-Poisson Mixture Topic Model for Short Text

Most topic models are constructed under the assumption that documents follow a multinomial distribution. The Poisson distribution is an alternative distribution to describe the probability of count data. For topic modelling, the Poisson distribution describes the number of occurrences of a word in d...

Bibliographic details
Published in: Mathematical problems in engineering 2020, Vol.2020 (2020), p.1-17
Main authors: Mazarura, Jocelyn, de Villiers, Pieter, de Waal, Alta
Format: Article
Language: eng
Subjects:
Online access: Full text
container_end_page 17
container_issue 2020
container_start_page 1
container_title Mathematical problems in engineering
container_volume 2020
creator Mazarura, Jocelyn
de Villiers, Pieter
de Waal, Alta
description Most topic models are constructed under the assumption that documents follow a multinomial distribution. The Poisson distribution is an alternative distribution to describe the probability of count data. For topic modelling, the Poisson distribution describes the number of occurrences of a word in documents of fixed length. The Poisson distribution has been successfully applied in text classification, but its application to topic modelling is not well documented, specifically in the context of a generative probabilistic model. Furthermore, the few Poisson topic models in the literature are admixture models, making the assumption that a document is generated from a mixture of topics. In this study, we focus on short text. Many studies have shown that the simpler assumption of a mixture model fits short text better. With mixture models, as opposed to admixture models, the generative assumption is that a document is generated from a single topic. One topic model, which makes this one-topic-per-document assumption, is the Dirichlet-multinomial mixture model. The main contributions of this work are a new Gamma-Poisson mixture model, as well as a collapsed Gibbs sampler for the model. The benefit of the collapsed Gibbs sampler derivation is that the model is able to automatically select the number of topics contained in the corpus. The results show that the Gamma-Poisson mixture model performs better than the Dirichlet-multinomial mixture model at selecting the number of topics in labelled corpora. Furthermore, the Gamma-Poisson mixture produces better topic coherence scores than the Dirichlet-multinomial mixture model, thus making it a viable option for the challenging task of topic modelling of short text.
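The generative idea in the description above (one topic per document, Poisson word counts, Gamma-distributed rates) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `sample_poisson` and `generate_corpus`, the uniform choice of topic, and the hyperparameter values `shape` and `scale` are all assumptions made for illustration, and the collapsed Gibbs sampler from the article is not reproduced here.

```python
import math
import random


def sample_poisson(rate, rng):
    """Draw one Poisson(rate) variate with Knuth's multiplication method."""
    threshold = math.exp(-rate)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def generate_corpus(n_docs, n_topics, vocab_size, shape=1.0, scale=0.5, seed=0):
    """Generate word-count vectors from a toy Gamma-Poisson mixture.

    Hypothetical helper: topics are chosen uniformly at random, which is a
    simplifying assumption rather than the prior used in the article.
    """
    rng = random.Random(seed)
    # One Gamma-distributed Poisson rate per (topic, word) pair.
    rates = [[rng.gammavariate(shape, scale) for _ in range(vocab_size)]
             for _ in range(n_topics)]
    docs, labels = [], []
    for _ in range(n_docs):
        # Mixture (not admixture) assumption: exactly one topic per document.
        topic = rng.randrange(n_topics)
        counts = [sample_poisson(rates[topic][w], rng) for w in range(vocab_size)]
        docs.append(counts)
        labels.append(topic)
    return docs, labels, rates


docs, labels, rates = generate_corpus(n_docs=5, n_topics=2, vocab_size=4)
```

Each entry of `docs` is a vector of per-word counts and `labels` holds the single topic that generated each document. An admixture model would instead draw a topic for every word.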
doi_str_mv 10.1155/2020/4728095
format Article
contributor Marques, Filipe J.
orcid https://orcid.org/0000-0003-4598-0834
publisher Cairo, Egypt: Hindawi Publishing Corporation
rights Copyright © 2020 Jocelyn Mazarura et al. This is an open access article distributed under the Creative Commons Attribution License (the “License”), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License. http://creativecommons.org/licenses/by/4.0
toplevel peer_reviewed; online_resources
oa free_for_read
fulltext fulltext
identifier ISSN: 1024-123X
ispartof Mathematical problems in engineering, 2020, Vol.2020 (2020), p.1-17
issn 1024-123X
1563-5147
language eng
recordid cdi_proquest_journals_2400274512
source Wiley Online Library Open Access; EZB-FREE-00999 freely available EZB journals; Alma/SFX Local Collection
subjects Admixtures
Binomial distribution
Dirichlet problem
Modelling
Poisson distribution
Probabilistic models
Product reviews
Researchers
Statistical analysis
title A Gamma-Poisson Mixture Topic Model for Short Text
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-29T16%3A32%3A18IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_cross&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=A%20Gamma-Poisson%20Mixture%20Topic%20Model%20for%20Short%20Text&rft.jtitle=Mathematical%20problems%20in%20engineering&rft.au=Mazarura,%20Jocelyn&rft.date=2020&rft.volume=2020&rft.issue=2020&rft.spage=1&rft.epage=17&rft.pages=1-17&rft.issn=1024-123X&rft.eissn=1563-5147&rft_id=info:doi/10.1155/2020/4728095&rft_dat=%3Cproquest_cross%3E2400274512%3C/proquest_cross%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=2400274512&rft_id=info:pmid/&rfr_iscdi=true