Using backward elimination with a new model order reduction algorithm to select best double mixture model for document clustering

Probabilistic latent semantic analysis (PLSA) is a double-structure mixture model that is widely used in text and web mining. The method uses a number of latent variables to establish hidden semantic relations among the observed features. In this approach, the selection...
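For orientation, the mixture model the abstract refers to is usually written in the following standard form from the PLSA literature. This is a sketch of the common symmetric formulation, not necessarily the notation used in the article; the symbols K (number of latent variables) and n(d, w) (count of word w in document d) are chosen here for illustration.

```latex
% Symmetric PLSA: document d and word w are linked through a latent variable z
P(d, w) = \sum_{z=1}^{K} P(z)\, P(d \mid z)\, P(w \mid z)

% Log-likelihood of the observed document-word counts n(d, w)
\mathcal{L} = \sum_{d} \sum_{w} n(d, w)\, \log P(d, w)
```

The number of latent variables K is the model order that the article sets out to select.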

Bibliographic details
Published in:	Expert systems with applications 2009-09, Vol.36 (7), p.10485-10493
Main authors:	Azadi, Tahereh Emami; Almasganj, Farshad
Format:	Article
Language:	eng
Subjects:	Bayesian information criterion (BIC); Document clustering; EM algorithm; Model selection; PLSA
Online access:	Full text

container_end_page	10493
container_issue	7
container_start_page	10485
container_title	Expert systems with applications
container_volume	36
creator	Azadi, Tahereh Emami; Almasganj, Farshad
description	Probabilistic latent semantic analysis (PLSA) is a double-structure mixture model that is widely used in text and web mining. The method uses a number of latent variables to establish hidden semantic relations among the observed features. In this approach, selecting the correct number of latent variables is critical. In most previous research, the number of latent topics was chosen based on the number of invoked classes. This paper presents a method, based on a backward elimination approach, that is capable of unsupervised order selection in PLSA. The method starts with a model that has more components than needed and then prunes the mixtures to reach their optimum size. During the elimination process, the central problem is choosing which latent variables to delete, and this choice directly affects the final performance of the pruned model. To address this problem, we introduce a new combined pruning method that selects the best candidates for removal while keeping the overall computational cost low. We conducted experiments on two datasets from the Reuters-21578 corpus. The results show that this algorithm leads to an optimized number of latent variables and in turn achieves better clustering performance than conventional model selection methods. It also outperforms a PLSA model with a fixed number of latent variables equal to the true number of clusters.
doi_str_mv	10.1016/j.eswa.2009.01.068
format	Article
publisher	Elsevier Ltd
rights	2009 Elsevier Ltd
eissn	1873-6793
date	2009-09-01
genre	article
tpages	9
lds50	peer_reviewed
els_id	S0957417409000694
pqid	34454007
collection	CrossRef; Computer and Information Systems Abstracts; Technology Research Database; ProQuest Computer Science Collection; Advanced Technologies Database with Aerospace; Computer and Information Systems Abstracts Academic; Computer and Information Systems Abstracts Professional
fulltext	fulltext
identifier	ISSN: 0957-4174
ispartof	Expert systems with applications, 2009-09, Vol.36 (7), p.10485-10493
issn	0957-4174 1873-6793
language	eng
recordid	cdi_proquest_miscellaneous_903649567
source	ScienceDirect Journals (5 years ago - present)
subjects	Bayesian information criterion (BIC); Document clustering; EM algorithm; Model selection; PLSA
title	Using backward elimination with a new model order reduction algorithm to select best double mixture model for document clustering
url	https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-06T17%3A49%3A11IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_cross&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Using%20backward%20elimination%20with%20a%20new%20model%20order%20reduction%20algorithm%20to%20select%20best%20double%20mixture%20model%20for%20document%20clustering&rft.jtitle=Expert%20systems%20with%20applications&rft.au=Azadi,%20Tahereh%20Emami&rft.date=2009-09-01&rft.volume=36&rft.issue=7&rft.spage=10485&rft.epage=10493&rft.pages=10485-10493&rft.issn=0957-4174&rft.eissn=1873-6793&rft_id=info:doi/10.1016/j.eswa.2009.01.068&rft_dat=%3Cproquest_cross%3E34454007%3C/proquest_cross%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=34454007&rft_id=info:pmid/&rft_els_id=S0957417409000694&rfr_iscdi=true
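
The description field above outlines a backward-elimination scheme: fit a PLSA model with more components than needed, prune, and keep the best-scoring size. A minimal sketch of that general idea follows, under several assumptions that are not taken from the article. The function names (`fit_plsa`, `bic`, `backward_eliminate`) are hypothetical. The pruning rule used here (drop the component with the smallest prior P(z)) is a simple stand-in and is not the combined pruning method the authors propose. BIC is computed with the total token count as the sample size, which is one common choice among several.

```python
import numpy as np


def fit_plsa(counts, p_z, p_d_z, p_w_z, n_iter=50, eps=1e-12):
    """Run EM for symmetric PLSA, starting from the given parameters."""
    for _ in range(n_iter):
        # E-step: responsibilities P(z | d, w), shape (K, D, W).
        # The dense 3-D array is only practical for small demo matrices.
        joint = p_z[:, None, None] * p_d_z[:, :, None] * p_w_z[:, None, :]
        resp = joint / (joint.sum(axis=0, keepdims=True) + eps)
        # M-step: re-estimate parameters from expected counts.
        weighted = resp * counts[None, :, :]
        totals = weighted.sum(axis=(1, 2))
        p_d_z = weighted.sum(axis=2) / (totals[:, None] + eps)
        p_w_z = weighted.sum(axis=1) / (totals[:, None] + eps)
        p_z = totals / (totals.sum() + eps)
    return p_z, p_d_z, p_w_z


def log_likelihood(counts, p_z, p_d_z, p_w_z, eps=1e-12):
    """Log-likelihood sum_{d,w} n(d,w) log P(d,w)."""
    p_dw = np.einsum("k,kd,kw->dw", p_z, p_d_z, p_w_z)
    return float((counts * np.log(p_dw + eps)).sum())


def bic(counts, p_z, p_d_z, p_w_z):
    """BIC = -2 log L + (free parameters) * log N; lower is better."""
    k = len(p_z)
    n_docs, n_words = counts.shape
    n_params = (k - 1) + k * (n_docs - 1) + k * (n_words - 1)
    # Assumption: N is the total number of observed tokens.
    ll = log_likelihood(counts, p_z, p_d_z, p_w_z)
    return -2.0 * ll + n_params * np.log(counts.sum())


def backward_eliminate(counts, k_max, seed=0):
    """Start with k_max components, prune one at a time, score each size."""
    rng = np.random.default_rng(seed)
    n_docs, n_words = counts.shape
    p_z = np.full(k_max, 1.0 / k_max)
    p_d_z = rng.random((k_max, n_docs))
    p_d_z /= p_d_z.sum(axis=1, keepdims=True)
    p_w_z = rng.random((k_max, n_words))
    p_w_z /= p_w_z.sum(axis=1, keepdims=True)

    history = {}
    while True:
        # Refit, warm-starting from the surviving components.
        p_z, p_d_z, p_w_z = fit_plsa(counts, p_z, p_d_z, p_w_z)
        history[len(p_z)] = bic(counts, p_z, p_d_z, p_w_z)
        if len(p_z) == 1:
            break
        # Stand-in pruning rule: remove the component with the smallest prior.
        keep = np.arange(len(p_z)) != int(np.argmin(p_z))
        p_z = p_z[keep] / p_z[keep].sum()
        p_d_z, p_w_z = p_d_z[keep], p_w_z[keep]

    best_k = min(history, key=history.get)
    return best_k, history
```

Calling `backward_eliminate(counts, k_max=4)` on a small document-by-word count matrix returns the component count with the lowest BIC together with the BIC recorded at each size. Warm-starting each refit from the surviving components is what keeps a backward pass cheaper than fitting every model size from scratch.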