Feature selection for clustering categorical data with an embedded modelling approach

Research on the problem of feature selection for clustering continues to develop. This is a challenging task, mainly due to the absence of class labels to guide the search for relevant features. Categorical feature selection for clustering has rarely been addressed in the literature, with most of th...

Bibliographic details
Published in:	Expert systems 2015-06, Vol.32 (3), p.444-453
Main authors:	Silvestre, Cláudia; Cardoso, Margarida G. M. S.; Figueiredo, Mário
Format:	Article
Language:	eng
Subjects:	Algorithms; categorical features; cluster analysis; Clustering; EM algorithm; Feature selection; finite mixtures models; Studies
Online access:	Full text

container_end_page	453
container_issue	3
container_start_page	444
container_title	Expert systems
container_volume	32
creator	Silvestre, Cláudia; Cardoso, Margarida G. M. S.; Figueiredo, Mário
description	Research on the problem of feature selection for clustering continues to develop. This is a challenging task, mainly due to the absence of class labels to guide the search for relevant features. Categorical feature selection for clustering has rarely been addressed in the literature, with most of the proposed approaches having focused on numerical data. In this work, we propose an approach to simultaneously cluster categorical data and select a subset of relevant features. Our approach is based on a modification of a finite mixture model (of multinomial distributions), where a set of latent variables indicates the relevance of each feature. To estimate the model parameters, we implement a variant of the expectation‐maximization algorithm that simultaneously selects the subset of relevant features, using a minimum message length criterion. The proposed approach compares favourably with two baseline methods: a filter based on an entropy measure and a wrapper based on mutual information. The results obtained on synthetic data illustrate the ability of the proposed expectation‐maximization method to recover ground truth. An application to real data from official statistics shows its usefulness.
doi_str_mv	10.1111/exsy.12082
format	Article
fullrecord	<record><control><sourceid>proquest_cross</sourceid><recordid>TN_cdi_proquest_journals_1686794083</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><sourcerecordid>3708637491</sourcerecordid><originalsourceid>FETCH-LOGICAL-c4002-12ca3d1d63c51ebc26b9b5e67ccdaa584efe6c9d583ccf44508f5f977bd2f4643</originalsourceid><addsrcrecordid>eNp9kEtLw0AUhYMoWKsbf8GAOyF1XplJllJqKxQVfFTdDJOZmzY1TepMQtt_b2rUpXdzN985B74gOCd4QNq7gq3fDQjFMT0IeoSLOMQs4YdBD1MhQi4pPg5OvF9ijImUohc834CuGwfIQwGmzqsSZZVDpmh8DS4v58joGuaVy40ukNW1Rpu8XiBdIlilYC1YtKosFMWe1eu1q7RZnAZHmS48nP38frszehpOwun9-HZ4PQ0Nx5iGhBrNLLGCmYhAaqhIkzQCIY2xWkcxhwyESWwUM2MyziMcZ1GWSJlamnHBWT-46Hrb2c8GfK2WVePKdlIREQuZcByzlrrsKOMq7x1kau3ylXY7RbDaa1N7bepbWwuTDt7kBez-IdXo9fHtNxN2mbyVtv3LaPehhGQyUrO7sYriCZm9PxD1wr4AwJSAjA</addsrcrecordid><sourcetype>Aggregation Database</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>1686794083</pqid></control><display><type>article</type><title>Feature selection for clustering categorical data with an embedded modelling approach</title><source>Wiley Journals</source><source>EBSCOhost Business Source Complete</source><creator>Silvestre, Cláudia ; Cardoso, Margarida G. M. S. ; Figueiredo, Mário</creator><creatorcontrib>Silvestre, Cláudia ; Cardoso, Margarida G. M. S. ; Figueiredo, Mário</creatorcontrib><description>Research on the problem of feature selection for clustering continues to develop. This is a challenging task, mainly due to the absence of class labels to guide the search for relevant features. Categorical feature selection for clustering has rarely been addressed in the literature, with most of the proposed approaches having focused on numerical data. In this work, we propose an approach to simultaneously cluster categorical data and select a subset of relevant features. Our approach is based on a modification of a finite mixture model (of multinomial distributions), where a set of latent variables indicate the relevance of each feature. 
To estimate the model parameters, we implement a variant of the expectation‐maximization algorithm that simultaneously selects the subset of relevant features, using a minimum message length criterion. The proposed approach compares favourably with two baseline methods: a filter based on an entropy measure and a wrapper based on mutual information. The results obtained on synthetic data illustrate the ability of the proposed expectation‐maximization method to recover ground truth. An application to real data, referred to official statistics, shows its usefulness.</description><identifier>ISSN: 0266-4720</identifier><identifier>EISSN: 1468-0394</identifier><identifier>DOI: 10.1111/exsy.12082</identifier><language>eng</language><publisher>Oxford: Blackwell Publishing Ltd</publisher><subject>Algorithms ; categorical features ; cluster analysis ; Clustering ; EM algorithm ; Feature selection ; finite mixtures models ; Studies</subject><ispartof>Expert systems, 2015-06, Vol.32 (3), p.444-453</ispartof><rights>2014 Wiley Publishing Ltd</rights><rights>Copyright © 2015 John Wiley & Sons, Ltd.</rights><lds50>peer_reviewed</lds50><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed><citedby>FETCH-LOGICAL-c4002-12ca3d1d63c51ebc26b9b5e67ccdaa584efe6c9d583ccf44508f5f977bd2f4643</citedby><cites>FETCH-LOGICAL-c4002-12ca3d1d63c51ebc26b9b5e67ccdaa584efe6c9d583ccf44508f5f977bd2f4643</cites></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><linktopdf>$$Uhttps://onlinelibrary.wiley.com/doi/pdf/10.1111%2Fexsy.12082$$EPDF$$P50$$Gwiley$$H</linktopdf><linktohtml>$$Uhttps://onlinelibrary.wiley.com/doi/full/10.1111%2Fexsy.12082$$EHTML$$P50$$Gwiley$$H</linktohtml><link.rule.ids>314,780,784,1417,27924,27925,45574,45575</link.rule.ids></links><search><creatorcontrib>Silvestre, Cláudia</creatorcontrib><creatorcontrib>Cardoso, Margarida G. M. 
S.</creatorcontrib><creatorcontrib>Figueiredo, Mário</creatorcontrib><title>Feature selection for clustering categorical data with an embedded modelling approach</title><title>Expert systems</title><addtitle>Expert Systems</addtitle><description>Research on the problem of feature selection for clustering continues to develop. This is a challenging task, mainly due to the absence of class labels to guide the search for relevant features. Categorical feature selection for clustering has rarely been addressed in the literature, with most of the proposed approaches having focused on numerical data. In this work, we propose an approach to simultaneously cluster categorical data and select a subset of relevant features. Our approach is based on a modification of a finite mixture model (of multinomial distributions), where a set of latent variables indicate the relevance of each feature. To estimate the model parameters, we implement a variant of the expectation‐maximization algorithm that simultaneously selects the subset of relevant features, using a minimum message length criterion. The proposed approach compares favourably with two baseline methods: a filter based on an entropy measure and a wrapper based on mutual information. The results obtained on synthetic data illustrate the ability of the proposed expectation‐maximization method to recover ground truth. 
An application to real data, referred to official statistics, shows its usefulness.</description><subject>Algorithms</subject><subject>categorical features</subject><subject>cluster analysis</subject><subject>Clustering</subject><subject>EM algorithm</subject><subject>Feature selection</subject><subject>finite mixtures models</subject><subject>Studies</subject><issn>0266-4720</issn><issn>1468-0394</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2015</creationdate><recordtype>article</recordtype><recordid>eNp9kEtLw0AUhYMoWKsbf8GAOyF1XplJllJqKxQVfFTdDJOZmzY1TepMQtt_b2rUpXdzN985B74gOCd4QNq7gq3fDQjFMT0IeoSLOMQs4YdBD1MhQi4pPg5OvF9ijImUohc834CuGwfIQwGmzqsSZZVDpmh8DS4v58joGuaVy40ukNW1Rpu8XiBdIlilYC1YtKosFMWe1eu1q7RZnAZHmS48nP38frszehpOwun9-HZ4PQ0Nx5iGhBrNLLGCmYhAaqhIkzQCIY2xWkcxhwyESWwUM2MyziMcZ1GWSJlamnHBWT-46Hrb2c8GfK2WVePKdlIREQuZcByzlrrsKOMq7x1kau3ylXY7RbDaa1N7bepbWwuTDt7kBez-IdXo9fHtNxN2mbyVtv3LaPehhGQyUrO7sYriCZm9PxD1wr4AwJSAjA</recordid><startdate>201506</startdate><enddate>201506</enddate><creator>Silvestre, Cláudia</creator><creator>Cardoso, Margarida G. M. S.</creator><creator>Figueiredo, Mário</creator><general>Blackwell Publishing Ltd</general><scope>BSCLL</scope><scope>AAYXX</scope><scope>CITATION</scope><scope>7SC</scope><scope>7TB</scope><scope>8FD</scope><scope>F28</scope><scope>FR3</scope><scope>JQ2</scope><scope>L7M</scope><scope>L~C</scope><scope>L~D</scope></search><sort><creationdate>201506</creationdate><title>Feature selection for clustering categorical data with an embedded modelling approach</title><author>Silvestre, Cláudia ; Cardoso, Margarida G. M. S. 
; Figueiredo, Mário</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-c4002-12ca3d1d63c51ebc26b9b5e67ccdaa584efe6c9d583ccf44508f5f977bd2f4643</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2015</creationdate><topic>Algorithms</topic><topic>categorical features</topic><topic>cluster analysis</topic><topic>Clustering</topic><topic>EM algorithm</topic><topic>Feature selection</topic><topic>finite mixtures models</topic><topic>Studies</topic><toplevel>peer_reviewed</toplevel><toplevel>online_resources</toplevel><creatorcontrib>Silvestre, Cláudia</creatorcontrib><creatorcontrib>Cardoso, Margarida G. M. S.</creatorcontrib><creatorcontrib>Figueiredo, Mário</creatorcontrib><collection>Istex</collection><collection>CrossRef</collection><collection>Computer and Information Systems Abstracts</collection><collection>Mechanical & Transportation Engineering Abstracts</collection><collection>Technology Research Database</collection><collection>ANTE: Abstracts in New Technology & Engineering</collection><collection>Engineering Research Database</collection><collection>ProQuest Computer Science Collection</collection><collection>Advanced Technologies Database with Aerospace</collection><collection>Computer and Information Systems Abstracts Academic</collection><collection>Computer and Information Systems Abstracts Professional</collection><jtitle>Expert systems</jtitle></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>Silvestre, Cláudia</au><au>Cardoso, Margarida G. M. 
S.</au><au>Figueiredo, Mário</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>Feature selection for clustering categorical data with an embedded modelling approach</atitle><jtitle>Expert systems</jtitle><addtitle>Expert Systems</addtitle><date>2015-06</date><risdate>2015</risdate><volume>32</volume><issue>3</issue><spage>444</spage><epage>453</epage><pages>444-453</pages><issn>0266-4720</issn><eissn>1468-0394</eissn><abstract>Research on the problem of feature selection for clustering continues to develop. This is a challenging task, mainly due to the absence of class labels to guide the search for relevant features. Categorical feature selection for clustering has rarely been addressed in the literature, with most of the proposed approaches having focused on numerical data. In this work, we propose an approach to simultaneously cluster categorical data and select a subset of relevant features. Our approach is based on a modification of a finite mixture model (of multinomial distributions), where a set of latent variables indicate the relevance of each feature. To estimate the model parameters, we implement a variant of the expectation‐maximization algorithm that simultaneously selects the subset of relevant features, using a minimum message length criterion. The proposed approach compares favourably with two baseline methods: a filter based on an entropy measure and a wrapper based on mutual information. The results obtained on synthetic data illustrate the ability of the proposed expectation‐maximization method to recover ground truth. An application to real data, referred to official statistics, shows its usefulness.</abstract><cop>Oxford</cop><pub>Blackwell Publishing Ltd</pub><doi>10.1111/exsy.12082</doi><tpages>10</tpages><oa>free_for_read</oa></addata></record>
fulltext	fulltext
identifier	ISSN: 0266-4720
ispartof	Expert systems, 2015-06, Vol.32 (3), p.444-453
issn	0266-4720; 1468-0394
language	eng
recordid	cdi_proquest_journals_1686794083
source	Wiley Journals; EBSCOhost Business Source Complete
subjects	Algorithms; categorical features; cluster analysis; Clustering; EM algorithm; Feature selection; finite mixtures models; Studies
title	Feature selection for clustering categorical data with an embedded modelling approach
url	https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-04T09%3A13%3A58IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_cross&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Feature%20selection%20for%20clustering%20categorical%20data%20with%20an%20embedded%20modelling%20approach&rft.jtitle=Expert%20systems&rft.au=Silvestre,%20Cl%C3%A1udia&rft.date=2015-06&rft.volume=32&rft.issue=3&rft.spage=444&rft.epage=453&rft.pages=444-453&rft.issn=0266-4720&rft.eissn=1468-0394&rft_id=info:doi/10.1111/exsy.12082&rft_dat=%3Cproquest_cross%3E3708637491%3C/proquest_cross%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=1686794083&rft_id=info:pmid/&rfr_iscdi=true
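
The description field of this record outlines a finite mixture of multinomial distributions in which latent variables mark each feature as relevant or not, fitted with a variant of expectation-maximization. The following is a minimal Python sketch of that general idea, not the authors' implementation. It assumes integer-coded categories with the same number of levels per feature, uses plain maximum-likelihood updates, and omits the minimum message length criterion that the article uses to select features. The function name `em_feature_saliency`, its parameters, and the synthetic data are all hypothetical.

```python
import numpy as np

def em_feature_saliency(X, n_clusters, n_categories, n_iter=50, seed=0):
    """Illustrative EM for a multinomial mixture with per-feature saliency.

    Assumption: X is an (n, L) integer array with values in 0..n_categories-1.
    rho[l] is the estimated probability that feature l is cluster-dependent.
    No minimum message length penalty is applied (simplification).
    """
    rng = np.random.default_rng(seed)
    n, n_feat = X.shape
    onehot = np.eye(n_categories)[X]                  # (n, L, C)
    alpha = np.full(n_clusters, 1.0 / n_clusters)     # mixing proportions
    theta = rng.dirichlet(np.ones(n_categories), size=(n_clusters, n_feat))
    common = onehot.mean(axis=0)                      # cluster-independent (L, C)
    rho = np.full(n_feat, 0.5)                        # feature saliencies
    eps = 1e-12
    for _ in range(n_iter):
        # E-step: split each feature's likelihood into a cluster-specific
        # part (a) and a common part (b), then compute responsibilities.
        p_rel = np.einsum('ilc,klc->ikl', onehot, theta)
        p_irr = np.einsum('ilc,lc->il', onehot, common)[:, None, :]
        a = rho * p_rel
        b = (1.0 - rho) * p_irr
        c = a + b + eps
        log_w = np.log(alpha + eps) + np.log(c).sum(axis=2)
        log_w -= log_w.max(axis=1, keepdims=True)     # numerical stability
        w = np.exp(log_w)
        w /= w.sum(axis=1, keepdims=True)             # (n, K)
        u = (a / c) * w[:, :, None]                   # "feature is relevant"
        v = w[:, :, None] - u                         # "feature is irrelevant"
        # M-step: weighted category frequencies.
        alpha = w.mean(axis=0)
        theta = np.einsum('ikl,ilc->klc', u, onehot) + eps
        theta /= theta.sum(axis=2, keepdims=True)
        common = np.einsum('ikl,ilc->lc', v, onehot) + eps
        common /= common.sum(axis=1, keepdims=True)
        rho = u.sum(axis=(0, 1)) / n
    return alpha, theta, common, rho, w

# Hypothetical synthetic data: two binary features follow the cluster
# label (with 10% noise), two binary features are pure noise.
data_rng = np.random.default_rng(1)
z = data_rng.integers(0, 2, size=200)
flip = data_rng.random((200, 2)) < 0.1
X_rel = np.where(flip, 1 - z[:, None], z[:, None])
X_irr = data_rng.integers(0, 2, size=(200, 2))
X = np.hstack([X_rel, X_irr])

alpha, theta, common, rho, resp = em_feature_saliency(X, n_clusters=2, n_categories=2)
```

Without a penalty such as minimum message length, the saliencies in `rho` are not driven to exactly zero for uninformative features, so this sketch illustrates only the model structure and the EM updates, not the feature selection step itself.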