Feature selection for clustering categorical data with an embedded modelling approach

Research on the problem of feature selection for clustering continues to develop. This is a challenging task, mainly due to the absence of class labels to guide the search for relevant features. Categorical feature selection for clustering has rarely been addressed in the literature, with most of th...

Bibliographic details
Published in: Expert systems 2015-06, Vol.32 (3), p.444-453
Main authors: Silvestre, Cláudia; Cardoso, Margarida G. M. S.; Figueiredo, Mário
Format: Article
Language: eng
Subjects:
Online access: Full text
container_end_page 453
container_issue 3
container_start_page 444
container_title Expert systems
container_volume 32
creator Silvestre, Cláudia
Cardoso, Margarida G. M. S.
Figueiredo, Mário
description Research on the problem of feature selection for clustering continues to develop. This is a challenging task, mainly due to the absence of class labels to guide the search for relevant features. Categorical feature selection for clustering has rarely been addressed in the literature, with most of the proposed approaches having focused on numerical data. In this work, we propose an approach to simultaneously cluster categorical data and select a subset of relevant features. Our approach is based on a modification of a finite mixture model (of multinomial distributions), where a set of latent variables indicate the relevance of each feature. To estimate the model parameters, we implement a variant of the expectation‐maximization algorithm that simultaneously selects the subset of relevant features, using a minimum message length criterion. The proposed approach compares favourably with two baseline methods: a filter based on an entropy measure and a wrapper based on mutual information. The results obtained on synthetic data illustrate the ability of the proposed expectation‐maximization method to recover ground truth. An application to real data, referred to official statistics, shows its usefulness.
doi_str_mv 10.1111/exsy.12082
format Article
publisher Oxford: Blackwell Publishing Ltd
rights 2014 Wiley Publishing Ltd; Copyright © 2015 John Wiley & Sons, Ltd.
lds50 peer_reviewed
oa free_for_read
tpages 10
collection Istex; CrossRef; Computer and Information Systems Abstracts; Mechanical & Transportation Engineering Abstracts; Technology Research Database; ANTE: Abstracts in New Technology & Engineering; Engineering Research Database; ProQuest Computer Science Collection; Advanced Technologies Database with Aerospace; Computer and Information Systems Abstracts – Academic; Computer and Information Systems Abstracts Professional
fulltext fulltext
identifier ISSN: 0266-4720
ispartof Expert systems, 2015-06, Vol.32 (3), p.444-453
issn 0266-4720
1468-0394
language eng
recordid cdi_proquest_journals_1686794083
source Wiley Journals; EBSCOhost Business Source Complete
subjects Algorithms
categorical features
cluster analysis
Clustering
EM algorithm
Feature selection
finite mixtures models
Studies
title Feature selection for clustering categorical data with an embedded modelling approach
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-04T09%3A13%3A58IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_cross&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Feature%20selection%20for%20clustering%20categorical%20data%20with%20an%20embedded%20modelling%20approach&rft.jtitle=Expert%20systems&rft.au=Silvestre,%20Cl%C3%A1udia&rft.date=2015-06&rft.volume=32&rft.issue=3&rft.spage=444&rft.epage=453&rft.pages=444-453&rft.issn=0266-4720&rft.eissn=1468-0394&rft_id=info:doi/10.1111/exsy.12082&rft_dat=%3Cproquest_cross%3E3708637491%3C/proquest_cross%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=1686794083&rft_id=info:pmid/&rfr_iscdi=true
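
The abstract in the description field above outlines a finite mixture of multinomial distributions in which latent variables indicate the relevance of each feature, fitted with a variant of expectation-maximization. The following is a minimal sketch of that general idea, not the authors' implementation: it runs plain maximum-likelihood EM with a per-feature saliency and leaves out the minimum message length criterion that the paper uses to select features. The function name `em_feature_saliency`, its parameters, and the assumption that every feature is integer-coded with the same number of categories are all hypothetical and chosen only for illustration.

```python
import numpy as np


def em_feature_saliency(X, n_components, n_categories, n_iter=100, seed=0):
    """EM for a mixture of categorical distributions with feature saliencies.

    Illustrative sketch only (no minimum message length pruning).
    X: integer array of shape (n, L), values in 0..n_categories-1.
    Returns mixing weights, per-cluster category probabilities, the shared
    ("irrelevant") category probabilities, saliencies, and responsibilities.
    """
    rng = np.random.default_rng(seed)
    n, L = X.shape
    K, C = n_components, n_categories
    eps = 1e-12  # guards against log(0) and division by zero

    onehot = np.eye(C)[X]                           # (n, L, C)
    alpha = np.full(K, 1.0 / K)                     # mixing weights
    theta = rng.dirichlet(np.ones(C), size=(K, L))  # (K, L, C), cluster-specific
    q = onehot.mean(axis=0)                         # (L, C), common to all clusters
    rho = np.full(L, 0.5)                           # saliency of each feature

    for _ in range(n_iter):
        # E-step: probability of each observed category under the
        # cluster-specific and the common distribution.
        p_rel = np.einsum("nlc,klc->nkl", onehot, theta)        # (n, K, L)
        p_irr = np.einsum("nlc,lc->nl", onehot, q)[:, None, :]  # (n, 1, L)
        a = rho * p_rel
        b = (1.0 - rho) * p_irr
        c = a + b + eps

        # Responsibilities w[i, k], computed in log space for stability.
        log_w = np.log(alpha + eps) + np.log(c).sum(axis=2)
        log_w -= log_w.max(axis=1, keepdims=True)
        w = np.exp(log_w)
        w /= w.sum(axis=1, keepdims=True)

        u = w[:, :, None] * a / c   # feature treated as relevant
        v = w[:, :, None] - u       # feature treated as irrelevant

        # M-step: re-estimate all parameters from the expected counts.
        alpha = w.mean(axis=0)
        theta = np.einsum("nkl,nlc->klc", u, onehot) + eps
        theta /= theta.sum(axis=2, keepdims=True)
        q = np.einsum("nkl,nlc->lc", v, onehot) + eps
        q /= q.sum(axis=1, keepdims=True)
        rho = u.sum(axis=(0, 1)) / n

    return alpha, theta, q, rho, w
```

In this sketch a saliency near 1 suggests that a feature's distribution differs across clusters, while a value near 0 suggests the common distribution explains it about as well. Without a penalty such as the minimum message length criterion mentioned in the abstract, the saliencies are not pushed to exactly 0 or 1, so any threshold used to select features would be an additional assumption.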