Feature selection for clustering categorical data with an embedded modelling approach
Research on the problem of feature selection for clustering continues to develop. This is a challenging task, mainly due to the absence of class labels to guide the search for relevant features. Categorical feature selection for clustering has rarely been addressed in the literature, with most of th...
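Published in: | Expert systems 2015-06, Vol.32 (3), p.444-453 |
---|---|
Main authors: | Silvestre, Cláudia; Cardoso, Margarida G. M. S.; Figueiredo, Mário |
Format: | Article |
Language: | eng |
Subjects: | Algorithms; categorical features; cluster analysis; Clustering; EM algorithm; Feature selection; finite mixtures models; Studies |
Online access: | Full text |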
container_end_page | 453 |
---|---|
container_issue | 3 |
container_start_page | 444 |
container_title | Expert systems |
container_volume | 32 |
creator | Silvestre, Cláudia; Cardoso, Margarida G. M. S.; Figueiredo, Mário |
description | Research on the problem of feature selection for clustering continues to develop. This is a challenging task, mainly due to the absence of class labels to guide the search for relevant features. Categorical feature selection for clustering has rarely been addressed in the literature, with most of the proposed approaches having focused on numerical data. In this work, we propose an approach to simultaneously cluster categorical data and select a subset of relevant features. Our approach is based on a modification of a finite mixture model (of multinomial distributions), where a set of latent variables indicates the relevance of each feature. To estimate the model parameters, we implement a variant of the expectation‐maximization algorithm that simultaneously selects the subset of relevant features, using a minimum message length criterion. The proposed approach compares favourably with two baseline methods: a filter based on an entropy measure and a wrapper based on mutual information. The results obtained on synthetic data illustrate the ability of the proposed expectation‐maximization method to recover ground truth. An application to real data from official statistics shows its usefulness. |
doi_str_mv | 10.1111/exsy.12082 |
format | Article |
fullrecord | <record><control><sourceid>proquest_cross</sourceid><recordid>TN_cdi_proquest_journals_1686794083</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><sourcerecordid>3708637491</sourcerecordid><originalsourceid>FETCH-LOGICAL-c4002-12ca3d1d63c51ebc26b9b5e67ccdaa584efe6c9d583ccf44508f5f977bd2f4643</originalsourceid><addsrcrecordid>eNp9kEtLw0AUhYMoWKsbf8GAOyF1XplJllJqKxQVfFTdDJOZmzY1TepMQtt_b2rUpXdzN985B74gOCd4QNq7gq3fDQjFMT0IeoSLOMQs4YdBD1MhQi4pPg5OvF9ijImUohc834CuGwfIQwGmzqsSZZVDpmh8DS4v58joGuaVy40ukNW1Rpu8XiBdIlilYC1YtKosFMWe1eu1q7RZnAZHmS48nP38frszehpOwun9-HZ4PQ0Nx5iGhBrNLLGCmYhAaqhIkzQCIY2xWkcxhwyESWwUM2MyziMcZ1GWSJlamnHBWT-46Hrb2c8GfK2WVePKdlIREQuZcByzlrrsKOMq7x1kau3ylXY7RbDaa1N7bepbWwuTDt7kBez-IdXo9fHtNxN2mbyVtv3LaPehhGQyUrO7sYriCZm9PxD1wr4AwJSAjA</addsrcrecordid><sourcetype>Aggregation Database</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>1686794083</pqid></control><display><type>article</type><title>Feature selection for clustering categorical data with an embedded modelling approach</title><source>Wiley Journals</source><source>EBSCOhost Business Source Complete</source><creator>Silvestre, Cláudia ; Cardoso, Margarida G. M. S. ; Figueiredo, Mário</creator><creatorcontrib>Silvestre, Cláudia ; Cardoso, Margarida G. M. S. ; Figueiredo, Mário</creatorcontrib><description>Research on the problem of feature selection for clustering continues to develop. This is a challenging task, mainly due to the absence of class labels to guide the search for relevant features. Categorical feature selection for clustering has rarely been addressed in the literature, with most of the proposed approaches having focused on numerical data. In this work, we propose an approach to simultaneously cluster categorical data and select a subset of relevant features. 
Our approach is based on a modification of a finite mixture model (of multinomial distributions), where a set of latent variables indicate the relevance of each feature. To estimate the model parameters, we implement a variant of the expectation‐maximization algorithm that simultaneously selects the subset of relevant features, using a minimum message length criterion. The proposed approach compares favourably with two baseline methods: a filter based on an entropy measure and a wrapper based on mutual information. The results obtained on synthetic data illustrate the ability of the proposed expectation‐maximization method to recover ground truth. An application to real data, referred to official statistics, shows its usefulness.</description><identifier>ISSN: 0266-4720</identifier><identifier>EISSN: 1468-0394</identifier><identifier>DOI: 10.1111/exsy.12082</identifier><language>eng</language><publisher>Oxford: Blackwell Publishing Ltd</publisher><subject>Algorithms ; categorical features ; cluster analysis ; Clustering ; EM algorithm ; Feature selection ; finite mixtures models ; Studies</subject><ispartof>Expert systems, 2015-06, Vol.32 (3), p.444-453</ispartof><rights>2014 Wiley Publishing Ltd</rights><rights>Copyright © 2015 John Wiley & Sons, 
Ltd.</rights><lds50>peer_reviewed</lds50><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed><citedby>FETCH-LOGICAL-c4002-12ca3d1d63c51ebc26b9b5e67ccdaa584efe6c9d583ccf44508f5f977bd2f4643</citedby><cites>FETCH-LOGICAL-c4002-12ca3d1d63c51ebc26b9b5e67ccdaa584efe6c9d583ccf44508f5f977bd2f4643</cites></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><linktopdf>$$Uhttps://onlinelibrary.wiley.com/doi/pdf/10.1111%2Fexsy.12082$$EPDF$$P50$$Gwiley$$H</linktopdf><linktohtml>$$Uhttps://onlinelibrary.wiley.com/doi/full/10.1111%2Fexsy.12082$$EHTML$$P50$$Gwiley$$H</linktohtml><link.rule.ids>314,780,784,1417,27924,27925,45574,45575</link.rule.ids></links><search><creatorcontrib>Silvestre, Cláudia</creatorcontrib><creatorcontrib>Cardoso, Margarida G. M. S.</creatorcontrib><creatorcontrib>Figueiredo, Mário</creatorcontrib><title>Feature selection for clustering categorical data with an embedded modelling approach</title><title>Expert systems</title><addtitle>Expert Systems</addtitle><description>Research on the problem of feature selection for clustering continues to develop. This is a challenging task, mainly due to the absence of class labels to guide the search for relevant features. Categorical feature selection for clustering has rarely been addressed in the literature, with most of the proposed approaches having focused on numerical data. In this work, we propose an approach to simultaneously cluster categorical data and select a subset of relevant features. Our approach is based on a modification of a finite mixture model (of multinomial distributions), where a set of latent variables indicate the relevance of each feature. To estimate the model parameters, we implement a variant of the expectation‐maximization algorithm that simultaneously selects the subset of relevant features, using a minimum message length criterion. 
The proposed approach compares favourably with two baseline methods: a filter based on an entropy measure and a wrapper based on mutual information. The results obtained on synthetic data illustrate the ability of the proposed expectation‐maximization method to recover ground truth. An application to real data, referred to official statistics, shows its usefulness.</description><subject>Algorithms</subject><subject>categorical features</subject><subject>cluster analysis</subject><subject>Clustering</subject><subject>EM algorithm</subject><subject>Feature selection</subject><subject>finite mixtures models</subject><subject>Studies</subject><issn>0266-4720</issn><issn>1468-0394</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2015</creationdate><recordtype>article</recordtype><recordid>eNp9kEtLw0AUhYMoWKsbf8GAOyF1XplJllJqKxQVfFTdDJOZmzY1TepMQtt_b2rUpXdzN985B74gOCd4QNq7gq3fDQjFMT0IeoSLOMQs4YdBD1MhQi4pPg5OvF9ijImUohc834CuGwfIQwGmzqsSZZVDpmh8DS4v58joGuaVy40ukNW1Rpu8XiBdIlilYC1YtKosFMWe1eu1q7RZnAZHmS48nP38frszehpOwun9-HZ4PQ0Nx5iGhBrNLLGCmYhAaqhIkzQCIY2xWkcxhwyESWwUM2MyziMcZ1GWSJlamnHBWT-46Hrb2c8GfK2WVePKdlIREQuZcByzlrrsKOMq7x1kau3ylXY7RbDaa1N7bepbWwuTDt7kBez-IdXo9fHtNxN2mbyVtv3LaPehhGQyUrO7sYriCZm9PxD1wr4AwJSAjA</recordid><startdate>201506</startdate><enddate>201506</enddate><creator>Silvestre, Cláudia</creator><creator>Cardoso, Margarida G. M. S.</creator><creator>Figueiredo, Mário</creator><general>Blackwell Publishing Ltd</general><scope>BSCLL</scope><scope>AAYXX</scope><scope>CITATION</scope><scope>7SC</scope><scope>7TB</scope><scope>8FD</scope><scope>F28</scope><scope>FR3</scope><scope>JQ2</scope><scope>L7M</scope><scope>L~C</scope><scope>L~D</scope></search><sort><creationdate>201506</creationdate><title>Feature selection for clustering categorical data with an embedded modelling approach</title><author>Silvestre, Cláudia ; Cardoso, Margarida G. M. S. 
; Figueiredo, Mário</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-c4002-12ca3d1d63c51ebc26b9b5e67ccdaa584efe6c9d583ccf44508f5f977bd2f4643</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2015</creationdate><topic>Algorithms</topic><topic>categorical features</topic><topic>cluster analysis</topic><topic>Clustering</topic><topic>EM algorithm</topic><topic>Feature selection</topic><topic>finite mixtures models</topic><topic>Studies</topic><toplevel>peer_reviewed</toplevel><toplevel>online_resources</toplevel><creatorcontrib>Silvestre, Cláudia</creatorcontrib><creatorcontrib>Cardoso, Margarida G. M. S.</creatorcontrib><creatorcontrib>Figueiredo, Mário</creatorcontrib><collection>Istex</collection><collection>CrossRef</collection><collection>Computer and Information Systems Abstracts</collection><collection>Mechanical & Transportation Engineering Abstracts</collection><collection>Technology Research Database</collection><collection>ANTE: Abstracts in New Technology & Engineering</collection><collection>Engineering Research Database</collection><collection>ProQuest Computer Science Collection</collection><collection>Advanced Technologies Database with Aerospace</collection><collection>Computer and Information Systems Abstracts Academic</collection><collection>Computer and Information Systems Abstracts Professional</collection><jtitle>Expert systems</jtitle></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>Silvestre, Cláudia</au><au>Cardoso, Margarida G. M. 
S.</au><au>Figueiredo, Mário</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>Feature selection for clustering categorical data with an embedded modelling approach</atitle><jtitle>Expert systems</jtitle><addtitle>Expert Systems</addtitle><date>2015-06</date><risdate>2015</risdate><volume>32</volume><issue>3</issue><spage>444</spage><epage>453</epage><pages>444-453</pages><issn>0266-4720</issn><eissn>1468-0394</eissn><abstract>Research on the problem of feature selection for clustering continues to develop. This is a challenging task, mainly due to the absence of class labels to guide the search for relevant features. Categorical feature selection for clustering has rarely been addressed in the literature, with most of the proposed approaches having focused on numerical data. In this work, we propose an approach to simultaneously cluster categorical data and select a subset of relevant features. Our approach is based on a modification of a finite mixture model (of multinomial distributions), where a set of latent variables indicate the relevance of each feature. To estimate the model parameters, we implement a variant of the expectation‐maximization algorithm that simultaneously selects the subset of relevant features, using a minimum message length criterion. The proposed approach compares favourably with two baseline methods: a filter based on an entropy measure and a wrapper based on mutual information. The results obtained on synthetic data illustrate the ability of the proposed expectation‐maximization method to recover ground truth. An application to real data, referred to official statistics, shows its usefulness.</abstract><cop>Oxford</cop><pub>Blackwell Publishing Ltd</pub><doi>10.1111/exsy.12082</doi><tpages>10</tpages><oa>free_for_read</oa></addata></record> |
fulltext | fulltext |
identifier | ISSN: 0266-4720 |
ispartof | Expert systems, 2015-06, Vol.32 (3), p.444-453 |
issn | 0266-4720; 1468-0394 |
language | eng |
recordid | cdi_proquest_journals_1686794083 |
source | Wiley Journals; EBSCOhost Business Source Complete |
subjects | Algorithms; categorical features; cluster analysis; Clustering; EM algorithm; Feature selection; finite mixtures models; Studies |
title | Feature selection for clustering categorical data with an embedded modelling approach |
url | https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-04T09%3A13%3A58IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_cross&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Feature%20selection%20for%20clustering%20categorical%20data%20with%20an%20embedded%20modelling%20approach&rft.jtitle=Expert%20systems&rft.au=Silvestre,%20Cl%C3%A1udia&rft.date=2015-06&rft.volume=32&rft.issue=3&rft.spage=444&rft.epage=453&rft.pages=444-453&rft.issn=0266-4720&rft.eissn=1468-0394&rft_id=info:doi/10.1111/exsy.12082&rft_dat=%3Cproquest_cross%3E3708637491%3C/proquest_cross%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=1686794083&rft_id=info:pmid/&rfr_iscdi=true |
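
The description field of this record outlines a finite mixture of multinomial distributions fitted with an expectation-maximization algorithm. As a rough illustration only, the sketch below runs plain EM on a mixture of independent categorical features. It is not the authors' method: it omits the latent feature-relevance variables and the minimum message length criterion, and the function name `fit_categorical_mixture`, its parameters, and the smoothing constant `alpha` are all assumptions made for this example.

```python
import math
import random


def fit_categorical_mixture(data, n_components, n_iter=50, seed=0, alpha=1e-3):
    """Plain EM for a mixture of independent categorical features (a sketch).

    data: list of rows; each row is a list of category indices starting at 0.
    Returns (weights, theta, resp), where theta[k][j][c] estimates
    P(feature j = c | component k) and resp[i][k] is the responsibility
    of component k for row i.
    """
    rng = random.Random(seed)
    n, d = len(data), len(data[0])
    # Number of categories per feature, inferred from the data (an assumption).
    n_cats = [max(row[j] for row in data) + 1 for j in range(d)]

    # Random soft initialisation of the responsibilities.
    resp = []
    for _ in range(n):
        r = [rng.random() + 1e-6 for _ in range(n_components)]
        s = sum(r)
        resp.append([v / s for v in r])

    for _ in range(n_iter):
        # M-step: update mixing weights and per-feature category probabilities.
        nk = [sum(resp[i][k] for i in range(n)) for k in range(n_components)]
        weights = [v / n for v in nk]
        theta = []
        for k in range(n_components):
            per_feature = []
            for j in range(d):
                # alpha is a small smoothing count that keeps probabilities positive.
                counts = [alpha] * n_cats[j]
                for i in range(n):
                    counts[data[i][j]] += resp[i][k]
                total = sum(counts)
                per_feature.append([c / total for c in counts])
            theta.append(per_feature)

        # E-step: recompute responsibilities in log space for stability.
        for i in range(n):
            logp = [
                math.log(weights[k] + 1e-300)
                + sum(math.log(theta[k][j][data[i][j]]) for j in range(d))
                for k in range(n_components)
            ]
            m = max(logp)
            p = [math.exp(v - m) for v in logp]
            s = sum(p)
            resp[i] = [v / s for v in p]

    return weights, theta, resp
```

A call such as `fit_categorical_mixture(rows, 2)` returns the mixing weights, the per-component category probabilities, and the soft cluster assignments. Extending the sketch toward the approach in the description would mean adding a per-feature relevance variable and a model selection criterion, which are deliberately left out here.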