Numero: a statistical framework to define multivariable subgroups in complex population-based datasets

Abstract Large-scale epidemiological and population data provide opportunities to identify subgroups of people who are at risk of disease or exposed to adverse environments. Clustering algorithms are popular data-driven tools to identify these subgroups; however, relying exclusively on algorithms ma...

Detailed description


Bibliographic details
Published in:	International journal of epidemiology 2019-04, Vol.48 (2), p.369-374
Main authors:	Gao, Song; Mutter, Stefan; Casey, Aaron; Mäkinen, Ville-Petteri
Format:	Article
Language:	eng
Subjects:	Algorithms; Datasets as Topic; Epidemiologic Methods; Humans; Statistics as Topic
Online access:	Full text

container_end_page	374
container_issue	2
container_start_page	369
container_title	International journal of epidemiology
container_volume	48
creator	Gao, Song; Mutter, Stefan; Casey, Aaron; Mäkinen, Ville-Petteri
description	Abstract Large-scale epidemiological and population data provide opportunities to identify subgroups of people who are at risk of disease or exposed to adverse environments. Clustering algorithms are popular data-driven tools to identify these subgroups; however, relying exclusively on algorithms may not produce the best results if the dataset does not have a clustered structure. For this reason, we propose a framework (the R-library Numero) that combines the self-organizing map algorithm, permutation analysis for statistical evidence and a final expert-driven subgrouping step. We used Numero to define subgroups in two examples without an obvious clustering structure: a biomedical dataset of kidney disease and another dataset of community-level socioeconomic indicators. We benchmarked the Numero subgroupings against popular clustering algorithms (principal components, K-means and hierarchical clustering). The Numero subgroupings were more intuitive and easier to interpret without losing mathematical quality. Therefore, we expect Numero to be useful for exploratory analyses of population-based epidemiological datasets.
doi_str_mv	10.1093/ije/dyy113
format	Article
fullrecord	<record><control><sourceid>proquest_cross</sourceid><recordid>TN_cdi_proquest_miscellaneous_2060866581</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><oup_id>10.1093/ije/dyy113</oup_id><sourcerecordid>2060866581</sourcerecordid><originalsourceid>FETCH-LOGICAL-c353t-4a5421c3a0d70674e88030191999e20ef57adaa7135947cb6b972d1c1a9263903</originalsourceid><addsrcrecordid>eNp9kM1KxDAYRYMoOv5sfADJRhChTtKkSeNOxD8Q3ei6fG2_SsZ2UpNUnbc3MurS1d0cDtxDyCFnZ5wZMbcLnLerFedig8y4VDITqiw2yYwJxrJCa75DdkNYMMallGab7OTGSK1VPiPdwzSgd-cUaIgQbYi2gZ52Hgb8cP6VRkdb7OwS6TD10b6Dt1D3SMNUv3g3jYHaJW3cMPb4SUc3Tn2yuGVWQ8CWthDTxrBPtjroAx787B55vr56urzN7h9v7i4v7rNGFCJmEgqZ80YAazVTWmJZpg_ccGMM5gy7QkMLoLko0oGmVrXRecsbDiZXwjCxR07W3tG7twlDrAYbGux7WKKbQpUzxUqlipIn9HSNNt6F4LGrRm8H8KuKs-q7a5W6VuuuCT768U71gO0f-hsyAcdrICX5T_QFBoCB-Q</addsrcrecordid><sourcetype>Aggregation Database</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>2060866581</pqid></control><display><type>article</type><title>Numero: a statistical framework to define multivariable subgroups in complex population-based datasets</title><source>Oxford University Press Journals All Titles (1996-Current)</source><source>MEDLINE</source><source>Elektronische Zeitschriftenbibliothek - Frei zugängliche E-Journals</source><source>Alma/SFX Local Collection</source><creator>Gao, Song ; Mutter, Stefan ; Casey, Aaron ; Mäkinen, Ville-Petteri</creator><creatorcontrib>Gao, Song ; Mutter, Stefan ; Casey, Aaron ; Mäkinen, Ville-Petteri</creatorcontrib><description>Abstract Large-scale epidemiological and population data provide opportunities to identify subgroups of people who are at risk of disease or exposed to adverse environments. Clustering algorithms are popular data-driven tools to identify these subgroups; however, relying exclusively on algorithms may not produce the best results if the dataset does not have a clustered structure. 
For this reason, we propose a framework (the R-library Numero) that combines the self-organizing map algorithm, permutation analysis for statistical evidence and a final expert-driven subgrouping step. We used Numero to define subgroups in two examples without an obvious clustering structure: a biomedical dataset of kidney disease and another dataset of community-level socioeconomic indicators. We benchmarked the Numero subgroupings against popular clustering algorithms (principal components, K-means and hierarchical clustering). The Numero subgroupings were more intuitive and easier to interpret without losing mathematical quality. Therefore, we expect Numero to be useful for exploratory analyses of population-based epidemiological datasets.</description><identifier>ISSN: 0300-5771</identifier><identifier>EISSN: 1464-3685</identifier><identifier>DOI: 10.1093/ije/dyy113</identifier><identifier>PMID: 29947762</identifier><language>eng</language><publisher>England: Oxford University Press</publisher><subject>Algorithms ; Datasets as Topic ; Epidemiologic Methods ; Humans ; Statistics as Topic</subject><ispartof>International journal of epidemiology, 2019-04, Vol.48 (2), p.369-374</ispartof><rights>The Author(s) 2018; all rights reserved. Published by Oxford University Press on behalf of the International Epidemiological Association 2018</rights><rights>The Author(s) 2018; all rights reserved. 
Published by Oxford University Press on behalf of the International Epidemiological Association.</rights><lds50>peer_reviewed</lds50><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed><citedby>FETCH-LOGICAL-c353t-4a5421c3a0d70674e88030191999e20ef57adaa7135947cb6b972d1c1a9263903</citedby><cites>FETCH-LOGICAL-c353t-4a5421c3a0d70674e88030191999e20ef57adaa7135947cb6b972d1c1a9263903</cites></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><link.rule.ids>314,776,780,1578,27901,27902</link.rule.ids><backlink>$$Uhttps://www.ncbi.nlm.nih.gov/pubmed/29947762$$D View this record in MEDLINE/PubMed$$Hfree_for_read</backlink></links><search><creatorcontrib>Gao, Song</creatorcontrib><creatorcontrib>Mutter, Stefan</creatorcontrib><creatorcontrib>Casey, Aaron</creatorcontrib><creatorcontrib>Mäkinen, Ville-Petteri</creatorcontrib><title>Numero: a statistical framework to define multivariable subgroups in complex population-based datasets</title><title>International journal of epidemiology</title><addtitle>Int J Epidemiol</addtitle><description>Abstract Large-scale epidemiological and population data provide opportunities to identify subgroups of people who are at risk of disease or exposed to adverse environments. Clustering algorithms are popular data-driven tools to identify these subgroups; however, relying exclusively on algorithms may not produce the best results if the dataset does not have a clustered structure. For this reason, we propose a framework (the R-library Numero) that combines the self-organizing map algorithm, permutation analysis for statistical evidence and a final expert-driven subgrouping step. We used Numero to define subgroups in two examples without an obvious clustering structure: a biomedical dataset of kidney disease and another dataset of community-level socioeconomic indicators. 
We benchmarked the Numero subgroupings against popular clustering algorithms (principal components, K-means and hierarchical clustering). The Numero subgroupings were more intuitive and easier to interpret without losing mathematical quality. Therefore, we expect Numero to be useful for exploratory analyses of population-based epidemiological datasets.</description><subject>Algorithms</subject><subject>Datasets as Topic</subject><subject>Epidemiologic Methods</subject><subject>Humans</subject><subject>Statistics as Topic</subject><issn>0300-5771</issn><issn>1464-3685</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2019</creationdate><recordtype>article</recordtype><sourceid>EIF</sourceid><recordid>eNp9kM1KxDAYRYMoOv5sfADJRhChTtKkSeNOxD8Q3ei6fG2_SsZ2UpNUnbc3MurS1d0cDtxDyCFnZ5wZMbcLnLerFedig8y4VDITqiw2yYwJxrJCa75DdkNYMMallGab7OTGSK1VPiPdwzSgd-cUaIgQbYi2gZ52Hgb8cP6VRkdb7OwS6TD10b6Dt1D3SMNUv3g3jYHaJW3cMPb4SUc3Tn2yuGVWQ8CWthDTxrBPtjroAx787B55vr56urzN7h9v7i4v7rNGFCJmEgqZ80YAazVTWmJZpg_ccGMM5gy7QkMLoLko0oGmVrXRecsbDiZXwjCxR07W3tG7twlDrAYbGux7WKKbQpUzxUqlipIn9HSNNt6F4LGrRm8H8KuKs-q7a5W6VuuuCT768U71gO0f-hsyAcdrICX5T_QFBoCB-Q</recordid><startdate>20190401</startdate><enddate>20190401</enddate><creator>Gao, Song</creator><creator>Mutter, Stefan</creator><creator>Casey, Aaron</creator><creator>Mäkinen, Ville-Petteri</creator><general>Oxford University Press</general><scope>CGR</scope><scope>CUY</scope><scope>CVF</scope><scope>ECM</scope><scope>EIF</scope><scope>NPM</scope><scope>AAYXX</scope><scope>CITATION</scope><scope>7X8</scope></search><sort><creationdate>20190401</creationdate><title>Numero: a statistical framework to define multivariable subgroups in complex population-based datasets</title><author>Gao, Song ; Mutter, Stefan ; Casey, Aaron ; Mäkinen, 
Ville-Petteri</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-c353t-4a5421c3a0d70674e88030191999e20ef57adaa7135947cb6b972d1c1a9263903</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2019</creationdate><topic>Algorithms</topic><topic>Datasets as Topic</topic><topic>Epidemiologic Methods</topic><topic>Humans</topic><topic>Statistics as Topic</topic><toplevel>peer_reviewed</toplevel><toplevel>online_resources</toplevel><creatorcontrib>Gao, Song</creatorcontrib><creatorcontrib>Mutter, Stefan</creatorcontrib><creatorcontrib>Casey, Aaron</creatorcontrib><creatorcontrib>Mäkinen, Ville-Petteri</creatorcontrib><collection>Medline</collection><collection>MEDLINE</collection><collection>MEDLINE (Ovid)</collection><collection>MEDLINE</collection><collection>MEDLINE</collection><collection>PubMed</collection><collection>CrossRef</collection><collection>MEDLINE - Academic</collection><jtitle>International journal of epidemiology</jtitle></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>Gao, Song</au><au>Mutter, Stefan</au><au>Casey, Aaron</au><au>Mäkinen, Ville-Petteri</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>Numero: a statistical framework to define multivariable subgroups in complex population-based datasets</atitle><jtitle>International journal of epidemiology</jtitle><addtitle>Int J Epidemiol</addtitle><date>2019-04-01</date><risdate>2019</risdate><volume>48</volume><issue>2</issue><spage>369</spage><epage>374</epage><pages>369-374</pages><issn>0300-5771</issn><eissn>1464-3685</eissn><abstract>Abstract Large-scale epidemiological and population data provide opportunities to identify subgroups of people who are at risk of disease or exposed to adverse environments. 
Clustering algorithms are popular data-driven tools to identify these subgroups; however, relying exclusively on algorithms may not produce the best results if the dataset does not have a clustered structure. For this reason, we propose a framework (the R-library Numero) that combines the self-organizing map algorithm, permutation analysis for statistical evidence and a final expert-driven subgrouping step. We used Numero to define subgroups in two examples without an obvious clustering structure: a biomedical dataset of kidney disease and another dataset of community-level socioeconomic indicators. We benchmarked the Numero subgroupings against popular clustering algorithms (principal components, K-means and hierarchical clustering). The Numero subgroupings were more intuitive and easier to interpret without losing mathematical quality. Therefore, we expect Numero to be useful for exploratory analyses of population-based epidemiological datasets.</abstract><cop>England</cop><pub>Oxford University Press</pub><pmid>29947762</pmid><doi>10.1093/ije/dyy113</doi><tpages>6</tpages><oa>free_for_read</oa></addata></record>
fulltext	fulltext
identifier	ISSN: 0300-5771
ispartof	International journal of epidemiology, 2019-04, Vol.48 (2), p.369-374
issn	0300-5771 1464-3685
language	eng
recordid	cdi_proquest_miscellaneous_2060866581
source	Oxford University Press Journals All Titles (1996-Current); MEDLINE; Elektronische Zeitschriftenbibliothek - Frei zugängliche E-Journals; Alma/SFX Local Collection
subjects	Algorithms; Datasets as Topic; Epidemiologic Methods; Humans; Statistics as Topic
title	Numero: a statistical framework to define multivariable subgroups in complex population-based datasets
url	https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-02-02T20%3A49%3A38IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_cross&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Numero:%20a%20statistical%20framework%20to%20define%20multivariable%20subgroups%20in%20complex%20population-based%20datasets&rft.jtitle=International%20journal%20of%20epidemiology&rft.au=Gao,%20Song&rft.date=2019-04-01&rft.volume=48&rft.issue=2&rft.spage=369&rft.epage=374&rft.pages=369-374&rft.issn=0300-5771&rft.eissn=1464-3685&rft_id=info:doi/10.1093/ije/dyy113&rft_dat=%3Cproquest_cross%3E2060866581%3C/proquest_cross%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=2060866581&rft_id=info:pmid/29947762&rft_oup_id=10.1093/ije/dyy113&rfr_iscdi=true