Finding Large Average Submatrices in High Dimensional Data

The search for sample-variable associations is an important problem in the exploratory analysis of high dimensional data. Biclustering methods search for sample-variable associations in the form of distinguished submatrices of the data matrix. (The rows and columns of a submatrix need not be contigu...

Bibliographic details
Published in: The annals of applied statistics 2009-09, Vol.3 (3), p.985-1012
Main authors: Shabalin, Andrey A.; Weigman, Victor J.; Perou, Charles M.; Nobel, Andrew B.
Format: Article
Language: eng
Subjects:
Online access: Full text
container_end_page 1012
container_issue 3
container_start_page 985
container_title The annals of applied statistics
container_volume 3
creator Shabalin, Andrey A.
Weigman, Victor J.
Perou, Charles M.
Nobel, Andrew B.
description The search for sample-variable associations is an important problem in the exploratory analysis of high dimensional data. Biclustering methods search for sample-variable associations in the form of distinguished submatrices of the data matrix. (The rows and columns of a submatrix need not be contiguous.) In this paper we propose and evaluate a statistically motivated biclustering procedure (LAS) that finds large average submatrices within a given real-valued data matrix. The procedure operates in an iterative-residual fashion, and is driven by a Bonferroni-based significance score that effectively trades off between submatrix size and average value. We examine the performance and potential utility of LAS, and compare it with a number of existing methods, through an extensive three-part validation study using two gene expression datasets. The validation study examines quantitative properties of biclusters, biological and clinical assessments using auxiliary information, and classification of disease subtypes using bicluster membership. In addition, we carry out a simulation study to assess the effectiveness and noise sensitivity of the LAS search procedure. These results suggest that LAS is an effective exploratory tool for the discovery of biologically relevant structures in high dimensional data.
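Note on the description: the abstract mentions a Bonferroni-based significance score that trades off submatrix size against average value, but this record does not give its formula. The following is a minimal sketch of one score of that kind, assuming the entries of the m x n data matrix are standard normal under the null: the upper-tail probability of a k x l average is multiplied by the number of k x l submatrices, and the negative log is taken. The function names and the exact form of the score are illustrative assumptions, not taken from this record.

```python
import math


def log_binom(n, k):
    """Natural log of the binomial coefficient C(n, k)."""
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)


def log_normal_upper_tail(x):
    """Natural log of P(Z >= x) for a standard normal Z."""
    tail = 0.5 * math.erfc(x / math.sqrt(2.0))
    if tail > 0.0:
        return math.log(tail)
    # erfc underflowed to zero: use the leading-order asymptotic expansion.
    return -0.5 * x * x - math.log(x) - 0.5 * math.log(2.0 * math.pi)


def bonferroni_score(m, n, k, l, avg):
    """Hypothetical score of a k x l submatrix with average `avg`
    inside an m x n matrix (assumed standard normal under the null).

    Larger values mean the submatrix is less likely to arise by chance
    after correcting for the number of k x l submatrices.
    """
    # Bonferroni factor: how many k x l submatrices exist.
    log_count = log_binom(m, k) + log_binom(n, l)
    # The average of k*l independent N(0, 1) entries is N(0, 1 / (k*l)).
    log_tail = log_normal_upper_tail(avg * math.sqrt(k * l))
    return -(log_count + log_tail)


if __name__ == "__main__":
    # Same size, different averages: the higher average scores higher.
    print(bonferroni_score(100, 50, 10, 10, 1.0))
    print(bonferroni_score(100, 50, 10, 10, 0.5))
```

Under these assumptions the score rises with the average for a fixed size, while the count term penalizes sizes that can be chosen in many ways; this is the size-versus-average trade-off the description refers to.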
doi_str_mv 10.1214/09-AOAS239
format Article
fullrecord <record><control><sourceid>jstor_proje</sourceid><recordid>TN_cdi_projecteuclid_primary_oai_CULeuclid_euclid_aoas_1254773275</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><jstor_id>30242874</jstor_id><sourcerecordid>30242874</sourcerecordid><originalsourceid>FETCH-LOGICAL-c377t-2ff2db98671da70c0f7a74d421119277c7af3cfe7028a54ec60ad9c9acb09e7c3</originalsourceid><addsrcrecordid>eNo9kM1Lw0AQxRdRsFYv3oVcvAjR_Uom6y2k1gqBHmrPYbrZrVvSpOymgv-9KQ09vWHmvd_AI-SR0VfGmXyjKs6X-YoLdUUmTEkWgxD0-jQLHqcsgVtyF8KO0kRmkk3I-9y1tWu3UYl-a6L813gcdHXc7LH3TpsQuTZauO1PNHN70wbXtdhEM-zxntxYbIJ5GHVK1vOP72IRl8vPryIvYy0A-phby-uNylJgNQLV1AKCrCVnjCkOoAGt0NYA5Rkm0uiUYq20Qr2hyoAWU5KfuQff7YzuzVE3rq4O3u3R_1UduqpYl-N2FOwwVIwnEkBwSAbGy5mhfReCN_YSZ7Q6NVdRVY3NDebn8SEGjY312GoXLgnOqUwoTwff09m3C33nL3dBueQZSPEPKtR3Pg</addsrcrecordid><sourcetype>Open Access Repository</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype></control><display><type>article</type><title>Finding Large Average Submatrices in High Dimensional Data</title><source>Jstor Complete Legacy</source><source>EZB-FREE-00999 freely available EZB journals</source><source>Project Euclid Complete</source><source>Alma/SFX Local Collection</source><source>JSTOR Mathematics &amp; Statistics</source><creator>Shabalin, Andrey A. ; Weigman, Victor J. ; Perou, Charles M. ; Nobel, Andrew B.</creator><creatorcontrib>Shabalin, Andrey A. ; Weigman, Victor J. ; Perou, Charles M. ; Nobel, Andrew B.</creatorcontrib><description>The search for sample-variable associations is an important problem in the exploratory analysis of high dimensional data. Biclustering methods search for sample-variable associations in the form of distinguished submatrices of the data matrix. (The rows and columns of a submatrix need not be contiguous.) In this paper we propose and evaluate a statistically motivated biclustering procedure (LAS) that finds large average submatrices within a given real-valued data matrix. 
The procedure operates in an iterative-residual fashion, and is driven by a Bonferroni-based significance score that effectively trades off between submatrix size and average value. We examine the performance and potential utility of LAS, and compare it with a number of existing methods, through an extensive three-part validation study using two gene expression datasets. The validation study examines quantitative properties of biclusters, biological and clinical assessments using auxiliary information, and classification of disease subtypes using bicluster membership. In addition, we carry out a simulation study to assess the effectiveness and noise sensitivity of the LAS search procedure. These results suggest that LAS is an effective exploratory tool for the discovery of biologically relevant structures in high dimensional data.</description><identifier>ISSN: 1932-6157</identifier><identifier>EISSN: 1941-7330</identifier><identifier>DOI: 10.1214/09-AOAS239</identifier><language>eng</language><publisher>Cleveland, OH: Institute of Mathematical Statistics</publisher><subject>Algorithms ; Arithmetic mean ; Biclustering ; Breast cancer ; Calculus of variations and optimal control ; classification ; Correlations ; Datasets ; Exact sciences and technology ; Gene expression ; Genes ; Information search ; lung cancer ; Mathematical analysis ; Mathematics ; microarray ; Multivariate analysis ; P values ; Probability and statistics ; Samba ; Sciences and techniques of general use ; Statistics</subject><ispartof>The annals of applied statistics, 2009-09, Vol.3 (3), p.985-1012</ispartof><rights>Copyright 2009 Institute of Mathematical Statistics</rights><rights>2015 
INIST-CNRS</rights><lds50>peer_reviewed</lds50><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed><citedby>FETCH-LOGICAL-c377t-2ff2db98671da70c0f7a74d421119277c7af3cfe7028a54ec60ad9c9acb09e7c3</citedby><cites>FETCH-LOGICAL-c377t-2ff2db98671da70c0f7a74d421119277c7af3cfe7028a54ec60ad9c9acb09e7c3</cites></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><linktopdf>$$Uhttps://www.jstor.org/stable/pdf/30242874$$EPDF$$P50$$Gjstor$$H</linktopdf><linktohtml>$$Uhttps://www.jstor.org/stable/30242874$$EHTML$$P50$$Gjstor$$H</linktohtml><link.rule.ids>230,314,776,780,799,828,881,921,27901,27902,57992,57996,58225,58229,79752,79760</link.rule.ids><backlink>$$Uhttp://pascal-francis.inist.fr/vibad/index.php?action=getRecordDetail&amp;idt=22045026$$DView record in Pascal Francis$$Hfree_for_read</backlink></links><search><creatorcontrib>Shabalin, Andrey A.</creatorcontrib><creatorcontrib>Weigman, Victor J.</creatorcontrib><creatorcontrib>Perou, Charles M.</creatorcontrib><creatorcontrib>Nobel, Andrew B.</creatorcontrib><title>Finding Large Average Submatrices in High Dimensional Data</title><title>The annals of applied statistics</title><description>The search for sample-variable associations is an important problem in the exploratory analysis of high dimensional data. Biclustering methods search for sample-variable associations in the form of distinguished submatrices of the data matrix. (The rows and columns of a submatrix need not be contiguous.) In this paper we propose and evaluate a statistically motivated biclustering procedure (LAS) that finds large average submatrices within a given real-valued data matrix. The procedure operates in an iterative-residual fashion, and is driven by a Bonferroni-based significance score that effectively trades off between submatrix size and average value. 
We examine the performance and potential utility of LAS, and compare it with a number of existing methods, through an extensive three-part validation study using two gene expression datasets. The validation study examines quantitative properties of biclusters, biological and clinical assessments using auxiliary information, and classification of disease subtypes using bicluster membership. In addition, we carry out a simulation study to assess the effectiveness and noise sensitivity of the LAS search procedure. These results suggest that LAS is an effective exploratory tool for the discovery of biologically relevant structures in high dimensional data.</description><subject>Algorithms</subject><subject>Arithmetic mean</subject><subject>Biclustering</subject><subject>Breast cancer</subject><subject>Calculus of variations and optimal control</subject><subject>classification</subject><subject>Correlations</subject><subject>Datasets</subject><subject>Exact sciences and technology</subject><subject>Gene expression</subject><subject>Genes</subject><subject>Information search</subject><subject>lung cancer</subject><subject>Mathematical analysis</subject><subject>Mathematics</subject><subject>microarray</subject><subject>Multivariate analysis</subject><subject>P values</subject><subject>Probability and statistics</subject><subject>Samba</subject><subject>Sciences and techniques of general 
use</subject><subject>Statistics</subject><issn>1932-6157</issn><issn>1941-7330</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2009</creationdate><recordtype>article</recordtype><recordid>eNo9kM1Lw0AQxRdRsFYv3oVcvAjR_Uom6y2k1gqBHmrPYbrZrVvSpOymgv-9KQ09vWHmvd_AI-SR0VfGmXyjKs6X-YoLdUUmTEkWgxD0-jQLHqcsgVtyF8KO0kRmkk3I-9y1tWu3UYl-a6L813gcdHXc7LH3TpsQuTZauO1PNHN70wbXtdhEM-zxntxYbIJ5GHVK1vOP72IRl8vPryIvYy0A-phby-uNylJgNQLV1AKCrCVnjCkOoAGt0NYA5Rkm0uiUYq20Qr2hyoAWU5KfuQff7YzuzVE3rq4O3u3R_1UduqpYl-N2FOwwVIwnEkBwSAbGy5mhfReCN_YSZ7Q6NVdRVY3NDebn8SEGjY312GoXLgnOqUwoTwff09m3C33nL3dBueQZSPEPKtR3Pg</recordid><startdate>20090901</startdate><enddate>20090901</enddate><creator>Shabalin, Andrey A.</creator><creator>Weigman, Victor J.</creator><creator>Perou, Charles M.</creator><creator>Nobel, Andrew B.</creator><general>Institute of Mathematical Statistics</general><general>The Institute of Mathematical Statistics</general><scope>IQODW</scope><scope>AAYXX</scope><scope>CITATION</scope></search><sort><creationdate>20090901</creationdate><title>Finding Large Average Submatrices in High Dimensional Data</title><author>Shabalin, Andrey A. ; Weigman, Victor J. ; Perou, Charles M. 
; Nobel, Andrew B.</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-c377t-2ff2db98671da70c0f7a74d421119277c7af3cfe7028a54ec60ad9c9acb09e7c3</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2009</creationdate><topic>Algorithms</topic><topic>Arithmetic mean</topic><topic>Biclustering</topic><topic>Breast cancer</topic><topic>Calculus of variations and optimal control</topic><topic>classification</topic><topic>Correlations</topic><topic>Datasets</topic><topic>Exact sciences and technology</topic><topic>Gene expression</topic><topic>Genes</topic><topic>Information search</topic><topic>lung cancer</topic><topic>Mathematical analysis</topic><topic>Mathematics</topic><topic>microarray</topic><topic>Multivariate analysis</topic><topic>P values</topic><topic>Probability and statistics</topic><topic>Samba</topic><topic>Sciences and techniques of general use</topic><topic>Statistics</topic><toplevel>peer_reviewed</toplevel><toplevel>online_resources</toplevel><creatorcontrib>Shabalin, Andrey A.</creatorcontrib><creatorcontrib>Weigman, Victor J.</creatorcontrib><creatorcontrib>Perou, Charles M.</creatorcontrib><creatorcontrib>Nobel, Andrew B.</creatorcontrib><collection>Pascal-Francis</collection><collection>CrossRef</collection><jtitle>The annals of applied statistics</jtitle></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>Shabalin, Andrey A.</au><au>Weigman, Victor J.</au><au>Perou, Charles M.</au><au>Nobel, Andrew B.</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>Finding Large Average Submatrices in High Dimensional Data</atitle><jtitle>The annals of applied statistics</jtitle><date>2009-09-01</date><risdate>2009</risdate><volume>3</volume><issue>3</issue><spage>985</spage><epage>1012</epage><pages>985-1012</pages><issn>1932-6157</issn><eissn>1941-7330</eissn><abstract>The search 
for sample-variable associations is an important problem in the exploratory analysis of high dimensional data. Biclustering methods search for sample-variable associations in the form of distinguished submatrices of the data matrix. (The rows and columns of a submatrix need not be contiguous.) In this paper we propose and evaluate a statistically motivated biclustering procedure (LAS) that finds large average submatrices within a given real-valued data matrix. The procedure operates in an iterative-residual fashion, and is driven by a Bonferroni-based significance score that effectively trades off between submatrix size and average value. We examine the performance and potential utility of LAS, and compare it with a number of existing methods, through an extensive three-part validation study using two gene expression datasets. The validation study examines quantitative properties of biclusters, biological and clinical assessments using auxiliary information, and classification of disease subtypes using bicluster membership. In addition, we carry out a simulation study to assess the effectiveness and noise sensitivity of the LAS search procedure. These results suggest that LAS is an effective exploratory tool for the discovery of biologically relevant structures in high dimensional data.</abstract><cop>Cleveland, OH</cop><pub>Institute of Mathematical Statistics</pub><doi>10.1214/09-AOAS239</doi><tpages>28</tpages><oa>free_for_read</oa></addata></record>
fulltext fulltext
identifier ISSN: 1932-6157
ispartof The annals of applied statistics, 2009-09, Vol.3 (3), p.985-1012
issn 1932-6157
1941-7330
language eng
recordid cdi_projecteuclid_primary_oai_CULeuclid_euclid_aoas_1254773275
source Jstor Complete Legacy; EZB-FREE-00999 freely available EZB journals; Project Euclid Complete; Alma/SFX Local Collection; JSTOR Mathematics & Statistics
subjects Algorithms
Arithmetic mean
Biclustering
Breast cancer
Calculus of variations and optimal control
classification
Correlations
Datasets
Exact sciences and technology
Gene expression
Genes
Information search
lung cancer
Mathematical analysis
Mathematics
microarray
Multivariate analysis
P values
Probability and statistics
Samba
Sciences and techniques of general use
Statistics
title Finding Large Average Submatrices in High Dimensional Data
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-02-21T18%3A03%3A58IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-jstor_proje&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Finding%20Large%20Average%20Submatrices%20in%20High%20Dimensional%20Data&rft.jtitle=The%20annals%20of%20applied%20statistics&rft.au=Shabalin,%20Andrey%20A.&rft.date=2009-09-01&rft.volume=3&rft.issue=3&rft.spage=985&rft.epage=1012&rft.pages=985-1012&rft.issn=1932-6157&rft.eissn=1941-7330&rft_id=info:doi/10.1214/09-AOAS239&rft_dat=%3Cjstor_proje%3E30242874%3C/jstor_proje%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_id=info:pmid/&rft_jstor_id=30242874&rfr_iscdi=true