Waste not, want not: why rarefying microbiome data is inadmissible

Current practice in the normalization of microbiome count data is inefficient in the statistical sense. For apparently historical reasons, the common approach is either to use simple proportions (which does not address heteroscedasticity) or to use rarefying of counts, even though both of these appr...

Full description


Bibliographic details
Published in:	PLoS computational biology 2014-04, Vol.10 (4), p.e1003531-e1003531
Main authors:	McMurdie, Paul J; Holmes, Susan
Format:	Article
Language:	eng
Subjects:	Biology and Life Sciences; Deoxyribonucleic acid; DNA; DNA sequencing; Documentation; Gene expression; Generalized linear models; Genetic research; Measurement techniques; Microbiota; Models, Theoretical; Nucleotide sequencing; Physical Sciences; RNA sequencing; Sequence Analysis, DNA; Sequence Analysis, RNA; Statistical methods
Online access:	Full text

container_end_page	e1003531
container_issue	4
container_start_page	e1003531
container_title	PLoS computational biology
container_volume	10
creator	McMurdie, Paul J; Holmes, Susan
description	Current practice in the normalization of microbiome count data is inefficient in the statistical sense. For apparently historical reasons, the common approach is either to use simple proportions (which does not address heteroscedasticity) or to use rarefying of counts, even though both of these approaches are inappropriate for detection of differentially abundant species. Well-established statistical theory is available that simultaneously accounts for library size differences and biological variability using an appropriate mixture model. Moreover, specific implementations for DNA sequencing read count data (based on a Negative Binomial model for instance) are already available in RNA-Seq focused R packages such as edgeR and DESeq. Here we summarize the supporting statistical theory and use simulations and empirical data to demonstrate substantial improvements provided by a relevant mixture model framework over simple proportions or rarefying. We show how both proportions and rarefied counts result in a high rate of false positives in tests for species that are differentially abundant across sample classes. Regarding microbiome sample-wise clustering, we also show that the rarefying procedure often discards samples that can be accurately clustered by alternative methods. We further compare different Negative Binomial methods with a recently-described zero-inflated Gaussian mixture, implemented in a package called metagenomeSeq. We find that metagenomeSeq performs well when there is an adequate number of biological replicates, but it nevertheless tends toward a higher false positive rate. Based on these results and well-established statistical theory, we advocate that investigators avoid rarefying altogether. We have provided microbiome-specific extensions to these tools in the R package, phyloseq.
doi_str_mv	10.1371/journal.pcbi.1003531
format	Article
fullrecord	<record><control><sourceid>gale_plos_</sourceid><recordid>TN_cdi_plos_journals_1525299603</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><galeid>A367799717</galeid><doaj_id>oai_doaj_org_article_23df78e998f94d62ab52c12b001f49e5</doaj_id><sourcerecordid>A367799717</sourcerecordid><originalsourceid>FETCH-LOGICAL-c605t-187629e62cdc680ccc51e93af6befcbf00509d93e89f97bd6065a4b0f7e092e93</originalsourceid><addsrcrecordid>eNqVkktv1DAQxyMEoqXwDRDkCBK7-BHbMQekUvFYqQKJhzhatjNOvUrsJU4o--1x2LTqHpEPHo1_8_e8iuIpRmtMBX69jdMQdLfeWePXGCHKKL5XnGLG6EpQVt-_Y58Uj1Lazkwt-cPihFRcSsLq0-LdT51GKEMcX5XXOoyz9aa8vtqXgx7A7X1oy97bIRofeygbPerSp9IH3fQ-JW86eFw8cLpL8GS5z4ofH95_v_i0uvzycXNxfrmyHLFxhWvBiQRObGN5jay1DIOk2nEDzhqHEEOykRRq6aQwDUec6cogJwBJksmz4vlBd9fFpJbyk8KMMCIlRzQTmwPRRL1Vu8H3etirqL3654hDq_QwetuBIrRxogYpayerhhNtGLGYGISwqySwrPV2-W0yPTQWwjjo7kj0-CX4K9XG34pKUfGKZIEXi8AQf02QRpUbZqHrdIA4zXljihiuqMjo-oC2Oqfmg4tZ0ebTQO59DOB89p9TLoSUAs8BL48CMjPCn7HVU0pq8-3rf7Cfj9nqwOaBp5Tnf1svRmreupu2q3nr1LJ1OezZ3V7dBt2sGf0LTEnT8Q</addsrcrecordid><sourcetype>Open Website</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>1513051437</pqid></control><display><type>article</type><title>Waste not, want not: why rarefying microbiome data is inadmissible</title><source>MEDLINE</source><source>DOAJ Directory of Open Access Journals</source><source>Elektronische Zeitschriftenbibliothek - Frei zugängliche E-Journals</source><source>Public Library of Science (PLoS) Journals Open Access</source><source>PubMed Central</source><creator>McMurdie, Paul J ; Holmes, Susan</creator><contributor>McHardy, Alice Carolyn</contributor><creatorcontrib>McMurdie, Paul J ; Holmes, Susan ; McHardy, Alice Carolyn</creatorcontrib><description>Current practice in the normalization of microbiome count data is inefficient in the statistical sense. 
For apparently historical reasons, the common approach is either to use simple proportions (which does not address heteroscedasticity) or to use rarefying of counts, even though both of these approaches are inappropriate for detection of differentially abundant species. Well-established statistical theory is available that simultaneously accounts for library size differences and biological variability using an appropriate mixture model. Moreover, specific implementations for DNA sequencing read count data (based on a Negative Binomial model for instance) are already available in RNA-Seq focused R packages such as edgeR and DESeq. Here we summarize the supporting statistical theory and use simulations and empirical data to demonstrate substantial improvements provided by a relevant mixture model framework over simple proportions or rarefying. We show how both proportions and rarefied counts result in a high rate of false positives in tests for species that are differentially abundant across sample classes. Regarding microbiome sample-wise clustering, we also show that the rarefying procedure often discards samples that can be accurately clustered by alternative methods. We further compare different Negative Binomial methods with a recently-described zero-inflated Gaussian mixture, implemented in a package called metagenomeSeq. We find that metagenomeSeq performs well when there is an adequate number of biological replicates, but it nevertheless tends toward a higher false positive rate. Based on these results and well-established statistical theory, we advocate that investigators avoid rarefying altogether. 
We have provided microbiome-specific extensions to these tools in the R package, phyloseq.</description><identifier>ISSN: 1553-7358</identifier><identifier>ISSN: 1553-734X</identifier><identifier>EISSN: 1553-7358</identifier><identifier>DOI: 10.1371/journal.pcbi.1003531</identifier><identifier>PMID: 24699258</identifier><language>eng</language><publisher>United States: Public Library of Science</publisher><subject>Biology and Life Sciences ; Deoxyribonucleic acid ; DNA ; DNA sequencing ; Documentation ; Gene expression ; Generalized linear models ; Genetic research ; Measurement techniques ; Microbiota ; Models, Theoretical ; Nucleotide sequencing ; Physical Sciences ; RNA sequencing ; Sequence Analysis, DNA ; Sequence Analysis, RNA ; Statistical methods</subject><ispartof>PLoS computational biology, 2014-04, Vol.10 (4), p.e1003531-e1003531</ispartof><rights>COPYRIGHT 2014 Public Library of Science</rights><rights>2014 McMurdie, Holmes 2014 McMurdie, Holmes</rights><rights>2014 McMurdie, Holmes. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited: McMurdie PJ, Holmes S (2014) Waste Not, Want Not: Why Rarefying Microbiome Data Is Inadmissible. PLoS Comput Biol 10(4): e1003531. 
doi:10.1371/journal.pcbi.1003531</rights><lds50>peer_reviewed</lds50><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed><citedby>FETCH-LOGICAL-c605t-187629e62cdc680ccc51e93af6befcbf00509d93e89f97bd6065a4b0f7e092e93</citedby><cites>FETCH-LOGICAL-c605t-187629e62cdc680ccc51e93af6befcbf00509d93e89f97bd6065a4b0f7e092e93</cites></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><linktopdf>$$Uhttps://www.ncbi.nlm.nih.gov/pmc/articles/PMC3974642/pdf/$$EPDF$$P50$$Gpubmedcentral$$Hfree_for_read</linktopdf><linktohtml>$$Uhttps://www.ncbi.nlm.nih.gov/pmc/articles/PMC3974642/$$EHTML$$P50$$Gpubmedcentral$$Hfree_for_read</linktohtml><link.rule.ids>230,315,728,781,785,865,886,2103,2929,23870,27928,27929,53795,53797</link.rule.ids><backlink>$$Uhttps://www.ncbi.nlm.nih.gov/pubmed/24699258$$D View this record in MEDLINE/PubMed$$Hfree_for_read</backlink></links><search><contributor>McHardy, Alice Carolyn</contributor><creatorcontrib>McMurdie, Paul J</creatorcontrib><creatorcontrib>Holmes, Susan</creatorcontrib><title>Waste not, want not: why rarefying microbiome data is inadmissible</title><title>PLoS computational biology</title><addtitle>PLoS Comput Biol</addtitle><description>Current practice in the normalization of microbiome count data is inefficient in the statistical sense. For apparently historical reasons, the common approach is either to use simple proportions (which does not address heteroscedasticity) or to use rarefying of counts, even though both of these approaches are inappropriate for detection of differentially abundant species. Well-established statistical theory is available that simultaneously accounts for library size differences and biological variability using an appropriate mixture model. 
Moreover, specific implementations for DNA sequencing read count data (based on a Negative Binomial model for instance) are already available in RNA-Seq focused R packages such as edgeR and DESeq. Here we summarize the supporting statistical theory and use simulations and empirical data to demonstrate substantial improvements provided by a relevant mixture model framework over simple proportions or rarefying. We show how both proportions and rarefied counts result in a high rate of false positives in tests for species that are differentially abundant across sample classes. Regarding microbiome sample-wise clustering, we also show that the rarefying procedure often discards samples that can be accurately clustered by alternative methods. We further compare different Negative Binomial methods with a recently-described zero-inflated Gaussian mixture, implemented in a package called metagenomeSeq. We find that metagenomeSeq performs well when there is an adequate number of biological replicates, but it nevertheless tends toward a higher false positive rate. Based on these results and well-established statistical theory, we advocate that investigators avoid rarefying altogether. 
We have provided microbiome-specific extensions to these tools in the R package, phyloseq.</description><subject>Biology and Life Sciences</subject><subject>Deoxyribonucleic acid</subject><subject>DNA</subject><subject>DNA sequencing</subject><subject>Documentation</subject><subject>Gene expression</subject><subject>Generalized linear models</subject><subject>Genetic research</subject><subject>Measurement techniques</subject><subject>Microbiota</subject><subject>Models, Theoretical</subject><subject>Nucleotide sequencing</subject><subject>Physical Sciences</subject><subject>RNA sequencing</subject><subject>Sequence Analysis, DNA</subject><subject>Sequence Analysis, RNA</subject><subject>Statistical methods</subject><issn>1553-7358</issn><issn>1553-734X</issn><issn>1553-7358</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2014</creationdate><recordtype>article</recordtype><sourceid>EIF</sourceid><sourceid>DOA</sourceid><recordid>eNqVkktv1DAQxyMEoqXwDRDkCBK7-BHbMQekUvFYqQKJhzhatjNOvUrsJU4o--1x2LTqHpEPHo1_8_e8iuIpRmtMBX69jdMQdLfeWePXGCHKKL5XnGLG6EpQVt-_Y58Uj1Lazkwt-cPihFRcSsLq0-LdT51GKEMcX5XXOoyz9aa8vtqXgx7A7X1oy97bIRofeygbPerSp9IH3fQ-JW86eFw8cLpL8GS5z4ofH95_v_i0uvzycXNxfrmyHLFxhWvBiQRObGN5jay1DIOk2nEDzhqHEEOykRRq6aQwDUec6cogJwBJksmz4vlBd9fFpJbyk8KMMCIlRzQTmwPRRL1Vu8H3etirqL3654hDq_QwetuBIrRxogYpayerhhNtGLGYGISwqySwrPV2-W0yPTQWwjjo7kj0-CX4K9XG34pKUfGKZIEXi8AQf02QRpUbZqHrdIA4zXljihiuqMjo-oC2Oqfmg4tZ0ebTQO59DOB89p9TLoSUAs8BL48CMjPCn7HVU0pq8-3rf7Cfj9nqwOaBp5Tnf1svRmreupu2q3nr1LJ1OezZ3V7dBt2sGf0LTEnT8Q</recordid><startdate>20140401</startdate><enddate>20140401</enddate><creator>McMurdie, Paul J</creator><creator>Holmes, Susan</creator><general>Public Library of Science</general><general>Public Library of Science 
(PLoS)</general><scope>CGR</scope><scope>CUY</scope><scope>CVF</scope><scope>ECM</scope><scope>EIF</scope><scope>NPM</scope><scope>AAYXX</scope><scope>CITATION</scope><scope>ISN</scope><scope>ISR</scope><scope>7X8</scope><scope>5PM</scope><scope>DOA</scope></search><sort><creationdate>20140401</creationdate><title>Waste not, want not: why rarefying microbiome data is inadmissible</title><author>McMurdie, Paul J ; Holmes, Susan</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-c605t-187629e62cdc680ccc51e93af6befcbf00509d93e89f97bd6065a4b0f7e092e93</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2014</creationdate><topic>Biology and Life Sciences</topic><topic>Deoxyribonucleic acid</topic><topic>DNA</topic><topic>DNA sequencing</topic><topic>Documentation</topic><topic>Gene expression</topic><topic>Generalized linear models</topic><topic>Genetic research</topic><topic>Measurement techniques</topic><topic>Microbiota</topic><topic>Models, Theoretical</topic><topic>Nucleotide sequencing</topic><topic>Physical Sciences</topic><topic>RNA sequencing</topic><topic>Sequence Analysis, DNA</topic><topic>Sequence Analysis, RNA</topic><topic>Statistical methods</topic><toplevel>peer_reviewed</toplevel><toplevel>online_resources</toplevel><creatorcontrib>McMurdie, Paul J</creatorcontrib><creatorcontrib>Holmes, Susan</creatorcontrib><collection>Medline</collection><collection>MEDLINE</collection><collection>MEDLINE (Ovid)</collection><collection>MEDLINE</collection><collection>MEDLINE</collection><collection>PubMed</collection><collection>CrossRef</collection><collection>Gale In Context: Canada</collection><collection>Gale In Context: Science</collection><collection>MEDLINE - Academic</collection><collection>PubMed Central (Full Participant titles)</collection><collection>DOAJ Directory of Open Access Journals</collection><jtitle>PLoS computational 
biology</jtitle></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>McMurdie, Paul J</au><au>Holmes, Susan</au><au>McHardy, Alice Carolyn</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>Waste not, want not: why rarefying microbiome data is inadmissible</atitle><jtitle>PLoS computational biology</jtitle><addtitle>PLoS Comput Biol</addtitle><date>2014-04-01</date><risdate>2014</risdate><volume>10</volume><issue>4</issue><spage>e1003531</spage><epage>e1003531</epage><pages>e1003531-e1003531</pages><issn>1553-7358</issn><issn>1553-734X</issn><eissn>1553-7358</eissn><abstract>Current practice in the normalization of microbiome count data is inefficient in the statistical sense. For apparently historical reasons, the common approach is either to use simple proportions (which does not address heteroscedasticity) or to use rarefying of counts, even though both of these approaches are inappropriate for detection of differentially abundant species. Well-established statistical theory is available that simultaneously accounts for library size differences and biological variability using an appropriate mixture model. Moreover, specific implementations for DNA sequencing read count data (based on a Negative Binomial model for instance) are already available in RNA-Seq focused R packages such as edgeR and DESeq. Here we summarize the supporting statistical theory and use simulations and empirical data to demonstrate substantial improvements provided by a relevant mixture model framework over simple proportions or rarefying. We show how both proportions and rarefied counts result in a high rate of false positives in tests for species that are differentially abundant across sample classes. Regarding microbiome sample-wise clustering, we also show that the rarefying procedure often discards samples that can be accurately clustered by alternative methods. 
We further compare different Negative Binomial methods with a recently-described zero-inflated Gaussian mixture, implemented in a package called metagenomeSeq. We find that metagenomeSeq performs well when there is an adequate number of biological replicates, but it nevertheless tends toward a higher false positive rate. Based on these results and well-established statistical theory, we advocate that investigators avoid rarefying altogether. We have provided microbiome-specific extensions to these tools in the R package, phyloseq.</abstract><cop>United States</cop><pub>Public Library of Science</pub><pmid>24699258</pmid><doi>10.1371/journal.pcbi.1003531</doi><oa>free_for_read</oa></addata></record>
fulltext	fulltext
identifier	ISSN: 1553-7358
ispartof	PLoS computational biology, 2014-04, Vol.10 (4), p.e1003531-e1003531
issn	1553-7358 1553-734X 1553-7358
language	eng
recordid	cdi_plos_journals_1525299603
source	MEDLINE; DOAJ Directory of Open Access Journals; Elektronische Zeitschriftenbibliothek - Frei zugängliche E-Journals; Public Library of Science (PLoS) Journals Open Access; PubMed Central
subjects	Biology and Life Sciences; Deoxyribonucleic acid; DNA; DNA sequencing; Documentation; Gene expression; Generalized linear models; Genetic research; Measurement techniques; Microbiota; Models, Theoretical; Nucleotide sequencing; Physical Sciences; RNA sequencing; Sequence Analysis, DNA; Sequence Analysis, RNA; Statistical methods
title	Waste not, want not: why rarefying microbiome data is inadmissible
url	https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2024-12-16T20%3A47%3A37IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-gale_plos_&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Waste%20not,%20want%20not:%20why%20rarefying%20microbiome%20data%20is%20inadmissible&rft.jtitle=PLoS%20computational%20biology&rft.au=McMurdie,%20Paul%20J&rft.date=2014-04-01&rft.volume=10&rft.issue=4&rft.spage=e1003531&rft.epage=e1003531&rft.pages=e1003531-e1003531&rft.issn=1553-7358&rft.eissn=1553-7358&rft_id=info:doi/10.1371/journal.pcbi.1003531&rft_dat=%3Cgale_plos_%3EA367799717%3C/gale_plos_%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=1513051437&rft_id=info:pmid/24699258&rft_galeid=A367799717&rft_doaj_id=oai_doaj_org_article_23df78e998f94d62ab52c12b001f49e5&rfr_iscdi=true
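
Illustrative note on the description field: the abstract contrasts two normalizations, simple proportions and rarefying. The following is a minimal sketch of what each does to count data, written in plain Python rather than with phyloseq, edgeR or DESeq. The sample names, taxon labels, counts, the depth of 1000 reads and the helper names `rarefy` and `proportions` are all made up for illustration; this is not the authors' implementation.

```python
import random
from collections import Counter


def rarefy(counts, depth, rng):
    """Subsample `depth` reads without replacement from one sample.

    Returns None when the sample has fewer than `depth` reads, which is
    how rarefying ends up discarding whole samples.
    """
    total = sum(counts.values())
    if total < depth:
        return None
    # Expand the count table into one label per read, then draw reads.
    reads = [taxon for taxon, n in counts.items() for _ in range(n)]
    return Counter(rng.sample(reads, depth))


def proportions(counts):
    """Divide each count by the sample's library size."""
    total = sum(counts.values())
    return {taxon: n / total for taxon, n in counts.items()}


# Hypothetical samples with very different library sizes.
samples = {
    "A": {"otu1": 500, "otu2": 300, "otu3": 200},     # 1000 reads
    "B": {"otu1": 5000, "otu2": 3000, "otu3": 2000},  # 10000 reads
    "C": {"otu1": 40, "otu2": 30, "otu3": 10},        # 80 reads
}

depth = 1000             # assumed common depth
rng = random.Random(0)   # fixed seed so the sketch is repeatable

rarefied = {name: rarefy(c, depth, rng) for name, c in samples.items()}
kept = {name: c for name, c in rarefied.items() if c is not None}

reads_before = sum(sum(c.values()) for c in samples.values())
reads_after = sum(sum(c.values()) for c in kept.values())
discarded_fraction = 1 - reads_after / reads_before

print("samples kept after rarefying:", sorted(kept))
print("fraction of reads discarded:", round(discarded_fraction, 3))
print("proportions A:", proportions(samples["A"]))
print("proportions B:", proportions(samples["B"]))
```

In this toy input, rarefying drops sample C entirely and throws away most of sample B's reads, while proportions make A and B look identical even though B's estimates rest on ten times as many reads. That lost information about precision is what the abstract refers to when it says proportions do not address heteroscedasticity.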