To Include or Not to Include: The Impact of Gene Filtering on Species Tree Estimation Methods

With the increasing availability of whole genome data, many species trees are being constructed from hundreds to thousands of loci. Although concatenation analysis using maximum likelihood is a standard approach for estimating species trees, it does not account for gene tree heterogeneity, which can...

Bibliographic Details
Published in: Systematic biology 2018-03, Vol.67 (2), p.285-303
Main Authors: Molloy, Erin K., Warnow, Tandy
Format: Article
Language: eng
Subjects:
Online Access: Full text
container_end_page 303
container_issue 2
container_start_page 285
container_title Systematic biology
container_volume 67
creator Molloy, Erin K.
Warnow, Tandy
description With the increasing availability of whole genome data, many species trees are being constructed from hundreds to thousands of loci. Although concatenation analysis using maximum likelihood is a standard approach for estimating species trees, it does not account for gene tree heterogeneity, which can occur due to many biological processes, such as incomplete lineage sorting. Coalescent species tree estimation methods, many of which are statistically consistent in the presence of incomplete lineage sorting, include Bayesian methods that coestimate the gene trees and the species tree, summary methods that compute the species tree by combining estimated gene trees, and site-based methods that infer the species tree from site patterns in the alignments of different loci. Due to concerns that poor quality loci will reduce the accuracy of estimated species trees, many recent phylogenomic studies have removed or filtered genes on the basis of phylogenetic signal and/or missing data prior to inferring species trees; little is known about the performance of species tree estimation methods when gene filtering is performed. We examine how incomplete lineage sorting, phylogenetic signal of individual loci, and missing data affect the absolute and the relative accuracy of species tree estimation methods and show how these properties affect methods’ responses to gene filtering strategies. In particular, summary methods (ASTRAL-II, ASTRID, and MP-EST), a site-based coalescent method (SVDquartets within PAUP*), and an unpartitioned concatenation analysis using maximum likelihood (RAxML) were evaluated on a heterogeneous collection of simulated multilocus data sets, and the following trends were observed. 
Filtering genes based on gene tree estimation error improved the accuracy of the summary methods when levels of incomplete lineage sorting were low to moderate but did not benefit the summary methods under higher levels of incomplete lineage sorting, unless gene tree estimation error was also extremely high (a model condition with few replicates). Neither SVDquartets nor concatenation analysis using RAxML benefited from filtering genes on the basis of gene tree estimation error. Finally, filtering genes based on missing data was either neutral (i.e., did not impact accuracy) or else reduced the accuracy of all five methods. By providing insight into the consequences of gene filtering, we offer recommendations for estimating species trees in the presence of incomplete lineage sorting and reconcile seemingly conflicting observations made in prior studies regarding the impact of gene filtering.
doi_str_mv 10.1093/sysbio/syx077
format Article
addtitle Syst Biol
date 2018-03-01
eissn 1076-836X
jstor_id 26581937
oa free_for_read
pmid 29029338
publisher England: Oxford University Press
rights The Author(s) 2017. Published by Oxford University Press, on behalf of the Society of Systematic Biologists. All rights reserved. For Permissions, please email: journals.permissions@oup.com
toplevel peer_reviewed; online_resources
tpages 19
fulltext fulltext
identifier ISSN: 1063-5157
ispartof Systematic biology, 2018-03, Vol.67 (2), p.285-303
issn 1063-5157
1076-836X
language eng
recordid cdi_proquest_miscellaneous_1951417243
source Jstor Complete Legacy; Oxford University Press Journals All Titles (1996-Current); MEDLINE; Alma/SFX Local Collection
subjects Classification - methods
Computer Simulation
Genetic Speciation
Genomics
Models, Genetic
Phylogeny
REGULAR ARTICLES
Sequence Analysis
title To Include or Not to Include: The Impact of Gene Filtering on Species Tree Estimation Methods
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-27T03%3A08%3A28IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-jstor_proqu&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=To%20Include%20or%20Not%20to%20Include:%20The%20Impact%20of%20Gene%20Filtering%20on%20Species%20Tree%20Estimation%20Methods&rft.jtitle=Systematic%20biology&rft.au=Molloy,%20Erin%20K.&rft.date=2018-03-01&rft.volume=67&rft.issue=2&rft.spage=285&rft.epage=303&rft.pages=285-303&rft.issn=1063-5157&rft.eissn=1076-836X&rft_id=info:doi/10.1093/sysbio/syx077&rft_dat=%3Cjstor_proqu%3E26581937%3C/jstor_proqu%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=1951417243&rft_id=info:pmid/29029338&rft_jstor_id=26581937&rft_oup_id=10.1093/sysbio/syx077&rfr_iscdi=true