Adaptive RAxML-NG: Accelerating Phylogenetic Inference under Maximum Likelihood using Dataset Difficulty

Abstract Phylogenetic inferences under the maximum likelihood criterion deploy heuristic tree search strategies to explore the vast search space. Depending on the input dataset, searches from different starting trees might all converge to a single tree topology. Often, though, distinct searches infe...

Detailed description

Bibliographic details
Published in: Molecular biology and evolution 2023-10, Vol.40 (10)
Main authors: Togkousidis, Anastasis; Kozlov, Oleksiy M; Haag, Julia; Höhler, Dimitri; Stamatakis, Alexandros
Format: Article
Language: eng
Subjects: Machine learning; Methods; Phylogeny
Online access: Full text
container_end_page
container_issue 10
container_start_page
container_title Molecular biology and evolution
container_volume 40
creator Togkousidis, Anastasis
Kozlov, Oleksiy M
Haag, Julia
Höhler, Dimitri
Stamatakis, Alexandros
description Abstract Phylogenetic inferences under the maximum likelihood criterion deploy heuristic tree search strategies to explore the vast search space. Depending on the input dataset, searches from different starting trees might all converge to a single tree topology. Often, though, distinct searches infer multiple topologies with large log-likelihood score differences or yield topologically highly distinct, yet almost equally likely, trees. Recently, Haag et al. introduced an approach to quantify, and implemented machine learning methods to predict, the dataset difficulty with respect to phylogenetic inference. Easy multiple sequence alignments (MSAs) exhibit a single likelihood peak on their likelihood surface, associated with a single tree topology to which most, if not all, independent searches rapidly converge. As difficulty increases, multiple locally optimal likelihood peaks emerge, yet from highly distinct topologies. To make use of this information, we introduce and implement an adaptive tree search heuristic in RAxML-NG, which modifies the thoroughness of the tree search strategy as a function of the predicted difficulty. Our adaptive strategy is based upon three observations. First, on easy datasets, searches converge rapidly and can hence be terminated at an earlier stage. Second, overanalyzing difficult datasets is hopeless, and thus it suffices to quickly infer only one of the numerous almost equally likely topologies to reduce overall execution time. Third, more extensive searches are justified and required on datasets with intermediate difficulty. While the likelihood surface exhibits multiple locally optimal peaks in this case, a small proportion of them is significantly better. Our experimental results for the adaptive heuristic on 9,515 empirical and 5,000 simulated datasets with varying difficulty exhibit substantial speedups, especially on easy and difficult datasets (53% of total MSAs), where we observe average speedups of more than 10×. Further, approximately 94% of the inferred trees using the adaptive strategy are statistically indistinguishable from the trees inferred under the standard strategy (RAxML-NG).
doi_str_mv 10.1093/molbev/msad227
format Article
fullrecord <record><control><sourceid>gale_pubme</sourceid><recordid>TN_cdi_pubmedcentral_primary_oai_pubmedcentral_nih_gov_10584362</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><galeid>A775003535</galeid><oup_id>10.1093/molbev/msad227</oup_id><sourcerecordid>A775003535</sourcerecordid><originalsourceid>FETCH-LOGICAL-c396t-8524e958619ff6ed3585c407b4e77a9e4c60eb9ef3abb87eaf0a245efaad457a3</originalsourceid><addsrcrecordid>eNqFUcFu1DAQtRCILoUr5xzhkNaO7TjhgqIWSqUtIARna-KMdw2OvbWTVffvSbWrSpzQO8xo5r2n0TxC3jJ6wWjLL8foe9xfjhmGqlLPyIpJrkqmWPucrKhaekF5c0Ze5fybUiZEXb8kZ1w1VDBWr8i2G2A3uT0WP7qHu3X59eZD0RmDHhNMLmyK79uDjxsMODlT3AaLCYPBYg4DpuIOHtw4j8Xa_UHvtjEOxZwfVdcwQcapuHbWOjP76fCavLDgM7451XPy6_Onn1dfyvW3m9urbl0a3tZT2chKYCubmrXW1jhw2UgjqOoFKgUtClNT7Fu0HPq-UQiWQiUkWoBBSAX8nHw8-u7mfsTBYJgSeL1LboR00BGc_ncT3FZv4l4zKhvB62pxeHdySPF-xjzp0eXlIx4CxjnrqlGiqivJmoV6caRuwKN2wcbF0iwYcHQmBrRumXdKSUq5XPAkMCnmnNA-HcaofgxUHwPVp0AXwfujIM67_3H_ArtCpXE</addsrcrecordid><sourcetype>Open Access Repository</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>2874262518</pqid></control><display><type>article</type><title>Adaptive RAxML-NG: Accelerating Phylogenetic Inference under Maximum Likelihood using Dataset Difficulty</title><source>DOAJ Directory of Open Access Journals</source><source>Elektronische Zeitschriftenbibliothek - Frei zugängliche E-Journals</source><source>Oxford Journals Open Access Collection</source><source>PubMed Central</source><source>Alma/SFX Local Collection</source><source>Free Full-Text Journals in Chemistry</source><creator>Togkousidis, Anastasis ; Kozlov, Oleksiy M ; Haag, Julia ; Höhler, Dimitri ; Stamatakis, Alexandros</creator><contributor>Bonatto, Sandro</contributor><creatorcontrib>Togkousidis, Anastasis ; Kozlov, Oleksiy M ; Haag, Julia ; Höhler, Dimitri ; Stamatakis, Alexandros ; Bonatto, Sandro</creatorcontrib><description>Abstract Phylogenetic inferences under the maximum likelihood criterion deploy heuristic tree search 
strategies to explore the vast search space. Depending on the input dataset, searches from different starting trees might all converge to a single tree topology. Often, though, distinct searches infer multiple topologies with large log-likelihood score differences or yield topologically highly distinct, yet almost equally likely, trees. Recently, Haag et al. introduced an approach to quantify, and implemented machine learning methods to predict, the dataset difficulty with respect to phylogenetic inference. Easy multiple sequence alignments (MSAs) exhibit a single likelihood peak on their likelihood surface, associated with a single tree topology to which most, if not all, independent searches rapidly converge. As difficulty increases, multiple locally optimal likelihood peaks emerge, yet from highly distinct topologies. To make use of this information, we introduce and implement an adaptive tree search heuristic in RAxML-NG, which modifies the thoroughness of the tree search strategy as a function of the predicted difficulty. Our adaptive strategy is based upon three observations. First, on easy datasets, searches converge rapidly and can hence be terminated at an earlier stage. Second, overanalyzing difficult datasets is hopeless, and thus it suffices to quickly infer only one of the numerous almost equally likely topologies to reduce overall execution time. Third, more extensive searches are justified and required on datasets with intermediate difficulty. While the likelihood surface exhibits multiple locally optimal peaks in this case, a small proportion of them is significantly better. Our experimental results for the adaptive heuristic on 9,515 empirical and 5,000 simulated datasets with varying difficulty exhibit substantial speedups, especially on easy and difficult datasets (53% of total MSAs), where we observe average speedups of more than 10×. 
Further, approximately 94% of the inferred trees using the adaptive strategy are statistically indistinguishable from the trees inferred under the standard strategy (RAxML-NG).</description><identifier>ISSN: 0737-4038</identifier><identifier>EISSN: 1537-1719</identifier><identifier>DOI: 10.1093/molbev/msad227</identifier><identifier>PMID: 37804116</identifier><language>eng</language><publisher>US: Oxford University Press</publisher><subject>Machine learning ; Methods ; Phylogeny</subject><ispartof>Molecular biology and evolution, 2023-10, Vol.40 (10)</ispartof><rights>The Author(s) 2023. Published by Oxford University Press on behalf of Society for Molecular Biology and Evolution. 2023</rights><rights>COPYRIGHT 2023 Oxford University Press</rights><lds50>peer_reviewed</lds50><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed><cites>FETCH-LOGICAL-c396t-8524e958619ff6ed3585c407b4e77a9e4c60eb9ef3abb87eaf0a245efaad457a3</cites><orcidid>0000-0002-7493-3917 ; 0000-0002-4144-6709 ; 0000-0001-7394-2718 ; 0000-0003-0353-0691 ; 0000-0003-4306-3709</orcidid></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><linktopdf>$$Uhttps://www.ncbi.nlm.nih.gov/pmc/articles/PMC10584362/pdf/$$EPDF$$P50$$Gpubmedcentral$$Hfree_for_read</linktopdf><linktohtml>$$Uhttps://www.ncbi.nlm.nih.gov/pmc/articles/PMC10584362/$$EHTML$$P50$$Gpubmedcentral$$Hfree_for_read</linktohtml><link.rule.ids>230,314,725,778,782,862,883,1601,27911,27912,53778,53780</link.rule.ids></links><search><contributor>Bonatto, Sandro</contributor><creatorcontrib>Togkousidis, Anastasis</creatorcontrib><creatorcontrib>Kozlov, Oleksiy M</creatorcontrib><creatorcontrib>Haag, Julia</creatorcontrib><creatorcontrib>Höhler, Dimitri</creatorcontrib><creatorcontrib>Stamatakis, Alexandros</creatorcontrib><title>Adaptive RAxML-NG: Accelerating Phylogenetic Inference under Maximum Likelihood using 
Dataset Difficulty</title><title>Molecular biology and evolution</title><description>Abstract Phylogenetic inferences under the maximum likelihood criterion deploy heuristic tree search strategies to explore the vast search space. Depending on the input dataset, searches from different starting trees might all converge to a single tree topology. Often, though, distinct searches infer multiple topologies with large log-likelihood score differences or yield topologically highly distinct, yet almost equally likely, trees. Recently, Haag et al. introduced an approach to quantify, and implemented machine learning methods to predict, the dataset difficulty with respect to phylogenetic inference. Easy multiple sequence alignments (MSAs) exhibit a single likelihood peak on their likelihood surface, associated with a single tree topology to which most, if not all, independent searches rapidly converge. As difficulty increases, multiple locally optimal likelihood peaks emerge, yet from highly distinct topologies. To make use of this information, we introduce and implement an adaptive tree search heuristic in RAxML-NG, which modifies the thoroughness of the tree search strategy as a function of the predicted difficulty. Our adaptive strategy is based upon three observations. First, on easy datasets, searches converge rapidly and can hence be terminated at an earlier stage. Second, overanalyzing difficult datasets is hopeless, and thus it suffices to quickly infer only one of the numerous almost equally likely topologies to reduce overall execution time. Third, more extensive searches are justified and required on datasets with intermediate difficulty. While the likelihood surface exhibits multiple locally optimal peaks in this case, a small proportion of them is significantly better. 
Our experimental results for the adaptive heuristic on 9,515 empirical and 5,000 simulated datasets with varying difficulty exhibit substantial speedups, especially on easy and difficult datasets (53% of total MSAs), where we observe average speedups of more than 10×. Further, approximately 94% of the inferred trees using the adaptive strategy are statistically indistinguishable from the trees inferred under the standard strategy (RAxML-NG).</description><subject>Machine learning</subject><subject>Methods</subject><subject>Phylogeny</subject><issn>0737-4038</issn><issn>1537-1719</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2023</creationdate><recordtype>article</recordtype><sourceid>TOX</sourceid><recordid>eNqFUcFu1DAQtRCILoUr5xzhkNaO7TjhgqIWSqUtIARna-KMdw2OvbWTVffvSbWrSpzQO8xo5r2n0TxC3jJ6wWjLL8foe9xfjhmGqlLPyIpJrkqmWPucrKhaekF5c0Ze5fybUiZEXb8kZ1w1VDBWr8i2G2A3uT0WP7qHu3X59eZD0RmDHhNMLmyK79uDjxsMODlT3AaLCYPBYg4DpuIOHtw4j8Xa_UHvtjEOxZwfVdcwQcapuHbWOjP76fCavLDgM7451XPy6_Onn1dfyvW3m9urbl0a3tZT2chKYCubmrXW1jhw2UgjqOoFKgUtClNT7Fu0HPq-UQiWQiUkWoBBSAX8nHw8-u7mfsTBYJgSeL1LboR00BGc_ncT3FZv4l4zKhvB62pxeHdySPF-xjzp0eXlIx4CxjnrqlGiqivJmoV6caRuwKN2wcbF0iwYcHQmBrRumXdKSUq5XPAkMCnmnNA-HcaofgxUHwPVp0AXwfujIM67_3H_ArtCpXE</recordid><startdate>20231004</startdate><enddate>20231004</enddate><creator>Togkousidis, Anastasis</creator><creator>Kozlov, Oleksiy M</creator><creator>Haag, Julia</creator><creator>Höhler, Dimitri</creator><creator>Stamatakis, Alexandros</creator><general>Oxford University Press</general><scope>TOX</scope><scope>AAYXX</scope><scope>CITATION</scope><scope>7X8</scope><scope>5PM</scope><orcidid>https://orcid.org/0000-0002-7493-3917</orcidid><orcidid>https://orcid.org/0000-0002-4144-6709</orcidid><orcidid>https://orcid.org/0000-0001-7394-2718</orcidid><orcidid>https://orcid.org/0000-0003-0353-0691</orcidid><orcidid>https://orcid.org/0000-0003-4306-3709</orcidid></search><sort><creationdate>20231004</creationdate><title>Adaptive RAxML-NG: 
Accelerating Phylogenetic Inference under Maximum Likelihood using Dataset Difficulty</title><author>Togkousidis, Anastasis ; Kozlov, Oleksiy M ; Haag, Julia ; Höhler, Dimitri ; Stamatakis, Alexandros</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-c396t-8524e958619ff6ed3585c407b4e77a9e4c60eb9ef3abb87eaf0a245efaad457a3</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2023</creationdate><topic>Machine learning</topic><topic>Methods</topic><topic>Phylogeny</topic><toplevel>peer_reviewed</toplevel><toplevel>online_resources</toplevel><creatorcontrib>Togkousidis, Anastasis</creatorcontrib><creatorcontrib>Kozlov, Oleksiy M</creatorcontrib><creatorcontrib>Haag, Julia</creatorcontrib><creatorcontrib>Höhler, Dimitri</creatorcontrib><creatorcontrib>Stamatakis, Alexandros</creatorcontrib><collection>Oxford Journals Open Access Collection</collection><collection>CrossRef</collection><collection>MEDLINE - Academic</collection><collection>PubMed Central (Full Participant titles)</collection><jtitle>Molecular biology and evolution</jtitle></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>Togkousidis, Anastasis</au><au>Kozlov, Oleksiy M</au><au>Haag, Julia</au><au>Höhler, Dimitri</au><au>Stamatakis, Alexandros</au><au>Bonatto, Sandro</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>Adaptive RAxML-NG: Accelerating Phylogenetic Inference under Maximum Likelihood using Dataset Difficulty</atitle><jtitle>Molecular biology and evolution</jtitle><date>2023-10-04</date><risdate>2023</risdate><volume>40</volume><issue>10</issue><issn>0737-4038</issn><eissn>1537-1719</eissn><abstract>Abstract Phylogenetic inferences under the maximum likelihood criterion deploy heuristic tree search strategies to explore the vast search space. 
Depending on the input dataset, searches from different starting trees might all converge to a single tree topology. Often, though, distinct searches infer multiple topologies with large log-likelihood score differences or yield topologically highly distinct, yet almost equally likely, trees. Recently, Haag et al. introduced an approach to quantify, and implemented machine learning methods to predict, the dataset difficulty with respect to phylogenetic inference. Easy multiple sequence alignments (MSAs) exhibit a single likelihood peak on their likelihood surface, associated with a single tree topology to which most, if not all, independent searches rapidly converge. As difficulty increases, multiple locally optimal likelihood peaks emerge, yet from highly distinct topologies. To make use of this information, we introduce and implement an adaptive tree search heuristic in RAxML-NG, which modifies the thoroughness of the tree search strategy as a function of the predicted difficulty. Our adaptive strategy is based upon three observations. First, on easy datasets, searches converge rapidly and can hence be terminated at an earlier stage. Second, overanalyzing difficult datasets is hopeless, and thus it suffices to quickly infer only one of the numerous almost equally likely topologies to reduce overall execution time. Third, more extensive searches are justified and required on datasets with intermediate difficulty. While the likelihood surface exhibits multiple locally optimal peaks in this case, a small proportion of them is significantly better. Our experimental results for the adaptive heuristic on 9,515 empirical and 5,000 simulated datasets with varying difficulty exhibit substantial speedups, especially on easy and difficult datasets (53% of total MSAs), where we observe average speedups of more than 10×. 
Further, approximately 94% of the inferred trees using the adaptive strategy are statistically indistinguishable from the trees inferred under the standard strategy (RAxML-NG).</abstract><cop>US</cop><pub>Oxford University Press</pub><pmid>37804116</pmid><doi>10.1093/molbev/msad227</doi><orcidid>https://orcid.org/0000-0002-7493-3917</orcidid><orcidid>https://orcid.org/0000-0002-4144-6709</orcidid><orcidid>https://orcid.org/0000-0001-7394-2718</orcidid><orcidid>https://orcid.org/0000-0003-0353-0691</orcidid><orcidid>https://orcid.org/0000-0003-4306-3709</orcidid><oa>free_for_read</oa></addata></record>
fulltext fulltext
identifier ISSN: 0737-4038
ispartof Molecular biology and evolution, 2023-10, Vol.40 (10)
issn 0737-4038
1537-1719
language eng
recordid cdi_pubmedcentral_primary_oai_pubmedcentral_nih_gov_10584362
source DOAJ Directory of Open Access Journals; Elektronische Zeitschriftenbibliothek - Frei zugängliche E-Journals; Oxford Journals Open Access Collection; PubMed Central; Alma/SFX Local Collection; Free Full-Text Journals in Chemistry
subjects Machine learning
Methods
Phylogeny
title Adaptive RAxML-NG: Accelerating Phylogenetic Inference under Maximum Likelihood using Dataset Difficulty
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-15T22%3A12%3A19IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-gale_pubme&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Adaptive%20RAxML-NG:%20Accelerating%20Phylogenetic%20Inference%20under%20Maximum%20Likelihood%20using%20Dataset%20Difficulty&rft.jtitle=Molecular%20biology%20and%20evolution&rft.au=Togkousidis,%20Anastasis&rft.date=2023-10-04&rft.volume=40&rft.issue=10&rft.issn=0737-4038&rft.eissn=1537-1719&rft_id=info:doi/10.1093/molbev/msad227&rft_dat=%3Cgale_pubme%3EA775003535%3C/gale_pubme%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=2874262518&rft_id=info:pmid/37804116&rft_galeid=A775003535&rft_oup_id=10.1093/molbev/msad227&rfr_iscdi=true