Testing the rogue taxa hypothesis for clustering instability

Detailed description

Bibliographic details
Published in: Journal of theoretical biology 2019-07, Vol. 472, p. 36-45
Main authors: Saunders, Amanda M., Ashlock, Daniel, Graether, Steffen P.
Format: Article
Language: English
Description
Abstract:
Highlights
• Instability in hierarchical trees measured using a novel tree distance.
• Low tree consensus due to flaws in tree building algorithm and not rogue taxa.
• Standard neighbor joining algorithm stability depends on the sample subset used.
• Our novel bubble clustering method creates more stable hierarchical trees.

There have been longstanding concerns about the stability of hierarchical clustering. A suggested explanation for this instability is the presence of “rogue taxa”, i.e. taxa whose removal from a data set can apparently restore stability. In this study, the rogue taxa hypothesis is tested by partitioning a large data set into many smaller ones and checking for rogue behavior. The checking was performed with a standard hierarchical clustering algorithm and with a novel algorithm designed to have greater stability. It was found that rogue taxa cannot reasonably be said to exist because the status of being a rogue taxon depends on the data partition in which the taxon is embedded. In addition to the choice of data used, the choice of algorithm and algorithm parameters can have a large effect on the degree to which a taxon appears rogue. Instability in hierarchical clustering can be increased by problematic data points, but the status of data points being problematic depends not on their biological antecedents, but on their position in the local geometry of the data. The results of this study strongly suggest that instability in traditional hierarchical clustering routines is primarily a problem with the algorithm design.
ISSN: 0022-5193, 1095-8541
DOI: 10.1016/j.jtbi.2019.04.002
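
The partition-and-check procedure described in the abstract can be illustrated with a small sketch. This is not the authors' method: the record does not specify their tree distance or their bubble clustering algorithm, so the sketch substitutes average-linkage clustering for the tree builder and a normalized clade symmetric difference for the tree distance, and it runs on synthetic 2-D points. All function names (`upgma_clades`, `instability`, `rogue_effect`) and parameter values are hypothetical. The idea shown is only that one focal taxon is embedded in several different subsets, and its apparent "rogue" effect is measured separately in each.

```python
import math
import random
from itertools import combinations


def upgma_clades(points):
    """Average-linkage clustering (a stand-in tree builder).

    `points` maps a taxon label to a coordinate tuple. Returns the set of
    clades of the resulting rooted tree, each clade a frozenset of labels.
    """
    clusters = [frozenset([label]) for label in points]
    clades = set()

    def avg_dist(a, b):
        total = sum(math.dist(points[i], points[j]) for i in a for j in b)
        return total / (len(a) * len(b))

    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2), key=lambda pair: avg_dist(*pair))
        merged = a | b
        clusters = [c for c in clusters if c not in (a, b)] + [merged]
        clades.add(merged)
    return clades


def instability(points, rng, trials=20, noise=0.05):
    """Mean normalized clade disagreement between the reference tree and
    trees rebuilt from noise-perturbed copies of the same points.

    Assumption: small Gaussian jitter is used as the source of instability.
    """
    reference = upgma_clades(points)
    max_diff = 2 * (len(points) - 2)  # non-root clades in both trees
    total = 0.0
    for _ in range(trials):
        jittered = {k: tuple(x + rng.gauss(0, noise) for x in p)
                    for k, p in points.items()}
        total += len(reference ^ upgma_clades(jittered)) / max_diff
    return total / trials


def rogue_effect(points, taxon, rng):
    """Drop in instability when `taxon` is removed from this subset.

    A positive value means the taxon looks rogue in this particular subset.
    """
    reduced = {k: p for k, p in points.items() if k != taxon}
    return instability(points, rng) - instability(reduced, rng)


# Synthetic data: 30 hypothetical taxa placed at random in the unit square.
rng = random.Random(0)
data = {f"t{i}": (rng.random(), rng.random()) for i in range(30)}

# Embed one focal taxon in several different subsets and compare its effect.
focal = "t0"
others = [k for k in data if k != focal]
effects = []
for _ in range(5):
    subset = [focal] + rng.sample(others, 9)
    part = {k: data[k] for k in subset}
    effects.append(rogue_effect(part, focal, rng))

print(effects)
```

If the printed values vary in sign or size across subsets, the focal taxon's apparent rogue status in this toy setting depends on the subset it is embedded in, which is the kind of dependence the abstract reports. The toy data say nothing about the paper's actual results.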