Testing the rogue taxa hypothesis for clustering instability
Published in: | Journal of theoretical biology 2019-07, Vol.472, p.36-45 |
---|---|
Main authors: | , , |
Format: | Article |
Language: | eng |
Summary: | Highlights: • Instability in hierarchical trees is measured using a novel tree distance. • Low tree consensus is due to flaws in the tree-building algorithm, not to rogue taxa. • The stability of the standard neighbor joining algorithm depends on the sample subset used. • Our novel bubble clustering method creates more stable hierarchical trees.
There have been longstanding concerns about the stability of hierarchical clustering. A suggested explanation for this instability is the presence of “rogue taxa”, i.e. taxa whose removal from a data set can apparently restore stability. In this study, the rogue taxa hypothesis is tested by partitioning a large data set into many smaller ones and checking for rogue behavior. The checking was performed with a standard hierarchical clustering algorithm and with a novel algorithm designed to have greater stability. It was found that rogue taxa cannot reasonably be said to exist because the status of being a rogue taxon depends on the data partition in which the taxon is embedded. In addition to the choice of data used, the choice of algorithm and algorithm parameters can have a large effect on the degree to which a taxon appears rogue. Instability in hierarchical clustering can be increased by problematic data points, but the status of data points being problematic depends not on their biological antecedents, but on their position in the local geometry of the data. The results of this study strongly suggest that instability in traditional hierarchical clustering routines is primarily a problem with the algorithm design. |
---|---|
ISSN: | 0022-5193 1095-8541 |
DOI: | 10.1016/j.jtbi.2019.04.002 |
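
The summary's central claim, that whether a taxon looks rogue depends on the data partition it sits in, can be illustrated with a toy sketch. Everything here is an assumption made for illustration and is not taken from the article: the clustering is plain average linkage (UPGMA) rather than neighbor joining or bubble clustering, the tree distance is a Robinson-Foulds-style count of differing clades rather than the authors' novel distance, and the data, labels, and function names (`upgma_clades`, `prune`, `removal_effect`) are hypothetical.

```python
from itertools import combinations


def upgma_clades(points):
    """Average-linkage clustering; returns the set of non-trivial clades.

    `points` maps a taxon label to a coordinate tuple. Each clade is a
    frozenset of labels; singletons and the full leaf set are omitted.
    """
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(points[a], points[b])) ** 0.5

    clusters = [frozenset([label]) for label in sorted(points)]
    clades = set()
    while len(clusters) > 1:
        # Pick the pair of clusters with the smallest mean pairwise distance.
        best = min(
            combinations(clusters, 2),
            key=lambda pair: sum(dist(a, b) for a in pair[0] for b in pair[1])
            / (len(pair[0]) * len(pair[1])),
        )
        merged = best[0] | best[1]
        clusters = [c for c in clusters if c not in best] + [merged]
        if len(clusters) > 1:  # skip the root clade containing every leaf
            clades.add(merged)
    return clades


def prune(clades, taxon, leaves):
    """Drop `taxon` from every clade and discard clades that become trivial."""
    n_remaining = len(leaves - {taxon})
    pruned = {c - {taxon} for c in clades}
    return {c for c in pruned if 1 < len(c) < n_remaining}


def removal_effect(points, taxon):
    """Number of clades that differ between two trees on the other taxa:
    one built with `taxon` and then pruned, one built without it."""
    built_with = prune(upgma_clades(points), taxon, frozenset(points))
    built_without = upgma_clades(
        {label: xy for label, xy in points.items() if label != taxon}
    )
    return len(built_with ^ built_without)


# Toy one-dimensional data (hypothetical, not from the study).
DATA = {"a": (0.0,), "b": (4.0,), "c": (7.0,), "e": (20.0,), "x": (9.5,)}


def subset(labels):
    return {label: DATA[label] for label in labels}


effect_1 = removal_effect(subset("abcx"), "x")  # x draws c away from b
effect_2 = removal_effect(subset("abex"), "x")  # the (a, b) pair is unaffected
print(effect_1, effect_2)  # prints: 2 0
```

In this toy setting the same point `x` rearranges the tree in one subset and leaves it untouched in the other, so any "rogue" label attached to it would reflect the local geometry of the subset and not a property of the point itself.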