The phylogenetic Kantorovich-Rubinstein metric for environmental sequence samples

It is now common to survey microbial communities by sequencing nucleic acid material extracted in bulk from a given environment. Comparative methods are needed that indicate the extent to which two communities differ given data sets of this type. UniFrac, which gives a somewhat ad hoc phylogenetics-...

Ausführliche Beschreibung

Gespeichert in:
Bibliographische Detailangaben
Veröffentlicht in:Journal of the Royal Statistical Society. Series B, Statistical methodology Statistical methodology, 2012-06, Vol.74 (3), p.569-592
Hauptverfasser: Evans, Steven N., Matsen, Frederick A.
Format: Artikel
Sprache:eng
Schlagworte:
Online-Zugang:Volltext
Tags: Tag hinzufügen
Keine Tags, Fügen Sie den ersten Tag hinzu!
Beschreibung
Zusammenfassung:It is now common to survey microbial communities by sequencing nucleic acid material extracted in bulk from a given environment. Comparative methods are needed that indicate the extent to which two communities differ given data sets of this type. UniFrac, which gives a somewhat ad hoc phylogenetics-based distance between two communities, is one of the most commonly used tools for these analyses. We provide a foundation for such methods by establishing that, if we equate a metagenomic sample with its empirical distribution on a reference phylogenetic tree, then the weighted UniFrac distance between two samples is just the classical Kantorovich-Rubinstein, or earth mover's, distance between the corresponding empirical distributions. We demonstrate that this Kantorovich-Rubinstein distance and extensions incorporating uncertainty in the sample locations can be written as a readily computable integral over the tree, we develop L p Zolotarev-type generalizations of the metric, and we show how the p-value of the resulting natural permutation test of the null hypothesis 'no difference between two communities' can be approximated by using a Gaussian process functional. We relate the L²-case to an analysis-of-variance type of decomposition, finding that the distribution of its associated Gaussian functional is that of a computable linear combination of independent ${X_1}^2$ random variables.
ISSN:1369-7412
1467-9868
DOI:10.1111/j.1467-9868.2011.01018.x