Pathologies of Between-Groups Principal Components Analysis in Geometric Morphometrics
Published in: | Evolutionary biology 2019-12, Vol.46 (4), p.271-302 |
---|---|
Main author: | |
Format: | Article |
Language: | English |
Subjects: | |
Online access: | Full text |
Abstract: | Good empirical applications of geometric morphometrics (GMM) typically involve several times more variables than specimens, a situation the statistician refers to as “high p/n,” where p is the count of variables and n the count of specimens. This note calls your attention to two predictable catastrophic failures of one particular multivariate statistical technique, between-groups principal components analysis (bgPCA), in this high-p/n setting. The more obvious pathology is this: when applied to the patternless (null) model of p identically distributed Gaussians over groups of the same size, both bgPCA and its algebraic equivalent, partial least squares (PLS) analysis against group, necessarily generate the appearance of huge equilateral group separations that are fictitious (absent from the statistical model). When specimen counts by group vary greatly or when any group includes fewer than about ten specimens, an even worse failure of the technique obtains: the smaller the group, the more likely a bgPCA is to fictitiously identify that group as the end-member of one of its derived axes. For these two reasons, when used in GMM and other high-p/n settings the bgPCA method very often leads to invalid or insecure biological inferences. This paper demonstrates and quantifies these and other pathological outcomes both for patternless models and for models with one or two valid factors, then offers suggestions for how GMM practitioners should protect themselves against the consequences for inference of these lamentably predictable misrepresentations. The bgPCA method should never be used unskeptically (it is always untrustworthy, never authoritative), and whenever it appears in partial support of any biological inference it must be accompanied by a wide range of diagnostic plots and other challenges, many of which are presented here for the first time. |
---|---|
ISSN: | 0071-3260 1934-2845 |
DOI: | 10.1007/s11692-019-09484-8 |
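
The first pathology named in the abstract can be illustrated with a small simulation. The sketch below is not taken from the article: the helper names (`bgpca_scores`, `separation_ratio`, `null_ratio`), the group sizes, and the two values of p are all assumptions chosen for illustration. It draws pure Gaussian noise with no group structure, assigns specimens to three equal-sized groups, projects them onto the principal axes of the group means (one common way to compute a bgPCA), and compares the between-group to within-group spread of the resulting scores at low and high p/n.

```python
import numpy as np


def bgpca_scores(X, labels):
    """Project specimens onto the principal axes of the group means.

    Assumption: unweighted group means, which matches the weighted
    version when all groups have the same size, as they do below.
    """
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    Xc = X - X.mean(axis=0)                      # center on the grand mean
    groups = np.unique(labels)
    means = np.vstack([Xc[labels == g].mean(axis=0) for g in groups])
    means = means - means.mean(axis=0)           # center the group means
    # Rows of Vt are the principal axes of the centered group means.
    _, _, Vt = np.linalg.svd(means, full_matrices=False)
    k = len(groups) - 1                          # at most g - 1 bgPCs
    return Xc @ Vt[:k].T


def separation_ratio(scores, labels):
    """Between-group sum of squares divided by within-group sum of squares."""
    labels = np.asarray(labels)
    grand = scores.mean(axis=0)
    between = 0.0
    within = 0.0
    for g in np.unique(labels):
        block = scores[labels == g]
        centroid = block.mean(axis=0)
        between += len(block) * np.sum((centroid - grand) ** 2)
        within += np.sum((block - centroid) ** 2)
    return between / within


rng = np.random.default_rng(0)
n_groups, n_per_group = 3, 10                    # hypothetical design
labels = np.repeat(np.arange(n_groups), n_per_group)


def null_ratio(p):
    """Separation ratio of bgPCA scores for patternless Gaussian data."""
    X = rng.standard_normal((n_groups * n_per_group, p))
    return separation_ratio(bgpca_scores(X, labels), labels)


ratio_low = null_ratio(5)       # p/n well below 1
ratio_high = null_ratio(3000)   # p/n = 100

print(f"p = 5:    between/within = {ratio_low:.2f}")
print(f"p = 3000: between/within = {ratio_high:.2f}")
```

Under these assumptions the data contain no group signal at all, yet the high-p/n run should show the groups as far more tightly clustered and widely separated than the low-p/n run. Plotting the two columns of `bgpca_scores` for the high-p case would be expected to show the roughly equilateral arrangement of group clusters that the abstract describes.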