Pathologies of Between-Groups Principal Components Analysis in Geometric Morphometrics

Good empirical applications of geometric morphometrics (GMM) typically involve several times more variables than specimens, a situation the statistician refers to as “high p/n,” where p is the count of variables and n the count of specimens. This note calls your attention to two predictable catas...
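
A minimal sketch of the first pathology named in the abstract (spelled out in full in the record's description field): under a null model of independent Gaussians with equal group sizes, projecting specimens onto the subspace spanned by the group means makes the groups look well separated once p is large relative to n. This is an illustration only, not the paper's code. The function name `bg_plane_separation`, the sample sizes, and the separation ratio are assumptions made here. An orthonormal basis obtained by Gram-Schmidt stands in for the between-group principal components; for three groups both span the same plane and differ only by a rotation within it, which leaves distances unchanged.

```python
import math
import random
from itertools import combinations


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def bg_plane_separation(n_per_group, p, n_groups=3, seed=0):
    """Return (mean distance between group centroids) / (RMS within-group
    radius), both measured in the subspace spanned by the group means."""
    rng = random.Random(seed)
    # Null model: every coordinate of every specimen is an independent
    # N(0, 1) draw, so the data contain no true group structure.
    groups = [[[rng.gauss(0.0, 1.0) for _ in range(p)]
               for _ in range(n_per_group)]
              for _ in range(n_groups)]

    # Group means, centered on their average (groups are equal-sized).
    means = [[sum(col) / n_per_group for col in zip(*g)] for g in groups]
    grand = [sum(col) / n_groups for col in zip(*means)]
    centered = [[m - c for m, c in zip(mu, grand)] for mu in means]

    # Orthonormal basis for the span of the centered means. They sum to
    # zero, so the first n_groups - 1 of them already span the subspace.
    basis = []
    for v in centered[:-1]:
        w = list(v)
        for b in basis:
            d = dot(w, b)
            w = [wi - d * bi for wi, bi in zip(w, b)]
        norm = math.sqrt(dot(w, w))
        basis.append([wi / norm for wi in w])

    def project(x):
        return [dot(x, b) for b in basis]

    centroids = [project(mu) for mu in centered]
    between = [math.dist(a, b) for a, b in combinations(centroids, 2)]

    # Squared distance of each specimen's score from its own group centroid.
    sq = [math.dist(project([xi - c for xi, c in zip(x, grand)]), centroid) ** 2
          for g, centroid in zip(groups, centroids) for x in g]
    within = math.sqrt(sum(sq) / len(sq))
    return (sum(between) / len(between)) / within


# Same null model and group sizes; only the number of variables changes.
high = bg_plane_separation(n_per_group=20, p=400)
low = bg_plane_separation(n_per_group=20, p=10)
print(f"p = 400: {high:.2f}   p = 10: {low:.2f}")
```

Under these assumptions the ratio should come out several times larger for p = 400 than for p = 10, even though neither simulated data set contains any group structure.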

Detailed description

Bibliographic details
Published in: Evolutionary biology 2019-12, Vol.46 (4), p.271-302
First author: Bookstein, Fred L.
Format: Article
Language: eng
Subjects:
Online access: Full text
container_end_page 302
container_issue 4
container_start_page 271
container_title Evolutionary biology
container_volume 46
creator Bookstein, Fred L.
description Good empirical applications of geometric morphometrics (GMM) typically involve several times more variables than specimens, a situation the statistician refers to as “high p / n ,” where p is the count of variables and n the count of specimens. This note calls your attention to two predictable catastrophic failures of one particular multivariate statistical technique, between-groups principal components analysis (bgPCA), in this high- p / n setting. The more obvious pathology is this: when applied to the patternless (null) model of p identically distributed Gaussians over groups of the same size, both bgPCA and its algebraic equivalent, partial least squares (PLS) analysis against group, necessarily generate the appearance of huge equilateral group separations that are fictitious (absent from the statistical model). When specimen counts by group vary greatly or when any group includes fewer than about ten specimens, an even worse failure of the technique obtains: the smaller the group, the more likely a bgPCA is to fictitiously identify that group as the end-member of one of its derived axes. For these two reasons, when used in GMM and other high- p / n settings the bgPCA method very often leads to invalid or insecure biological inferences. This paper demonstrates and quantifies these and other pathological outcomes both for patternless models and for models with one or two valid factors, then offers suggestions for how GMM practitioners should protect themselves against the consequences for inference of these lamentably predictable misrepresentations. The bgPCA method should never be used unskeptically—it is always untrustworthy, never authoritative—and whenever it appears in partial support of any biological inference it must be accompanied by a wide range of diagnostic plots and other challenges, many of which are presented here for the first time.
doi_str_mv 10.1007/s11692-019-09484-8
format Article
fullrecord <record><control><sourceid>gale_proqu</sourceid><recordid>TN_cdi_proquest_journals_2311218214</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><galeid>A717808025</galeid><sourcerecordid>A717808025</sourcerecordid><originalsourceid>FETCH-LOGICAL-c496t-273d1f538020f7d29d17922d668cca92e2ef9c53eb71d9d4eb4bc9c871715b693</originalsourceid><addsrcrecordid>eNp9kU1LAzEQhoMoWKt_wNOC52gm2Y_kWItWoWIP6jVss7NtZHezJluk_95oC0UQyWFIeJ7JDC8hl8CugbHiJgDkilMGijKVypTKIzICJVLKZZodk1GEgAqes1NyFsI7Y5kohByRt0U5rF3jVhZD4urkFodPxI7OvNv0IVl42xnbl00ydW3vOuyGkEy6stkGGxLbJTN0LQ7emuTJ-X69v4RzclKXTcCLfR2T1_u7l-kDnT_PHqeTOTWpygfKC1FBnQnJOKuLiqsKCsV5lefSmFJx5FgrkwlcFlCpKsVlujTKyAIKyJa5EmNytevbe_exwTDod7fxcb6guQDgIDmkB2pVNqhtV7vBl6a1wehJbCVZ_D-L1PUfVDwVttbE3Wsb338JfCcY70LwWOve27b0Ww1Mf8eid7HoGIv-iUXLKImdFCLcrdAfJv7H-gJt1Y70</addsrcrecordid><sourcetype>Aggregation Database</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>2311218214</pqid></control><display><type>article</type><title>Pathologies of Between-Groups Principal Components Analysis in Geometric Morphometrics</title><source>SpringerNature Journals</source><creator>Bookstein, Fred L.</creator><creatorcontrib>Bookstein, Fred L.</creatorcontrib><description>Good empirical applications of geometric morphometrics (GMM) typically involve several times more variables than specimens, a situation the statistician refers to as “high p / n ,” where p is the count of variables and n the count of specimens. This note calls your attention to two predictable catastrophic failures of one particular multivariate statistical technique, between-groups principal components analysis (bgPCA), in this high- p / n setting. 
The more obvious pathology is this: when applied to the patternless (null) model of p identically distributed Gaussians over groups of the same size, both bgPCA and its algebraic equivalent, partial least squares (PLS) analysis against group, necessarily generate the appearance of huge equilateral group separations that are fictitious (absent from the statistical model). When specimen counts by group vary greatly or when any group includes fewer than about ten specimens, an even worse failure of the technique obtains: the smaller the group, the more likely a bgPCA is to fictitiously identify that group as the end-member of one of its derived axes. For these two reasons, when used in GMM and other high- p / n settings the bgPCA method very often leads to invalid or insecure biological inferences. This paper demonstrates and quantifies these and other pathological outcomes both for patternless models and for models with one or two valid factors, then offers suggestions for how GMM practitioners should protect themselves against the consequences for inference of these lamentably predictable misrepresentations. 
The bgPCA method should never be used unskeptically—it is always untrustworthy, never authoritative—and whenever it appears in partial support of any biological inference it must be accompanied by a wide range of diagnostic plots and other challenges, many of which are presented here for the first time.</description><identifier>ISSN: 0071-3260</identifier><identifier>EISSN: 1934-2845</identifier><identifier>DOI: 10.1007/s11692-019-09484-8</identifier><language>eng</language><publisher>New York: Springer US</publisher><subject>Animal Genetics and Genomics ; Biomedical and Life Sciences ; Developmental Biology ; Ecology ; Evolutionary Biology ; Focal Reviews ; Human Genetics ; Life Sciences ; Mathematical models ; Morphometry ; Principal components analysis ; Statistical analysis</subject><ispartof>Evolutionary biology, 2019-12, Vol.46 (4), p.271-302</ispartof><rights>The Author(s) 2019</rights><rights>COPYRIGHT 2019 Springer</rights><rights>Evolutionary Biology is a copyright of Springer, (2019). All Rights Reserved. © 2019. This work is published under http://creativecommons.org/licenses/by/4.0/ (the “License”). 
Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.</rights><lds50>peer_reviewed</lds50><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed><citedby>FETCH-LOGICAL-c496t-273d1f538020f7d29d17922d668cca92e2ef9c53eb71d9d4eb4bc9c871715b693</citedby><cites>FETCH-LOGICAL-c496t-273d1f538020f7d29d17922d668cca92e2ef9c53eb71d9d4eb4bc9c871715b693</cites><orcidid>0000-0003-2716-8471</orcidid></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><linktopdf>$$Uhttps://link.springer.com/content/pdf/10.1007/s11692-019-09484-8$$EPDF$$P50$$Gspringer$$Hfree_for_read</linktopdf><linktohtml>$$Uhttps://link.springer.com/10.1007/s11692-019-09484-8$$EHTML$$P50$$Gspringer$$Hfree_for_read</linktohtml><link.rule.ids>314,780,784,27924,27925,41488,42557,51319</link.rule.ids></links><search><creatorcontrib>Bookstein, Fred L.</creatorcontrib><title>Pathologies of Between-Groups Principal Components Analysis in Geometric Morphometrics</title><title>Evolutionary biology</title><addtitle>Evol Biol</addtitle><description>Good empirical applications of geometric morphometrics (GMM) typically involve several times more variables than specimens, a situation the statistician refers to as “high p / n ,” where p is the count of variables and n the count of specimens. This note calls your attention to two predictable catastrophic failures of one particular multivariate statistical technique, between-groups principal components analysis (bgPCA), in this high- p / n setting. 
The more obvious pathology is this: when applied to the patternless (null) model of p identically distributed Gaussians over groups of the same size, both bgPCA and its algebraic equivalent, partial least squares (PLS) analysis against group, necessarily generate the appearance of huge equilateral group separations that are fictitious (absent from the statistical model). When specimen counts by group vary greatly or when any group includes fewer than about ten specimens, an even worse failure of the technique obtains: the smaller the group, the more likely a bgPCA is to fictitiously identify that group as the end-member of one of its derived axes. For these two reasons, when used in GMM and other high- p / n settings the bgPCA method very often leads to invalid or insecure biological inferences. This paper demonstrates and quantifies these and other pathological outcomes both for patternless models and for models with one or two valid factors, then offers suggestions for how GMM practitioners should protect themselves against the consequences for inference of these lamentably predictable misrepresentations. 
The bgPCA method should never be used unskeptically—it is always untrustworthy, never authoritative—and whenever it appears in partial support of any biological inference it must be accompanied by a wide range of diagnostic plots and other challenges, many of which are presented here for the first time.</description><subject>Animal Genetics and Genomics</subject><subject>Biomedical and Life Sciences</subject><subject>Developmental Biology</subject><subject>Ecology</subject><subject>Evolutionary Biology</subject><subject>Focal Reviews</subject><subject>Human Genetics</subject><subject>Life Sciences</subject><subject>Mathematical models</subject><subject>Morphometry</subject><subject>Principal components analysis</subject><subject>Statistical analysis</subject><issn>0071-3260</issn><issn>1934-2845</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2019</creationdate><recordtype>article</recordtype><sourceid>C6C</sourceid><sourceid>AFKRA</sourceid><sourceid>AZQEC</sourceid><sourceid>BENPR</sourceid><sourceid>CCPQU</sourceid><sourceid>DWQXO</sourceid><sourceid>GNUQQ</sourceid><recordid>eNp9kU1LAzEQhoMoWKt_wNOC52gm2Y_kWItWoWIP6jVss7NtZHezJluk_95oC0UQyWFIeJ7JDC8hl8CugbHiJgDkilMGijKVypTKIzICJVLKZZodk1GEgAqes1NyFsI7Y5kohByRt0U5rF3jVhZD4urkFodPxI7OvNv0IVl42xnbl00ydW3vOuyGkEy6stkGGxLbJTN0LQ7emuTJ-X69v4RzclKXTcCLfR2T1_u7l-kDnT_PHqeTOTWpygfKC1FBnQnJOKuLiqsKCsV5lefSmFJx5FgrkwlcFlCpKsVlujTKyAIKyJa5EmNytevbe_exwTDod7fxcb6guQDgIDmkB2pVNqhtV7vBl6a1wehJbCVZ_D-L1PUfVDwVttbE3Wsb338JfCcY70LwWOve27b0Ww1Mf8eid7HoGIv-iUXLKImdFCLcrdAfJv7H-gJt1Y70</recordid><startdate>20191201</startdate><enddate>20191201</enddate><creator>Bookstein, Fred L.</creator><general>Springer US</general><general>Springer</general><general>Springer Nature 
B.V</general><scope>C6C</scope><scope>AAYXX</scope><scope>CITATION</scope><scope>8FE</scope><scope>8FH</scope><scope>AFKRA</scope><scope>AZQEC</scope><scope>BBNVY</scope><scope>BENPR</scope><scope>BHPHI</scope><scope>CCPQU</scope><scope>DWQXO</scope><scope>GNUQQ</scope><scope>HCIFZ</scope><scope>LK8</scope><scope>M7P</scope><scope>PQEST</scope><scope>PQQKQ</scope><scope>PQUKI</scope><scope>PRINS</scope><orcidid>https://orcid.org/0000-0003-2716-8471</orcidid></search><sort><creationdate>20191201</creationdate><title>Pathologies of Between-Groups Principal Components Analysis in Geometric Morphometrics</title><author>Bookstein, Fred L.</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-c496t-273d1f538020f7d29d17922d668cca92e2ef9c53eb71d9d4eb4bc9c871715b693</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2019</creationdate><topic>Animal Genetics and Genomics</topic><topic>Biomedical and Life Sciences</topic><topic>Developmental Biology</topic><topic>Ecology</topic><topic>Evolutionary Biology</topic><topic>Focal Reviews</topic><topic>Human Genetics</topic><topic>Life Sciences</topic><topic>Mathematical models</topic><topic>Morphometry</topic><topic>Principal components analysis</topic><topic>Statistical analysis</topic><toplevel>peer_reviewed</toplevel><toplevel>online_resources</toplevel><creatorcontrib>Bookstein, Fred L.</creatorcontrib><collection>Springer Nature OA Free Journals</collection><collection>CrossRef</collection><collection>ProQuest SciTech Collection</collection><collection>ProQuest Natural Science Collection</collection><collection>ProQuest Central UK/Ireland</collection><collection>ProQuest Central Essentials</collection><collection>Biological Science Collection</collection><collection>ProQuest Central</collection><collection>Natural Science Collection</collection><collection>ProQuest One Community College</collection><collection>ProQuest Central 
Korea</collection><collection>ProQuest Central Student</collection><collection>SciTech Premium Collection</collection><collection>ProQuest Biological Science Collection</collection><collection>Biological Science Database</collection><collection>ProQuest One Academic Eastern Edition (DO NOT USE)</collection><collection>ProQuest One Academic</collection><collection>ProQuest One Academic UKI Edition</collection><collection>ProQuest Central China</collection><jtitle>Evolutionary biology</jtitle></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>Bookstein, Fred L.</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>Pathologies of Between-Groups Principal Components Analysis in Geometric Morphometrics</atitle><jtitle>Evolutionary biology</jtitle><stitle>Evol Biol</stitle><date>2019-12-01</date><risdate>2019</risdate><volume>46</volume><issue>4</issue><spage>271</spage><epage>302</epage><pages>271-302</pages><issn>0071-3260</issn><eissn>1934-2845</eissn><abstract>Good empirical applications of geometric morphometrics (GMM) typically involve several times more variables than specimens, a situation the statistician refers to as “high p / n ,” where p is the count of variables and n the count of specimens. This note calls your attention to two predictable catastrophic failures of one particular multivariate statistical technique, between-groups principal components analysis (bgPCA), in this high- p / n setting. The more obvious pathology is this: when applied to the patternless (null) model of p identically distributed Gaussians over groups of the same size, both bgPCA and its algebraic equivalent, partial least squares (PLS) analysis against group, necessarily generate the appearance of huge equilateral group separations that are fictitious (absent from the statistical model). 
When specimen counts by group vary greatly or when any group includes fewer than about ten specimens, an even worse failure of the technique obtains: the smaller the group, the more likely a bgPCA is to fictitiously identify that group as the end-member of one of its derived axes. For these two reasons, when used in GMM and other high- p / n settings the bgPCA method very often leads to invalid or insecure biological inferences. This paper demonstrates and quantifies these and other pathological outcomes both for patternless models and for models with one or two valid factors, then offers suggestions for how GMM practitioners should protect themselves against the consequences for inference of these lamentably predictable misrepresentations. The bgPCA method should never be used unskeptically—it is always untrustworthy, never authoritative—and whenever it appears in partial support of any biological inference it must be accompanied by a wide range of diagnostic plots and other challenges, many of which are presented here for the first time.</abstract><cop>New York</cop><pub>Springer US</pub><doi>10.1007/s11692-019-09484-8</doi><tpages>32</tpages><orcidid>https://orcid.org/0000-0003-2716-8471</orcidid><oa>free_for_read</oa></addata></record>
fulltext fulltext
identifier ISSN: 0071-3260
ispartof Evolutionary biology, 2019-12, Vol.46 (4), p.271-302
issn 0071-3260
1934-2845
language eng
recordid cdi_proquest_journals_2311218214
source SpringerNature Journals
subjects Animal Genetics and Genomics
Biomedical and Life Sciences
Developmental Biology
Ecology
Evolutionary Biology
Focal Reviews
Human Genetics
Life Sciences
Mathematical models
Morphometry
Principal components analysis
Statistical analysis
title Pathologies of Between-Groups Principal Components Analysis in Geometric Morphometrics
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-02T20%3A46%3A54IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-gale_proqu&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Pathologies%20of%20Between-Groups%20Principal%20Components%20Analysis%20in%20Geometric%20Morphometrics&rft.jtitle=Evolutionary%20biology&rft.au=Bookstein,%20Fred%20L.&rft.date=2019-12-01&rft.volume=46&rft.issue=4&rft.spage=271&rft.epage=302&rft.pages=271-302&rft.issn=0071-3260&rft.eissn=1934-2845&rft_id=info:doi/10.1007/s11692-019-09484-8&rft_dat=%3Cgale_proqu%3EA717808025%3C/gale_proqu%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=2311218214&rft_id=info:pmid/&rft_galeid=A717808025&rfr_iscdi=true