Cross-validation strategies for data with temporal, spatial, hierarchical, or phylogenetic structure

Ecological data often show temporal, spatial, hierarchical (random effects), or phylogenetic structure. Modern statistical approaches are increasingly accounting for such dependencies. However, when performing cross-validation, these structures are regularly ignored, resulting in serious underestima...

Bibliographic details
Published in: Ecography (Copenhagen) 2017-08, Vol.40 (8), p.913-929
Main authors: Roberts, David R., Bahn, Volker, Ciuti, Simone, Boyce, Mark S., Elith, Jane, Guillera‐Arroita, Gurutzeta, Hauenstein, Severin, Lahoz‐Monfort, José J., Schröder, Boris, Thuiller, Wilfried, Warton, David I., Wintle, Brendan A., Hartig, Florian, Dormann, Carsten F.
Format: Article
Language: eng
Online access: Full text
container_end_page 929
container_issue 8
container_start_page 913
container_title Ecography (Copenhagen)
container_volume 40
creator Roberts, David R.
Bahn, Volker
Ciuti, Simone
Boyce, Mark S.
Elith, Jane
Guillera‐Arroita, Gurutzeta
Hauenstein, Severin
Lahoz‐Monfort, José J.
Schröder, Boris
Thuiller, Wilfried
Warton, David I.
Wintle, Brendan A.
Hartig, Florian
Dormann, Carsten F.
description Ecological data often show temporal, spatial, hierarchical (random effects), or phylogenetic structure. Modern statistical approaches are increasingly accounting for such dependencies. However, when performing cross-validation, these structures are regularly ignored, resulting in serious underestimation of predictive error. One cause for the poor performance of uncorrected (random) cross-validation, noted often by modellers, are dependence structures in the data that persist as dependence structures in model residuals, violating the assumption of independence. Even more concerning, because often overlooked, is that structured data also provides ample opportunity for overfitting with non-causal predictors. This problem can persist even if remedies such as autoregressive models, generalized least squares, or mixed models are used. Block cross-validation, where data are split strategically rather than randomly, can address these issues. However, the blocking strategy must be carefully considered. Blocking in space, time, random effects or phylogenetic distance, while accounting for dependencies in the data, may also unwittingly induce extrapolations by restricting the ranges or combinations of predictor variables available for model training, thus overestimating interpolation errors. On the other hand, deliberate blocking in predictor space may also improve error estimates when extrapolation is the modelling goal. Here, we review the ecological literature on non-random and blocked cross-validation approaches. We also provide a series of simulations and case studies, in which we show that, for all instances tested, block cross-validation is nearly universally more appropriate than random cross-validation if the goal is predicting to new data or predictor space, or for selecting causal predictors. 
We recommend that block cross-validation be used wherever dependence structures exist in a dataset, even if no correlation structure is visible in the fitted model residuals, or if the fitted models account for such correlations.
doi_str_mv 10.1111/ecog.02881
format Article
fullrecord <record><control><sourceid>jstor_proqu</sourceid><recordid>TN_cdi_proquest_journals_1925803804</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><jstor_id>90011350</jstor_id><sourcerecordid>90011350</sourcerecordid><originalsourceid>FETCH-LOGICAL-c3991-1f4faa4c4910d0cb262fe1f3688d2d832fb2de1be640b9a7afb37cddc94afe7b3</originalsourceid><addsrcrecordid>eNp9kMtLw0AQxhdRsFYv3oWANzF1dvPaPUqoVSj0ouew2UezJe3G3Y2l_72JEY_OZR78vhnmQ-gWwwIP8aSE3S6AUIrP0AznADFktDhHM2CQx0XG4BJdeb8DwITldIZk6az38RdvjeTB2EPkg-NBbY3ykbYuGqY8OprQREHtO-t4-xj5bkDHojHKcScaI8ZuoLvm1NqtOqhgxLipF6F36hpdaN56dfOb5-jjZflevsbrzeqtfF7HImEMx1inmvNUpAyDBFGTnGiFdZJTKomkCdE1kQrXKk-hZrzguk4KIaVgKdeqqJM5up_2ds5-9sqHamd7dxhOVpiRjEJCIR2oh4kS4-tO6apzZs_dqcJQjS5Wo4vVj4sDDBN8NK06_UNWy3Kz-pXcTZKdD9b9SdjgOU4ySL4BUvSAmg</addsrcrecordid><sourcetype>Aggregation Database</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>1925803804</pqid></control><display><type>article</type><title>Cross-validation strategies for data with temporal, spatial, hierarchical, or phylogenetic structure</title><source>Wiley Online Library Journals Frontfile Complete</source><source>Elektronische Zeitschriftenbibliothek - Frei zugängliche E-Journals</source><creator>Roberts, David R. ; Bahn, Volker ; Ciuti, Simone ; Boyce, Mark S. ; Elith, Jane ; Guillera‐Arroita, Gurutzeta ; Hauenstein, Severin ; Lahoz‐Monfort, José J. ; Schröder, Boris ; Thuiller, Wilfried ; Warton, David I. ; Wintle, Brendan A. ; Hartig, Florian ; Dormann, Carsten F.</creator><creatorcontrib>Roberts, David R. ; Bahn, Volker ; Ciuti, Simone ; Boyce, Mark S. ; Elith, Jane ; Guillera‐Arroita, Gurutzeta ; Hauenstein, Severin ; Lahoz‐Monfort, José J. ; Schröder, Boris ; Thuiller, Wilfried ; Warton, David I. ; Wintle, Brendan A. ; Hartig, Florian ; Dormann, Carsten F.</creatorcontrib><description>Ecological data often show temporal, spatial, hierarchical (random effects), or phylogenetic structure. 
Modern statistical approaches are increasingly accounting for such dependencies. However, when performing cross-validation, these structures are regularly ignored, resulting in serious underestimation of predictive error. One cause for the poor performance of uncorrected (random) cross-validation, noted often by modellers, are dependence structures in the data that persist as dependence structures in model residuals, violating the assumption of independence. Even more concerning, because often overlooked, is that structured data also provides ample opportunity for overfitting with non-causal predictors. This problem can persist even if remedies such as autoregressive models, generalized least squares, or mixed models are used. Block cross-validation, where data are split strategically rather than randomly, can address these issues. However, the blocking strategy must be carefully considered. Blocking in space, time, random effects or phylogenetic distance, while accounting for dependencies in the data, may also unwittingly induce extrapolations by restricting the ranges or combinations of predictor variables available for model training, thus overestimating interpolation errors. On the other hand, deliberate blocking in predictor space may also improve error estimates when extrapolation is the modelling goal. Here, we review the ecological literature on non-random and blocked cross-validation approaches. We also provide a series of simulations and case studies, in which we show that, for all instances tested, block cross-validation is nearly universally more appropriate than random cross-validation if the goal is predicting to new data or predictor space, or for selecting causal predictors. 
We recommend that block cross-validation be used wherever dependence structures exist in a dataset, even if no correlation structure is visible in the fitted model residuals, or if the fitted models account for such correlations.</description><identifier>ISSN: 0906-7590</identifier><identifier>EISSN: 1600-0587</identifier><identifier>DOI: 10.1111/ecog.02881</identifier><language>eng</language><publisher>Oxford, UK: Nordic Society Oikos</publisher><subject>Autoregressive models ; Autoregressive processes ; Blocking ; Case studies ; Computer simulation ; Correlation ; Ecological effects ; Extrapolation ; Interpolation ; Least squares method ; Literature reviews ; Mathematical models ; Phylogenetics ; Phylogeny ; Review &amp; synthesis ; Structural hierarchy</subject><ispartof>Ecography (Copenhagen), 2017-08, Vol.40 (8), p.913-929</ispartof><rights>2016 Nordic Society Oikos</rights><rights>2016 The Authors</rights><rights>Ecography © 2017 Nordic Society Oikos</rights><lds50>peer_reviewed</lds50><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed><citedby>FETCH-LOGICAL-c3991-1f4faa4c4910d0cb262fe1f3688d2d832fb2de1be640b9a7afb37cddc94afe7b3</citedby><cites>FETCH-LOGICAL-c3991-1f4faa4c4910d0cb262fe1f3688d2d832fb2de1be640b9a7afb37cddc94afe7b3</cites></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><linktopdf>$$Uhttps://onlinelibrary.wiley.com/doi/pdf/10.1111%2Fecog.02881$$EPDF$$P50$$Gwiley$$H</linktopdf><linktohtml>$$Uhttps://onlinelibrary.wiley.com/doi/full/10.1111%2Fecog.02881$$EHTML$$P50$$Gwiley$$H</linktohtml><link.rule.ids>314,776,780,1411,27901,27902,45550,45551</link.rule.ids></links><search><creatorcontrib>Roberts, David R.</creatorcontrib><creatorcontrib>Bahn, Volker</creatorcontrib><creatorcontrib>Ciuti, Simone</creatorcontrib><creatorcontrib>Boyce, Mark S.</creatorcontrib><creatorcontrib>Elith, 
Jane</creatorcontrib><creatorcontrib>Guillera‐Arroita, Gurutzeta</creatorcontrib><creatorcontrib>Hauenstein, Severin</creatorcontrib><creatorcontrib>Lahoz‐Monfort, José J.</creatorcontrib><creatorcontrib>Schröder, Boris</creatorcontrib><creatorcontrib>Thuiller, Wilfried</creatorcontrib><creatorcontrib>Warton, David I.</creatorcontrib><creatorcontrib>Wintle, Brendan A.</creatorcontrib><creatorcontrib>Hartig, Florian</creatorcontrib><creatorcontrib>Dormann, Carsten F.</creatorcontrib><title>Cross-validation strategies for data with temporal, spatial, hierarchical, or phylogenetic structure</title><title>Ecography (Copenhagen)</title><description>Ecological data often show temporal, spatial, hierarchical (random effects), or phylogenetic structure. Modern statistical approaches are increasingly accounting for such dependencies. However, when performing cross-validation, these structures are regularly ignored, resulting in serious underestimation of predictive error. One cause for the poor performance of uncorrected (random) cross-validation, noted often by modellers, are dependence structures in the data that persist as dependence structures in model residuals, violating the assumption of independence. Even more concerning, because often overlooked, is that structured data also provides ample opportunity for overfitting with non-causal predictors. This problem can persist even if remedies such as autoregressive models, generalized least squares, or mixed models are used. Block cross-validation, where data are split strategically rather than randomly, can address these issues. However, the blocking strategy must be carefully considered. Blocking in space, time, random effects or phylogenetic distance, while accounting for dependencies in the data, may also unwittingly induce extrapolations by restricting the ranges or combinations of predictor variables available for model training, thus overestimating interpolation errors. 
On the other hand, deliberate blocking in predictor space may also improve error estimates when extrapolation is the modelling goal. Here, we review the ecological literature on non-random and blocked cross-validation approaches. We also provide a series of simulations and case studies, in which we show that, for all instances tested, block cross-validation is nearly universally more appropriate than random cross-validation if the goal is predicting to new data or predictor space, or for selecting causal predictors. We recommend that block cross-validation be used wherever dependence structures exist in a dataset, even if no correlation structure is visible in the fitted model residuals, or if the fitted models account for such correlations.</description><subject>Autoregressive models</subject><subject>Autoregressive processes</subject><subject>Blocking</subject><subject>Case studies</subject><subject>Computer simulation</subject><subject>Correlation</subject><subject>Ecological effects</subject><subject>Extrapolation</subject><subject>Interpolation</subject><subject>Least squares method</subject><subject>Literature reviews</subject><subject>Mathematical models</subject><subject>Phylogenetics</subject><subject>Phylogeny</subject><subject>Review &amp; synthesis</subject><subject>Structural 
hierarchy</subject><issn>0906-7590</issn><issn>1600-0587</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2017</creationdate><recordtype>article</recordtype><recordid>eNp9kMtLw0AQxhdRsFYv3oWANzF1dvPaPUqoVSj0ouew2UezJe3G3Y2l_72JEY_OZR78vhnmQ-gWwwIP8aSE3S6AUIrP0AznADFktDhHM2CQx0XG4BJdeb8DwITldIZk6az38RdvjeTB2EPkg-NBbY3ykbYuGqY8OprQREHtO-t4-xj5bkDHojHKcScaI8ZuoLvm1NqtOqhgxLipF6F36hpdaN56dfOb5-jjZflevsbrzeqtfF7HImEMx1inmvNUpAyDBFGTnGiFdZJTKomkCdE1kQrXKk-hZrzguk4KIaVgKdeqqJM5up_2ds5-9sqHamd7dxhOVpiRjEJCIR2oh4kS4-tO6apzZs_dqcJQjS5Wo4vVj4sDDBN8NK06_UNWy3Kz-pXcTZKdD9b9SdjgOU4ySL4BUvSAmg</recordid><startdate>20170801</startdate><enddate>20170801</enddate><creator>Roberts, David R.</creator><creator>Bahn, Volker</creator><creator>Ciuti, Simone</creator><creator>Boyce, Mark S.</creator><creator>Elith, Jane</creator><creator>Guillera‐Arroita, Gurutzeta</creator><creator>Hauenstein, Severin</creator><creator>Lahoz‐Monfort, José J.</creator><creator>Schröder, Boris</creator><creator>Thuiller, Wilfried</creator><creator>Warton, David I.</creator><creator>Wintle, Brendan A.</creator><creator>Hartig, Florian</creator><creator>Dormann, Carsten F.</creator><general>Nordic Society Oikos</general><general>Blackwell Publishing Ltd</general><general>John Wiley &amp; Sons, Inc</general><scope>AAYXX</scope><scope>CITATION</scope><scope>7SN</scope><scope>7SS</scope><scope>C1K</scope></search><sort><creationdate>20170801</creationdate><title>Cross-validation strategies for data with temporal, spatial, hierarchical, or phylogenetic structure</title><author>Roberts, David R. ; Bahn, Volker ; Ciuti, Simone ; Boyce, Mark S. ; Elith, Jane ; Guillera‐Arroita, Gurutzeta ; Hauenstein, Severin ; Lahoz‐Monfort, José J. ; Schröder, Boris ; Thuiller, Wilfried ; Warton, David I. ; Wintle, Brendan A. 
; Hartig, Florian ; Dormann, Carsten F.</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-c3991-1f4faa4c4910d0cb262fe1f3688d2d832fb2de1be640b9a7afb37cddc94afe7b3</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2017</creationdate><topic>Autoregressive models</topic><topic>Autoregressive processes</topic><topic>Blocking</topic><topic>Case studies</topic><topic>Computer simulation</topic><topic>Correlation</topic><topic>Ecological effects</topic><topic>Extrapolation</topic><topic>Interpolation</topic><topic>Least squares method</topic><topic>Literature reviews</topic><topic>Mathematical models</topic><topic>Phylogenetics</topic><topic>Phylogeny</topic><topic>Review &amp; synthesis</topic><topic>Structural hierarchy</topic><toplevel>peer_reviewed</toplevel><toplevel>online_resources</toplevel><creatorcontrib>Roberts, David R.</creatorcontrib><creatorcontrib>Bahn, Volker</creatorcontrib><creatorcontrib>Ciuti, Simone</creatorcontrib><creatorcontrib>Boyce, Mark S.</creatorcontrib><creatorcontrib>Elith, Jane</creatorcontrib><creatorcontrib>Guillera‐Arroita, Gurutzeta</creatorcontrib><creatorcontrib>Hauenstein, Severin</creatorcontrib><creatorcontrib>Lahoz‐Monfort, José J.</creatorcontrib><creatorcontrib>Schröder, Boris</creatorcontrib><creatorcontrib>Thuiller, Wilfried</creatorcontrib><creatorcontrib>Warton, David I.</creatorcontrib><creatorcontrib>Wintle, Brendan A.</creatorcontrib><creatorcontrib>Hartig, Florian</creatorcontrib><creatorcontrib>Dormann, Carsten F.</creatorcontrib><collection>CrossRef</collection><collection>Ecology Abstracts</collection><collection>Entomology Abstracts (Full archive)</collection><collection>Environmental Sciences and Pollution Management</collection><jtitle>Ecography (Copenhagen)</jtitle></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>Roberts, David R.</au><au>Bahn, 
Volker</au><au>Ciuti, Simone</au><au>Boyce, Mark S.</au><au>Elith, Jane</au><au>Guillera‐Arroita, Gurutzeta</au><au>Hauenstein, Severin</au><au>Lahoz‐Monfort, José J.</au><au>Schröder, Boris</au><au>Thuiller, Wilfried</au><au>Warton, David I.</au><au>Wintle, Brendan A.</au><au>Hartig, Florian</au><au>Dormann, Carsten F.</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>Cross-validation strategies for data with temporal, spatial, hierarchical, or phylogenetic structure</atitle><jtitle>Ecography (Copenhagen)</jtitle><date>2017-08-01</date><risdate>2017</risdate><volume>40</volume><issue>8</issue><spage>913</spage><epage>929</epage><pages>913-929</pages><issn>0906-7590</issn><eissn>1600-0587</eissn><abstract>Ecological data often show temporal, spatial, hierarchical (random effects), or phylogenetic structure. Modern statistical approaches are increasingly accounting for such dependencies. However, when performing cross-validation, these structures are regularly ignored, resulting in serious underestimation of predictive error. One cause for the poor performance of uncorrected (random) cross-validation, noted often by modellers, are dependence structures in the data that persist as dependence structures in model residuals, violating the assumption of independence. Even more concerning, because often overlooked, is that structured data also provides ample opportunity for overfitting with non-causal predictors. This problem can persist even if remedies such as autoregressive models, generalized least squares, or mixed models are used. Block cross-validation, where data are split strategically rather than randomly, can address these issues. However, the blocking strategy must be carefully considered. 
Blocking in space, time, random effects or phylogenetic distance, while accounting for dependencies in the data, may also unwittingly induce extrapolations by restricting the ranges or combinations of predictor variables available for model training, thus overestimating interpolation errors. On the other hand, deliberate blocking in predictor space may also improve error estimates when extrapolation is the modelling goal. Here, we review the ecological literature on non-random and blocked cross-validation approaches. We also provide a series of simulations and case studies, in which we show that, for all instances tested, block cross-validation is nearly universally more appropriate than random cross-validation if the goal is predicting to new data or predictor space, or for selecting causal predictors. We recommend that block cross-validation be used wherever dependence structures exist in a dataset, even if no correlation structure is visible in the fitted model residuals, or if the fitted models account for such correlations.</abstract><cop>Oxford, UK</cop><pub>Nordic Society Oikos</pub><doi>10.1111/ecog.02881</doi><tpages>17</tpages><oa>free_for_read</oa></addata></record>
fulltext fulltext
identifier ISSN: 0906-7590
ispartof Ecography (Copenhagen), 2017-08, Vol.40 (8), p.913-929
issn 0906-7590
1600-0587
language eng
recordid cdi_proquest_journals_1925803804
source Wiley Online Library Journals Frontfile Complete; Elektronische Zeitschriftenbibliothek - Frei zugängliche E-Journals
subjects Autoregressive models
Autoregressive processes
Blocking
Case studies
Computer simulation
Correlation
Ecological effects
Extrapolation
Interpolation
Least squares method
Literature reviews
Mathematical models
Phylogenetics
Phylogeny
Review & synthesis
Structural hierarchy
title Cross-validation strategies for data with temporal, spatial, hierarchical, or phylogenetic structure
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-02-21T17%3A17%3A57IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-jstor_proqu&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Cross-validation%20strategies%20for%20data%20with%20temporal,%20spatial,%20hierarchical,%20or%20phylogenetic%20structure&rft.jtitle=Ecography%20(Copenhagen)&rft.au=Roberts,%20David%20R.&rft.date=2017-08-01&rft.volume=40&rft.issue=8&rft.spage=913&rft.epage=929&rft.pages=913-929&rft.issn=0906-7590&rft.eissn=1600-0587&rft_id=info:doi/10.1111/ecog.02881&rft_dat=%3Cjstor_proqu%3E90011350%3C/jstor_proqu%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=1925803804&rft_id=info:pmid/&rft_jstor_id=90011350&rfr_iscdi=true