Spatial thinning and class balancing: Key choices lead to variation in the performance of species distribution models with citizen science data

Spatial biases are a common feature of presence–absence data from citizen scientists. Spatial thinning can mitigate errors in species distribution models (SDMs) that use these data. When detections or non‐detections are rare, however, SDMs may suffer from class imbalance or low sample size of the mi...

Detailed description

Saved in:
Bibliographic details
Published in: Methods in ecology and evolution 2021-02, Vol.12 (2), p.216-226
Main authors: Steen, Valerie A., Tingley, Morgan W., Paton, Peter W. C., Elphick, Chris S., McPherson, Jana
Format: Article
Language: eng
Subjects:
Online access: Full text
container_end_page 226
container_issue 2
container_start_page 216
container_title Methods in ecology and evolution
container_volume 12
creator Steen, Valerie A.
Tingley, Morgan W.
Paton, Peter W. C.
Elphick, Chris S.
McPherson, Jana
description Spatial biases are a common feature of presence–absence data from citizen scientists. Spatial thinning can mitigate errors in species distribution models (SDMs) that use these data. When detections or non‐detections are rare, however, SDMs may suffer from class imbalance or low sample size of the minority (i.e. rarer) class. Poor predictions can result, the severity of which may vary by modelling technique. To explore the consequences of spatial bias and class imbalance in presence–absence data, we used eBird citizen science data for 102 bird species from the northeastern USA to compare spatial thinning, class balancing and majority‐only thinning (i.e. retaining all samples of the minority class). We created SDMs using two parametric or semi‐parametric techniques (generalized linear models and generalized additive models) and two machine learning techniques (random forest and boosted regression trees). We tested the predictive abilities of these SDMs using an independent and systematically collected reference dataset with a combination of discrimination (area under the receiver operator characteristic curve; true skill statistic; area under the precision‐recall curve) and calibration (Brier score; Cohen's kappa) metrics. We found large variation in SDM performance depending on thinning and balancing decisions. Across all species, there was no single best approach, with the optimal choice of thinning and/or balancing depending on modelling technique, performance metric and the baseline sample prevalence of species in the data. Spatially thinning all the data was often a poor approach, especially for species with baseline sample prevalence <0.1. For most of these rare species, balancing classes improved model discrimination between presence and absence classes using machine learning techniques, but typically hindered model calibration. Baseline sample prevalence, sample size, modelling approach and the intended application of SDM output (whether discrimination or calibration) should guide decisions about how to thin or balance data, given the considerable influence of these methodological choices on SDM performance. For prognostic applications requiring good model calibration (vis‐à‐vis discrimination), the match between sample prevalence and true species prevalence may be the overriding feature and warrants further investigation.
doi_str_mv 10.1111/2041-210X.13525
format Article
fullrecord <record><control><sourceid>proquest_cross</sourceid><recordid>TN_cdi_proquest_journals_2485992484</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><sourcerecordid>2485992484</sourcerecordid><originalsourceid>FETCH-LOGICAL-c3575-9d4a0b51b180d9f52f9a7546e2eba68de631009eea25b17bd2c5a8f370f389b13</originalsourceid><addsrcrecordid>eNqFkE1LAzEQhhdRsNSevQY8t02ym3bXm5T6gRUPKngLs8msTdkma7K11D_hXzbbinhzDskwPE_CvElyzuiIxRpzmrEhZ_R1xFLBxVHS-50c_-lPk0EIKxorzQvKs17y9dRAa6Am7dJYa-wbAauJqiEEUkINVsXZJbnHHVFLZxQGUiNo0jryAd5E11libNSRNOgr59fRQeIqEhpUJvLahNabcrNH105jHcjWtEuiTGs-0ZIQsc7R0MJZclJBHXDwc_eTl-v58-x2uHi8uZtdLYYqFVMxLHQGtBSsZDnVRSV4VcBUZBPkWMIk1zhJGaUFInBRsmmpuRKQV-mUVnHzkqX95OLwbuPd-wZDK1du4238UvIsF0URzyxS4wOlvAvBYyUbb9bgd5JR2QUvu2hlF63cBx-NycHYmhp3_-HyYT5PD-I3DAaHVQ</addsrcrecordid><sourcetype>Aggregation Database</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>2485992484</pqid></control><display><type>article</type><title>Spatial thinning and class balancing: Key choices lead to variation in the performance of species distribution models with citizen science data</title><source>Wiley Online Library Journals Frontfile Complete</source><source>Alma/SFX Local Collection</source><creator>Steen, Valerie A. ; Tingley, Morgan W. ; Paton, Peter W. C. ; Elphick, Chris S. ; McPherson, Jana</creator><contributor>McPherson, Jana</contributor><creatorcontrib>Steen, Valerie A. ; Tingley, Morgan W. ; Paton, Peter W. C. ; Elphick, Chris S. ; McPherson, Jana ; McPherson, Jana</creatorcontrib><description>Spatial biases are a common feature of presence–absence data from citizen scientists. Spatial thinning can mitigate errors in species distribution models (SDMs) that use these data. When detections or non‐detections are rare, however, SDMs may suffer from class imbalance or low sample size of the minority (i.e. rarer) class. Poor predictions can result, the severity of which may vary by modelling technique. 
To explore the consequences of spatial bias and class imbalance in presence–absence data, we used eBird citizen science data for 102 bird species from the northeastern USA to compare spatial thinning, class balancing and majority‐only thinning (i.e. retaining all samples of the minority class). We created SDMs using two parametric or semi‐parametric techniques (generalized linear models and generalized additive models) and two machine learning techniques (random forest and boosted regression trees). We tested the predictive abilities of these SDMs using an independent and systematically collected reference dataset with a combination of discrimination (area under the receiver operator characteristic curve; true skill statistic; area under the precision‐recall curve) and calibration (Brier score; Cohen's kappa) metrics. We found large variation in SDM performance depending on thinning and balancing decisions. Across all species, there was no single best approach, with the optimal choice of thinning and/or balancing depending on modelling technique, performance metric and the baseline sample prevalence of species in the data. Spatially thinning all the data was often a poor approach, especially for species with baseline sample prevalence &lt;0.1. For most of these rare species, balancing classes improved model discrimination between presence and absence classes using machine learning techniques, but typically hindered model calibration. Baseline sample prevalence, sample size, modelling approach and the intended application of SDM output—whether discrimination or calibration—should guide decisions about how to thin or balance data, given the considerable influence of these methodological choices on SDM performance. 
For prognostic applications requiring good model calibration (vis‐à‐vis discrimination), the match between sample prevalence and true species prevalence may be the overriding feature and warrants further investigation.</description><identifier>ISSN: 2041-210X</identifier><identifier>EISSN: 2041-210X</identifier><identifier>DOI: 10.1111/2041-210X.13525</identifier><language>eng</language><publisher>London: John Wiley &amp; Sons, Inc</publisher><subject>Balancing ; Calibration ; class balancing ; Decision trees ; Discrimination ; eBird ; Generalized linear models ; Geographical distribution ; Learning algorithms ; Machine learning ; occurrence data ; presence–absence data ; prevalence ; Rare species ; Regression analysis ; Spatial data ; spatial thinning ; Statistical analysis ; Statistical methods ; Statistical models ; Thinning</subject><ispartof>Methods in ecology and evolution, 2021-02, Vol.12 (2), p.216-226</ispartof><rights>2020 British Ecological Society</rights><rights>Methods in Ecology and Evolution © 2021 British Ecological Society</rights><lds50>peer_reviewed</lds50><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed><citedby>FETCH-LOGICAL-c3575-9d4a0b51b180d9f52f9a7546e2eba68de631009eea25b17bd2c5a8f370f389b13</citedby><cites>FETCH-LOGICAL-c3575-9d4a0b51b180d9f52f9a7546e2eba68de631009eea25b17bd2c5a8f370f389b13</cites><orcidid>0000-0002-1417-8139 ; 0000-0002-1477-2218</orcidid></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><linktopdf>$$Uhttps://onlinelibrary.wiley.com/doi/pdf/10.1111%2F2041-210X.13525$$EPDF$$P50$$Gwiley$$H</linktopdf><linktohtml>$$Uhttps://onlinelibrary.wiley.com/doi/full/10.1111%2F2041-210X.13525$$EHTML$$P50$$Gwiley$$H</linktohtml><link.rule.ids>314,776,780,1411,27903,27904,45553,45554</link.rule.ids></links><search><contributor>McPherson, Jana</contributor><creatorcontrib>Steen, Valerie 
A.</creatorcontrib><creatorcontrib>Tingley, Morgan W.</creatorcontrib><creatorcontrib>Paton, Peter W. C.</creatorcontrib><creatorcontrib>Elphick, Chris S.</creatorcontrib><creatorcontrib>McPherson, Jana</creatorcontrib><title>Spatial thinning and class balancing: Key choices lead to variation in the performance of species distribution models with citizen science data</title><title>Methods in ecology and evolution</title><description>Spatial biases are a common feature of presence–absence data from citizen scientists. Spatial thinning can mitigate errors in species distribution models (SDMs) that use these data. When detections or non‐detections are rare, however, SDMs may suffer from class imbalance or low sample size of the minority (i.e. rarer) class. Poor predictions can result, the severity of which may vary by modelling technique. To explore the consequences of spatial bias and class imbalance in presence–absence data, we used eBird citizen science data for 102 bird species from the northeastern USA to compare spatial thinning, class balancing and majority‐only thinning (i.e. retaining all samples of the minority class). We created SDMs using two parametric or semi‐parametric techniques (generalized linear models and generalized additive models) and two machine learning techniques (random forest and boosted regression trees). We tested the predictive abilities of these SDMs using an independent and systematically collected reference dataset with a combination of discrimination (area under the receiver operator characteristic curve; true skill statistic; area under the precision‐recall curve) and calibration (Brier score; Cohen's kappa) metrics. We found large variation in SDM performance depending on thinning and balancing decisions. Across all species, there was no single best approach, with the optimal choice of thinning and/or balancing depending on modelling technique, performance metric and the baseline sample prevalence of species in the data. 
Spatially thinning all the data was often a poor approach, especially for species with baseline sample prevalence &lt;0.1. For most of these rare species, balancing classes improved model discrimination between presence and absence classes using machine learning techniques, but typically hindered model calibration. Baseline sample prevalence, sample size, modelling approach and the intended application of SDM output—whether discrimination or calibration—should guide decisions about how to thin or balance data, given the considerable influence of these methodological choices on SDM performance. For prognostic applications requiring good model calibration (vis‐à‐vis discrimination), the match between sample prevalence and true species prevalence may be the overriding feature and warrants further investigation.</description><subject>Balancing</subject><subject>Calibration</subject><subject>class balancing</subject><subject>Decision trees</subject><subject>Discrimination</subject><subject>eBird</subject><subject>Generalized linear models</subject><subject>Geographical distribution</subject><subject>Learning algorithms</subject><subject>Machine learning</subject><subject>occurrence data</subject><subject>presence–absence data</subject><subject>prevalence</subject><subject>Rare species</subject><subject>Regression analysis</subject><subject>Spatial data</subject><subject>spatial thinning</subject><subject>Statistical analysis</subject><subject>Statistical methods</subject><subject>Statistical 
models</subject><subject>Thinning</subject><issn>2041-210X</issn><issn>2041-210X</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2021</creationdate><recordtype>article</recordtype><recordid>eNqFkE1LAzEQhhdRsNSevQY8t02ym3bXm5T6gRUPKngLs8msTdkma7K11D_hXzbbinhzDskwPE_CvElyzuiIxRpzmrEhZ_R1xFLBxVHS-50c_-lPk0EIKxorzQvKs17y9dRAa6Am7dJYa-wbAauJqiEEUkINVsXZJbnHHVFLZxQGUiNo0jryAd5E11libNSRNOgr59fRQeIqEhpUJvLahNabcrNH105jHcjWtEuiTGs-0ZIQsc7R0MJZclJBHXDwc_eTl-v58-x2uHi8uZtdLYYqFVMxLHQGtBSsZDnVRSV4VcBUZBPkWMIk1zhJGaUFInBRsmmpuRKQV-mUVnHzkqX95OLwbuPd-wZDK1du4238UvIsF0URzyxS4wOlvAvBYyUbb9bgd5JR2QUvu2hlF63cBx-NycHYmhp3_-HyYT5PD-I3DAaHVQ</recordid><startdate>202102</startdate><enddate>202102</enddate><creator>Steen, Valerie A.</creator><creator>Tingley, Morgan W.</creator><creator>Paton, Peter W. C.</creator><creator>Elphick, Chris S.</creator><creator>McPherson, Jana</creator><general>John Wiley &amp; Sons, Inc</general><scope>AAYXX</scope><scope>CITATION</scope><scope>7QG</scope><scope>7SN</scope><scope>8FD</scope><scope>C1K</scope><scope>FR3</scope><scope>P64</scope><scope>RC3</scope><orcidid>https://orcid.org/0000-0002-1417-8139</orcidid><orcidid>https://orcid.org/0000-0002-1477-2218</orcidid></search><sort><creationdate>202102</creationdate><title>Spatial thinning and class balancing: Key choices lead to variation in the performance of species distribution models with citizen science data</title><author>Steen, Valerie A. ; Tingley, Morgan W. ; Paton, Peter W. C. ; Elphick, Chris S. 
; McPherson, Jana</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-c3575-9d4a0b51b180d9f52f9a7546e2eba68de631009eea25b17bd2c5a8f370f389b13</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2021</creationdate><topic>Balancing</topic><topic>Calibration</topic><topic>class balancing</topic><topic>Decision trees</topic><topic>Discrimination</topic><topic>eBird</topic><topic>Generalized linear models</topic><topic>Geographical distribution</topic><topic>Learning algorithms</topic><topic>Machine learning</topic><topic>occurrence data</topic><topic>presence–absence data</topic><topic>prevalence</topic><topic>Rare species</topic><topic>Regression analysis</topic><topic>Spatial data</topic><topic>spatial thinning</topic><topic>Statistical analysis</topic><topic>Statistical methods</topic><topic>Statistical models</topic><topic>Thinning</topic><toplevel>peer_reviewed</toplevel><toplevel>online_resources</toplevel><creatorcontrib>Steen, Valerie A.</creatorcontrib><creatorcontrib>Tingley, Morgan W.</creatorcontrib><creatorcontrib>Paton, Peter W. C.</creatorcontrib><creatorcontrib>Elphick, Chris S.</creatorcontrib><creatorcontrib>McPherson, Jana</creatorcontrib><collection>CrossRef</collection><collection>Animal Behavior Abstracts</collection><collection>Ecology Abstracts</collection><collection>Technology Research Database</collection><collection>Environmental Sciences and Pollution Management</collection><collection>Engineering Research Database</collection><collection>Biotechnology and BioEngineering Abstracts</collection><collection>Genetics Abstracts</collection><jtitle>Methods in ecology and evolution</jtitle></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>Steen, Valerie A.</au><au>Tingley, Morgan W.</au><au>Paton, Peter W. 
C.</au><au>Elphick, Chris S.</au><au>McPherson, Jana</au><au>McPherson, Jana</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>Spatial thinning and class balancing: Key choices lead to variation in the performance of species distribution models with citizen science data</atitle><jtitle>Methods in ecology and evolution</jtitle><date>2021-02</date><risdate>2021</risdate><volume>12</volume><issue>2</issue><spage>216</spage><epage>226</epage><pages>216-226</pages><issn>2041-210X</issn><eissn>2041-210X</eissn><abstract>Spatial biases are a common feature of presence–absence data from citizen scientists. Spatial thinning can mitigate errors in species distribution models (SDMs) that use these data. When detections or non‐detections are rare, however, SDMs may suffer from class imbalance or low sample size of the minority (i.e. rarer) class. Poor predictions can result, the severity of which may vary by modelling technique. To explore the consequences of spatial bias and class imbalance in presence–absence data, we used eBird citizen science data for 102 bird species from the northeastern USA to compare spatial thinning, class balancing and majority‐only thinning (i.e. retaining all samples of the minority class). We created SDMs using two parametric or semi‐parametric techniques (generalized linear models and generalized additive models) and two machine learning techniques (random forest and boosted regression trees). We tested the predictive abilities of these SDMs using an independent and systematically collected reference dataset with a combination of discrimination (area under the receiver operator characteristic curve; true skill statistic; area under the precision‐recall curve) and calibration (Brier score; Cohen's kappa) metrics. We found large variation in SDM performance depending on thinning and balancing decisions. 
Across all species, there was no single best approach, with the optimal choice of thinning and/or balancing depending on modelling technique, performance metric and the baseline sample prevalence of species in the data. Spatially thinning all the data was often a poor approach, especially for species with baseline sample prevalence &lt;0.1. For most of these rare species, balancing classes improved model discrimination between presence and absence classes using machine learning techniques, but typically hindered model calibration. Baseline sample prevalence, sample size, modelling approach and the intended application of SDM output—whether discrimination or calibration—should guide decisions about how to thin or balance data, given the considerable influence of these methodological choices on SDM performance. For prognostic applications requiring good model calibration (vis‐à‐vis discrimination), the match between sample prevalence and true species prevalence may be the overriding feature and warrants further investigation.</abstract><cop>London</cop><pub>John Wiley &amp; Sons, Inc</pub><doi>10.1111/2041-210X.13525</doi><tpages>12</tpages><orcidid>https://orcid.org/0000-0002-1417-8139</orcidid><orcidid>https://orcid.org/0000-0002-1477-2218</orcidid><oa>free_for_read</oa></addata></record>
fulltext fulltext
identifier ISSN: 2041-210X
ispartof Methods in ecology and evolution, 2021-02, Vol.12 (2), p.216-226
issn 2041-210X
2041-210X
language eng
recordid cdi_proquest_journals_2485992484
source Wiley Online Library Journals Frontfile Complete; Alma/SFX Local Collection
subjects Balancing
Calibration
class balancing
Decision trees
Discrimination
eBird
Generalized linear models
Geographical distribution
Learning algorithms
Machine learning
occurrence data
presence–absence data
prevalence
Rare species
Regression analysis
Spatial data
spatial thinning
Statistical analysis
Statistical methods
Statistical models
Thinning
title Spatial thinning and class balancing: Key choices lead to variation in the performance of species distribution models with citizen science data
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-25T14%3A49%3A23IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_cross&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Spatial%20thinning%20and%20class%20balancing:%20Key%20choices%20lead%20to%20variation%20in%20the%20performance%20of%20species%20distribution%20models%20with%20citizen%20science%20data&rft.jtitle=Methods%20in%20ecology%20and%20evolution&rft.au=Steen,%20Valerie%20A.&rft.date=2021-02&rft.volume=12&rft.issue=2&rft.spage=216&rft.epage=226&rft.pages=216-226&rft.issn=2041-210X&rft.eissn=2041-210X&rft_id=info:doi/10.1111/2041-210X.13525&rft_dat=%3Cproquest_cross%3E2485992484%3C/proquest_cross%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=2485992484&rft_id=info:pmid/&rfr_iscdi=true
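
Illustrative note on the abstract: the three data treatments it compares (spatial thinning, majority-only thinning and class balancing) can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions and not the authors' implementation. The record does not say how thinning or balancing was carried out, so the grid-cell thinning rule, the downsampling used for balancing, the 0.5-degree cell size, the synthetic records and all function and field names (`thin`, `majority_only_thin`, `balance`, `detected`) are hypothetical choices made for illustration.

```python
import random
from collections import defaultdict


def grid_cell(lon, lat, cell_size):
    """Return the integer (column, row) index of the grid cell holding a point."""
    return (int(lon // cell_size), int(lat // cell_size))


def thin(records, cell_size, rng):
    """Spatial thinning (assumed rule): keep one random record per grid cell."""
    cells = defaultdict(list)
    for rec in records:
        cells[grid_cell(rec["lon"], rec["lat"], cell_size)].append(rec)
    return [rng.choice(group) for _, group in sorted(cells.items())]


def split_by_class(records):
    """Return (minority, majority) lists of detections and non-detections."""
    detections = [r for r in records if r["detected"] == 1]
    non_detections = [r for r in records if r["detected"] == 0]
    if len(detections) <= len(non_detections):
        return detections, non_detections
    return non_detections, detections


def majority_only_thin(records, cell_size, rng):
    """Keep every minority-class record; thin only the majority class."""
    minority, majority = split_by_class(records)
    return minority + thin(majority, cell_size, rng)


def balance(records, rng):
    """Class balancing (assumed rule): downsample the majority to the minority size."""
    minority, majority = split_by_class(records)
    return minority + rng.sample(majority, len(minority))


# Synthetic checklists: 100 detections and 1900 non-detections (prevalence 0.05),
# scattered over a hypothetical 5 x 5 degree study area.
rng = random.Random(0)
records = [
    {"lon": rng.uniform(-75, -70), "lat": rng.uniform(40, 45), "detected": int(i < 100)}
    for i in range(2000)
]

thinned = thin(records, 0.5, rng)
majority_thinned = majority_only_thin(records, 0.5, rng)
balanced = balance(records, rng)

for name, data in [("all data", records), ("thinned", thinned),
                   ("majority-only thinned", majority_thinned), ("balanced", balanced)]:
    prevalence = sum(r["detected"] for r in data) / len(data)
    print(f"{name}: n={len(data)}, sample prevalence={prevalence:.3f}")
```

Running the sketch prints the sample size and sample prevalence under each treatment. In this construction, thinning all the data can discard most of the rare detections, whereas majority-only thinning and balancing retain every one of them while raising sample prevalence, which is the trade-off between discrimination and calibration that the abstract describes.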