Detecting outliers in species distribution data

Aim: Species distribution data play a pivotal role in the study of ecology, evolution, biogeography and biodiversity conservation. Although large amounts of location data are available and accessible from public databases, data quality remains problematic. Of the potential sources of error, position...

Bibliographic details
Published in: Journal of biogeography 2018-01, Vol.45 (1), p.164-176
Main authors: Liu, Canran; White, Matt; Newell, Graeme
Format: Article
Language: eng
Online access: Full text
container_end_page 176
container_issue 1
container_start_page 164
container_title Journal of biogeography
container_volume 45
creator Liu, Canran
White, Matt
Newell, Graeme
description Aim: Species distribution data play a pivotal role in the study of ecology, evolution, biogeography and biodiversity conservation. Although large amounts of location data are available and accessible from public databases, data quality remains problematic. Of the potential sources of error, positional errors are critical for spatial applications, particularly where these errors place observations beyond the environmental or geographical range of species. These outliers need to be identified, checked and removed to improve data quality and minimize the impact on subsequent analyses. Manually checking all species records within large multispecies datasets is prohibitively costly. This work investigates algorithms that may assist in the efficient vetting of outliers in such large datasets.
Location: We used real, spatially explicit environmental data derived from the western part of Victoria, Australia, and simulated species distributions within this same region.
Methods: By adapting species distribution modelling (SDM), we developed a pseudo-SDM approach for detecting outliers in species distribution data, which was implemented with random forest (RF) and support vector machine (SVM), resulting in two new methods: RF_pdSDM and SVM_pdSDM. Using virtual species, we compared eight existing multivariate outlier detection methods with these two new methods under various conditions.
Results: The two new methods based on the pseudo-SDM approach had higher true skill statistic (TSS) values than other approaches, with TSS values always exceeding 0. More than 70% of the true outliers in datasets for species with a low and intermediate prevalence can be identified by checking 10% of the data points with the highest outlier scores.
Main conclusions: Pseudo-SDM-based methods were more effective than other outlier detection methods. However, this outlier detection procedure can only be considered as a screening tool, and putative outliers must be examined by experts to determine whether they are actual errors or important records within an inherently biased set of data.
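The two evaluation quantities named in the Results can be illustrated with a small sketch. It assumes the standard definition of the true skill statistic (sensitivity + specificity - 1) and treats "checking 10% of the data points with the highest outlier scores" as ranking records by score and inspecting the top fraction. The function names, the toy labels and scores, and the 0.35 flagging threshold are hypothetical and are not taken from the article; the pseudo-SDM scoring itself is not reproduced here.

```python
from typing import Sequence


def true_skill_statistic(is_outlier: Sequence[bool], flagged: Sequence[bool]) -> float:
    """TSS = sensitivity + specificity - 1 for a binary outlier flag."""
    pairs = list(zip(is_outlier, flagged))
    tp = sum(1 for t, f in pairs if t and f)          # true outliers flagged
    fn = sum(1 for t, f in pairs if t and not f)      # true outliers missed
    tn = sum(1 for t, f in pairs if not t and not f)  # valid records left alone
    fp = sum(1 for t, f in pairs if not t and f)      # valid records flagged
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one true outlier and one valid record")
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return sensitivity + specificity - 1.0


def recall_in_top_fraction(
    is_outlier: Sequence[bool], scores: Sequence[float], fraction: float = 0.10
) -> float:
    """Share of true outliers found when only the top-scoring fraction is checked."""
    total = sum(1 for t in is_outlier if t)
    if total == 0:
        raise ValueError("no true outliers to recover")
    n_checked = max(1, round(fraction * len(scores)))
    # Rank record indices from highest to lowest outlier score.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    found = sum(1 for i in ranked[:n_checked] if is_outlier[i])
    return found / total


# Hypothetical toy data: 10 records, 2 of them true outliers.
labels = [False, False, True, False, False, False, False, True, False, False]
scores = [0.10, 0.20, 0.90, 0.30, 0.10, 0.40, 0.20, 0.35, 0.05, 0.15]
flagged = [s >= 0.35 for s in scores]  # hypothetical cut-off on the score

print(true_skill_statistic(labels, flagged))           # 0.875
print(recall_in_top_fraction(labels, scores, 0.10))    # 0.5
print(recall_in_top_fraction(labels, scores, 0.30))    # 1.0
```

A TSS of 0 corresponds to a flag that performs no better than chance, which is why values "always exceeding 0" are the baseline reported in the abstract. The top-fraction recall is the quantity a data curator would care about when only part of a dataset can be checked by hand.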
doi_str_mv 10.1111/jbi.13122
format Article
publisher Oxford: John Wiley & Sons Ltd
eissn 1365-2699
rights Copyright © 2017 John Wiley & Sons Ltd.; Copyright © 2018 John Wiley & Sons Ltd
toplevel peer_reviewed; online_resources
tpages 13
orcid https://orcid.org/0000-0001-8023-6758
jstor_id 26626798
pqid 1982355858
linktohtml https://www.jstor.org/stable/26626798
linktopdf https://www.jstor.org/stable/pdf/26626798
collection CrossRef; Ecology Abstracts; Entomology Abstracts (Full archive); Technology Research Database; Environmental Sciences and Pollution Management; Engineering Research Database; Biotechnology and BioEngineering Abstracts; Genetics Abstracts
fulltext fulltext
identifier ISSN: 0305-0270
ispartof Journal of biogeography, 2018-01, Vol.45 (1), p.164-176
issn 0305-0270
1365-2699
language eng
recordid cdi_proquest_journals_1982355858
source Jstor Complete Legacy; Wiley Online Library Journals Frontfile Complete
subjects Algorithms
Biodiversity
Computer simulation
Data analysis
Data points
Datasets
Ecological monitoring
Impact analysis
METHODS AND TOOLS
outlier
outlier detection
Outliers (statistics)
random forest
species distribution
species distribution modelling
support vector machine
Support vector machines
virtual species
Wildlife conservation
title Detecting outliers in species distribution data
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-02-04T01%3A38%3A55IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-jstor_proqu&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Detecting%20outliers%20in%20species%20distribution%20data&rft.jtitle=Journal%20of%20biogeography&rft.au=Liu,%20Canran&rft.date=2018-01&rft.volume=45&rft.issue=1&rft.spage=164&rft.epage=176&rft.pages=164-176&rft.issn=0305-0270&rft.eissn=1365-2699&rft_id=info:doi/10.1111/jbi.13122&rft_dat=%3Cjstor_proqu%3E26626798%3C/jstor_proqu%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=1982355858&rft_id=info:pmid/&rft_jstor_id=26626798&rfr_iscdi=true