Contamination in Reference Sequence Databases: Time for Divide-and-Rule Tactics

Contaminating sequences in public genome databases is a pervasive issue with potentially far-reaching consequences. This problem has attracted much attention in the recent literature and many different tools are now available to detect contaminants. Although these methods are based on diverse algori...

Bibliographic Details
Published in: FRONTIERS IN MICROBIOLOGY 2021-10, Vol.12
Main authors: Lupo, Valerian; Van Vlierberghe, Mick; Vanderschuren, Herve; Kerff, Frederic; Baurain, Denis; Cornet, Luc
Format: Article
Language: eng
Online access: Full text
container_end_page
container_issue
container_start_page
container_title FRONTIERS IN MICROBIOLOGY
container_volume 12
creator Lupo, Valerian
Van Vlierberghe, Mick
Vanderschuren, Herve
Kerff, Frederic
Baurain, Denis
Cornet, Luc
description Contaminating sequences in public genome databases is a pervasive issue with potentially far-reaching consequences. This problem has attracted much attention in the recent literature and many different tools are now available to detect contaminants. Although these methods are based on diverse algorithms that can sometimes produce widely different estimates of the contamination level, the majority of genomic studies rely on a single method of detection, which represents a risk of systematic error. In this work, we used two orthogonal methods to assess the level of contamination among National Center for Biotechnological Information Reference Sequence Database (RefSeq) bacterial genomes. First, we applied the most popular solution, CheckM, which is based on gene markers. We then complemented this approach by a genome-wide method, termed Physeter, which now implements a k-folds algorithm to avoid inaccurate detection due to potential contamination of the reference database. We demonstrate that CheckM cannot currently be applied to all available genomes and bacterial groups. While it performed well on the majority of RefSeq genomes, it produced dubious results for 12,326 organisms. Among those, Physeter identified 239 contaminated genomes that had been missed by CheckM. In conclusion, we emphasize the importance of using multiple methods of detection while providing an upgrade of our own detection tool, Physeter, which minimizes incorrect contamination estimates in the context of unavoidably contaminated reference databases.
format Article
fullrecord <record><control><sourceid>kuleuven</sourceid><recordid>TN_cdi_kuleuven_dspace_20_500_12942_687168</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><sourcerecordid>20_500_12942_687168</sourcerecordid><originalsourceid>FETCH-kuleuven_dspace_20_500_12942_6871683</originalsourceid><addsrcrecordid>eNqVjLEOgjAURRujiUT5h84mNaUgoito3EyQwa15wiOpQlFaiJ8vMQ6Oeod7z3ByR8TxwjBgPhfn8RdPiWvMlQ8JuBjaIce40RZqpcGqRlOlaYoltqhzpCd8dG9IwMIFDJotzVSNtGxamqheFchAFyztKqQZ5FblZk4mJVQG3c_OyGK_y-IDuw1S16OWhblDjlJwueJcemITCBlGay-M_BlZ_ixL-7T-X-8vKUhRQA</addsrcrecordid><sourcetype>Institutional Repository</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype></control><display><type>article</type><title>Contamination in Reference Sequence Databases: Time for Divide-and-Rule Tactics</title><source>Lirias (KU Leuven Association)</source><source>DOAJ Directory of Open Access Journals</source><source>Elektronische Zeitschriftenbibliothek - Frei zugängliche E-Journals</source><source>PubMed Central Open Access</source><source>PubMed Central</source><creator>Lupo, Valerian ; Van Vlierberghe, Mick ; Vanderschuren, Herve ; Kerff, Frederic ; Baurain, Denis ; Cornet, Luc</creator><creatorcontrib>Lupo, Valerian ; Van Vlierberghe, Mick ; Vanderschuren, Herve ; Kerff, Frederic ; Baurain, Denis ; Cornet, Luc</creatorcontrib><description>Contaminating sequences in public genome databases is a pervasive issue with potentially far-reaching consequences. This problem has attracted much attention in the recent literature and many different tools are now available to detect contaminants. Although these methods are based on diverse algorithms that can sometimes produce widely different estimates of the contamination level, the majority of genomic studies rely on a single method of detection, which represents a risk of systematic error. 
In this work, we used two orthogonal methods to assess the level of contamination among National Center for Biotechnological Information Reference Sequence Database (RefSeq) bacterial genomes. First, we applied the most popular solution, CheckM, which is based on gene markers. We then complemented this approach by a genome-wide method, termed Physeter, which now implements a k-folds algorithm to avoid inaccurate detection due to potential contamination of the reference database. We demonstrate that CheckM cannot currently be applied to all available genomes and bacterial groups. While it performed well on the majority of RefSeq genomes, it produced dubious results for 12,326 organisms. Among those, Physeter identified 239 contaminated genomes that had been missed by CheckM. In conclusion, we emphasize the importance of using multiple methods of detection while providing an upgrade of our own detection tool, Physeter, which minimizes incorrect contamination estimates in the context of unavoidably contaminated reference databases.</description><identifier>ISSN: 1664-302X</identifier><identifier>EISSN: 1664-302X</identifier><language>eng</language><publisher>FRONTIERS MEDIA SA</publisher><ispartof>FRONTIERS IN MICROBIOLOGY, 2021-10, Vol.12</ispartof><lds50>peer_reviewed</lds50><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><link.rule.ids>315,316,781,785,27865</link.rule.ids></links><search><creatorcontrib>Lupo, Valerian</creatorcontrib><creatorcontrib>Van Vlierberghe, Mick</creatorcontrib><creatorcontrib>Vanderschuren, Herve</creatorcontrib><creatorcontrib>Kerff, Frederic</creatorcontrib><creatorcontrib>Baurain, Denis</creatorcontrib><creatorcontrib>Cornet, Luc</creatorcontrib><title>Contamination in Reference Sequence Databases: Time for Divide-and-Rule 
Tactics</title><title>FRONTIERS IN MICROBIOLOGY</title><description>Contaminating sequences in public genome databases is a pervasive issue with potentially far-reaching consequences. This problem has attracted much attention in the recent literature and many different tools are now available to detect contaminants. Although these methods are based on diverse algorithms that can sometimes produce widely different estimates of the contamination level, the majority of genomic studies rely on a single method of detection, which represents a risk of systematic error. In this work, we used two orthogonal methods to assess the level of contamination among National Center for Biotechnological Information Reference Sequence Database (RefSeq) bacterial genomes. First, we applied the most popular solution, CheckM, which is based on gene markers. We then complemented this approach by a genome-wide method, termed Physeter, which now implements a k-folds algorithm to avoid inaccurate detection due to potential contamination of the reference database. We demonstrate that CheckM cannot currently be applied to all available genomes and bacterial groups. While it performed well on the majority of RefSeq genomes, it produced dubious results for 12,326 organisms. Among those, Physeter identified 239 contaminated genomes that had been missed by CheckM. 
In conclusion, we emphasize the importance of using multiple methods of detection while providing an upgrade of our own detection tool, Physeter, which minimizes incorrect contamination estimates in the context of unavoidably contaminated reference databases.</description><issn>1664-302X</issn><issn>1664-302X</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2021</creationdate><recordtype>article</recordtype><sourceid>FZOIL</sourceid><recordid>eNqVjLEOgjAURRujiUT5h84mNaUgoito3EyQwa15wiOpQlFaiJ8vMQ6Oeod7z3ByR8TxwjBgPhfn8RdPiWvMlQ8JuBjaIce40RZqpcGqRlOlaYoltqhzpCd8dG9IwMIFDJotzVSNtGxamqheFchAFyztKqQZ5FblZk4mJVQG3c_OyGK_y-IDuw1S16OWhblDjlJwueJcemITCBlGay-M_BlZ_ixL-7T-X-8vKUhRQA</recordid><startdate>20211022</startdate><enddate>20211022</enddate><creator>Lupo, Valerian</creator><creator>Van Vlierberghe, Mick</creator><creator>Vanderschuren, Herve</creator><creator>Kerff, Frederic</creator><creator>Baurain, Denis</creator><creator>Cornet, Luc</creator><general>FRONTIERS MEDIA SA</general><scope>FZOIL</scope></search><sort><creationdate>20211022</creationdate><title>Contamination in Reference Sequence Databases: Time for Divide-and-Rule Tactics</title><author>Lupo, Valerian ; Van Vlierberghe, Mick ; Vanderschuren, Herve ; Kerff, Frederic ; Baurain, Denis ; Cornet, Luc</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-kuleuven_dspace_20_500_12942_6871683</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2021</creationdate><toplevel>peer_reviewed</toplevel><toplevel>online_resources</toplevel><creatorcontrib>Lupo, Valerian</creatorcontrib><creatorcontrib>Van Vlierberghe, Mick</creatorcontrib><creatorcontrib>Vanderschuren, Herve</creatorcontrib><creatorcontrib>Kerff, Frederic</creatorcontrib><creatorcontrib>Baurain, Denis</creatorcontrib><creatorcontrib>Cornet, Luc</creatorcontrib><collection>Lirias (KU Leuven Association)</collection><jtitle>FRONTIERS IN 
MICROBIOLOGY</jtitle></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>Lupo, Valerian</au><au>Van Vlierberghe, Mick</au><au>Vanderschuren, Herve</au><au>Kerff, Frederic</au><au>Baurain, Denis</au><au>Cornet, Luc</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>Contamination in Reference Sequence Databases: Time for Divide-and-Rule Tactics</atitle><jtitle>FRONTIERS IN MICROBIOLOGY</jtitle><date>2021-10-22</date><risdate>2021</risdate><volume>12</volume><issn>1664-302X</issn><eissn>1664-302X</eissn><abstract>Contaminating sequences in public genome databases is a pervasive issue with potentially far-reaching consequences. This problem has attracted much attention in the recent literature and many different tools are now available to detect contaminants. Although these methods are based on diverse algorithms that can sometimes produce widely different estimates of the contamination level, the majority of genomic studies rely on a single method of detection, which represents a risk of systematic error. In this work, we used two orthogonal methods to assess the level of contamination among National Center for Biotechnological Information Reference Sequence Database (RefSeq) bacterial genomes. First, we applied the most popular solution, CheckM, which is based on gene markers. We then complemented this approach by a genome-wide method, termed Physeter, which now implements a k-folds algorithm to avoid inaccurate detection due to potential contamination of the reference database. We demonstrate that CheckM cannot currently be applied to all available genomes and bacterial groups. While it performed well on the majority of RefSeq genomes, it produced dubious results for 12,326 organisms. Among those, Physeter identified 239 contaminated genomes that had been missed by CheckM. 
In conclusion, we emphasize the importance of using multiple methods of detection while providing an upgrade of our own detection tool, Physeter, which minimizes incorrect contamination estimates in the context of unavoidably contaminated reference databases.</abstract><pub>FRONTIERS MEDIA SA</pub><oa>free_for_read</oa></addata></record>
fulltext fulltext
identifier ISSN: 1664-302X
ispartof FRONTIERS IN MICROBIOLOGY, 2021-10, Vol.12
issn 1664-302X
1664-302X
language eng
recordid cdi_kuleuven_dspace_20_500_12942_687168
source Lirias (KU Leuven Association); DOAJ Directory of Open Access Journals; Elektronische Zeitschriftenbibliothek - Frei zugängliche E-Journals; PubMed Central Open Access; PubMed Central
title Contamination in Reference Sequence Databases: Time for Divide-and-Rule Tactics
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2024-12-14T21%3A21%3A53IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-kuleuven&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Contamination%20in%20Reference%20Sequence%20Databases:%20Time%20for%20Divide-and-Rule%20Tactics&rft.jtitle=FRONTIERS%20IN%20MICROBIOLOGY&rft.au=Lupo,%20Valerian&rft.date=2021-10-22&rft.volume=12&rft.issn=1664-302X&rft.eissn=1664-302X&rft_id=info:doi/&rft_dat=%3Ckuleuven%3E20_500_12942_687168%3C/kuleuven%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_id=info:pmid/&rfr_iscdi=true