CRISPRidentify: identification of CRISPR arrays using machine learning approach

Abstract CRISPR–Cas are adaptive immune systems that degrade foreign genetic elements in archaea and bacteria. In carrying out their immune functions, CRISPR–Cas systems heavily rely on RNA components. These CRISPR (cr) RNAs are repeat-spacer units that are produced by processing of pre-crRNA, the t...
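
The full abstract in this record describes a three-step pipeline: detection, feature extraction, and classification that yields a certainty score. The following is only a toy sketch of that flow, not the CRISPRidentify implementation: the function names, the exact k-mer matching, the three features, and the hand-set logistic weights are all assumptions made for illustration. The published tool instead learns its classifier from manually curated positive and negative examples.

```python
import math
from collections import defaultdict
from statistics import mean, pstdev


def find_candidates(seq, k=8, min_copies=3):
    """Step 1 (detection): collect start positions of exact k-mers seen at least min_copies times."""
    positions = defaultdict(list)
    for i in range(len(seq) - k + 1):
        positions[seq[i:i + k]].append(i)
    return {kmer: pos for kmer, pos in positions.items() if len(pos) >= min_copies}


def extract_features(kmer, pos):
    """Step 2 (feature extraction): hypothetical numeric descriptors of one candidate."""
    # Gap between the end of one repeat copy and the start of the next.
    spacers = [b - a - len(kmer) for a, b in zip(pos, pos[1:])]
    return {
        "copies": len(pos),
        "mean_spacer": mean(spacers),
        "spacer_sd": pstdev(spacers),
    }


def certainty(features, weights=None, bias=-2.0):
    """Step 3 (classification): logistic score in (0, 1).

    The weights are invented placeholders standing in for a trained model.
    """
    weights = weights or {"copies": 0.8, "mean_spacer": 0.0, "spacer_sd": -0.5}
    z = bias + sum(weights[name] * value for name, value in features.items())
    z = max(-50.0, min(50.0, z))  # clamp to keep math.exp from overflowing
    return 1.0 / (1.0 + math.exp(-z))


# Synthetic array: one 8-nt repeat separated by three distinct 10-nt spacers.
repeat = "GTTTCAGA"
spacers = ["ACGGATCTAC", "TTGACCAGTA", "CCATGGATTC"]
seq = repeat + "".join(s + repeat for s in spacers)

for kmer, pos in find_candidates(seq).items():
    feats = extract_features(kmer, pos)
    print(kmer, pos, feats, round(certainty(feats), 3))
```

Many evenly spaced copies push the score up and irregular spacing pulls it down. This mirrors the abstract's idea of a certainty score, but with fixed weights rather than ones learned from data.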

Detailed description

Bibliographic details
Published in: Nucleic acids research 2021-02, Vol.49 (4), p.e20-e20
Main authors: Mitrofanov, Alexander, Alkhnbashi, Omer S, Shmakov, Sergey A, Makarova, Kira S, Koonin, Eugene V, Backofen, Rolf
Format: Article
Language: eng
Subjects:
Online access: Full text
container_end_page e20
container_issue 4
container_start_page e20
container_title Nucleic acids research
container_volume 49
creator Mitrofanov, Alexander
Alkhnbashi, Omer S
Shmakov, Sergey A
Makarova, Kira S
Koonin, Eugene V
Backofen, Rolf
description Abstract CRISPR–Cas are adaptive immune systems that degrade foreign genetic elements in archaea and bacteria. In carrying out their immune functions, CRISPR–Cas systems heavily rely on RNA components. These CRISPR (cr) RNAs are repeat-spacer units that are produced by processing of pre-crRNA, the transcript of CRISPR arrays, and guide Cas protein(s) to the cognate invading nucleic acids, enabling their destruction. Several bioinformatics tools have been developed to detect CRISPR arrays based solely on DNA sequences, but all these tools employ the same strategy of looking for repetitive patterns, which might correspond to CRISPR array repeats. The identified patterns are evaluated using a fixed, built-in scoring function, and arrays exceeding a cut-off value are reported. Here, we instead introduce a data-driven approach that uses machine learning to detect and differentiate true CRISPR arrays from false ones based on several features. Our CRISPR detection tool, CRISPRidentify, performs three steps: detection, feature extraction and classification based on manually curated sets of positive and negative examples of CRISPR arrays. The identified CRISPR arrays are then reported to the user accompanied by detailed annotation. We demonstrate that our approach identifies not only previously detected CRISPR arrays, but also CRISPR array candidates not detected by other tools. Compared to other methods, our tool has a drastically reduced false positive rate. In contrast to the existing tools, our approach not only provides the user with the basic statistics on the identified CRISPR arrays but also produces a certainty score as a practical measure of the likelihood that a given genomic region is a CRISPR array.
doi_str_mv 10.1093/nar/gkaa1158
format Article
fullrecord <record><control><sourceid>proquest_pubme</sourceid><recordid>TN_cdi_pubmedcentral_primary_oai_pubmedcentral_nih_gov_7913763</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><oup_id>10.1093/nar/gkaa1158</oup_id><sourcerecordid>2468658334</sourcerecordid><originalsourceid>FETCH-LOGICAL-c416t-5ccea99ce6ac031cd0c820448f1b466de437596629cd79dc53073a41001625b93</originalsourceid><addsrcrecordid>eNp9kUlPwzAQhS0EoqVw44xygwOhdrwk5oCEKpZKlYoKnK2p47SGNA52g9R_T6ouggunGc18erM8hM4JviFY0n4Fvj_7BCCEZweoS6hIYiZFcoi6mGIeE8yyDjoJ4QNjwghnx6hDaSIxx7yLxoPJ8PVlYnNTLW2xuo22mdWwtK6KXBFtiAi8h1WImmCrWbQAPbeViUoDvloXoK69a4un6KiAMpizbeyh98eHt8FzPBo_DQf3o1gzIpYx19qAlNoI0JgSnWOdJZixrCBTJkRuGE25FCKROk9lrjnFKQVG2hNEwqeS9tDdRrdupguT63ZpD6WqvV2AXykHVv3tVHauZu5bpZLQVNBW4Gor4N1XY8JSLWzQpiyhMq4JKmEiEzyjlLXo9QbV3oXgTbEfQ7Bae6BaD9TOgxa_-L3aHt49vQUuN4Br6v-lfgC8W5D-</addsrcrecordid><sourcetype>Open Access Repository</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>2468658334</pqid></control><display><type>article</type><title>CRISPRidentify: identification of CRISPR arrays using machine learning approach</title><source>MEDLINE</source><source>Full-Text Journals in Chemistry (Open access)</source><source>DOAJ Directory of Open Access Journals</source><source>Oxford University Press Open Access</source><source>PubMed Central</source><creator>Mitrofanov, Alexander ; Alkhnbashi, Omer S ; Shmakov, Sergey A ; Makarova, Kira S ; Koonin, Eugene V ; Backofen, Rolf</creator><creatorcontrib>Mitrofanov, Alexander ; Alkhnbashi, Omer S ; Shmakov, Sergey A ; Makarova, Kira S ; Koonin, Eugene V ; Backofen, Rolf</creatorcontrib><description>Abstract CRISPR–Cas are adaptive immune systems that degrade foreign genetic elements in archaea and bacteria. In carrying out their immune functions, CRISPR–Cas systems heavily rely on RNA components. 
These CRISPR (cr) RNAs are repeat-spacer units that are produced by processing of pre-crRNA, the transcript of CRISPR arrays, and guide Cas protein(s) to the cognate invading nucleic acids, enabling their destruction. Several bioinformatics tools have been developed to detect CRISPR arrays based solely on DNA sequences, but all these tools employ the same strategy of looking for repetitive patterns, which might correspond to CRISPR array repeats. The identified patterns are evaluated using a fixed, built-in scoring function, and arrays exceeding a cut-off value are reported. Here, we instead introduce a data-driven approach that uses machine learning to detect and differentiate true CRISPR arrays from false ones based on several features. Our CRISPR detection tool, CRISPRidentify, performs three steps: detection, feature extraction and classification based on manually curated sets of positive and negative examples of CRISPR arrays. The identified CRISPR arrays are then reported to the user accompanied by detailed annotation. We demonstrate that our approach identifies not only previously detected CRISPR arrays, but also CRISPR array candidates not detected by other tools. Compared to other methods, our tool has a drastically reduced false positive rate. 
In contrast to the existing tools, our approach not only provides the user with the basic statistics on the identified CRISPR arrays but also produces a certainty score as a practical measure of the likelihood that a given genomic region is a CRISPR array.</description><identifier>ISSN: 0305-1048</identifier><identifier>EISSN: 1362-4962</identifier><identifier>DOI: 10.1093/nar/gkaa1158</identifier><identifier>PMID: 33290505</identifier><language>eng</language><publisher>England: Oxford University Press</publisher><subject>Clustered Regularly Interspaced Short Palindromic Repeats ; Genome, Archaeal ; Genome, Bacterial ; Machine Learning ; Methods Online ; Software</subject><ispartof>Nucleic acids research, 2021-02, Vol.49 (4), p.e20-e20</ispartof><rights>The Author(s) 2020. Published by Oxford University Press on behalf of Nucleic Acids Research. 2021</rights><rights>The Author(s) 2020. Published by Oxford University Press on behalf of Nucleic Acids Research.</rights><lds50>peer_reviewed</lds50><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed><citedby>FETCH-LOGICAL-c416t-5ccea99ce6ac031cd0c820448f1b466de437596629cd79dc53073a41001625b93</citedby><cites>FETCH-LOGICAL-c416t-5ccea99ce6ac031cd0c820448f1b466de437596629cd79dc53073a41001625b93</cites><orcidid>0000-0001-8088-590X ; 0000-0003-3943-8299 ; 0000-0001-8231-3323 ; 0000-0002-8174-2844</orcidid></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><linktopdf>$$Uhttps://www.ncbi.nlm.nih.gov/pmc/articles/PMC7913763/pdf/$$EPDF$$P50$$Gpubmedcentral$$Hfree_for_read</linktopdf><linktohtml>$$Uhttps://www.ncbi.nlm.nih.gov/pmc/articles/PMC7913763/$$EHTML$$P50$$Gpubmedcentral$$Hfree_for_read</linktohtml><link.rule.ids>230,314,727,780,784,864,885,1604,27924,27925,53791,53793</link.rule.ids><backlink>$$Uhttps://www.ncbi.nlm.nih.gov/pubmed/33290505$$D View this record in 
MEDLINE/PubMed$$Hfree_for_read</backlink></links><search><creatorcontrib>Mitrofanov, Alexander</creatorcontrib><creatorcontrib>Alkhnbashi, Omer S</creatorcontrib><creatorcontrib>Shmakov, Sergey A</creatorcontrib><creatorcontrib>Makarova, Kira S</creatorcontrib><creatorcontrib>Koonin, Eugene V</creatorcontrib><creatorcontrib>Backofen, Rolf</creatorcontrib><title>CRISPRidentify: identification of CRISPR arrays using machine learning approach</title><title>Nucleic acids research</title><addtitle>Nucleic Acids Res</addtitle><description>Abstract CRISPR–Cas are adaptive immune systems that degrade foreign genetic elements in archaea and bacteria. In carrying out their immune functions, CRISPR–Cas systems heavily rely on RNA components. These CRISPR (cr) RNAs are repeat-spacer units that are produced by processing of pre-crRNA, the transcript of CRISPR arrays, and guide Cas protein(s) to the cognate invading nucleic acids, enabling their destruction. Several bioinformatics tools have been developed to detect CRISPR arrays based solely on DNA sequences, but all these tools employ the same strategy of looking for repetitive patterns, which might correspond to CRISPR array repeats. The identified patterns are evaluated using a fixed, built-in scoring function, and arrays exceeding a cut-off value are reported. Here, we instead introduce a data-driven approach that uses machine learning to detect and differentiate true CRISPR arrays from false ones based on several features. Our CRISPR detection tool, CRISPRidentify, performs three steps: detection, feature extraction and classification based on manually curated sets of positive and negative examples of CRISPR arrays. The identified CRISPR arrays are then reported to the user accompanied by detailed annotation. We demonstrate that our approach identifies not only previously detected CRISPR arrays, but also CRISPR array candidates not detected by other tools. 
Compared to other methods, our tool has a drastically reduced false positive rate. In contrast to the existing tools, our approach not only provides the user with the basic statistics on the identified CRISPR arrays but also produces a certainty score as a practical measure of the likelihood that a given genomic region is a CRISPR array.</description><subject>Clustered Regularly Interspaced Short Palindromic Repeats</subject><subject>Genome, Archaeal</subject><subject>Genome, Bacterial</subject><subject>Machine Learning</subject><subject>Methods Online</subject><subject>Software</subject><issn>0305-1048</issn><issn>1362-4962</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2021</creationdate><recordtype>article</recordtype><sourceid>TOX</sourceid><sourceid>EIF</sourceid><recordid>eNp9kUlPwzAQhS0EoqVw44xygwOhdrwk5oCEKpZKlYoKnK2p47SGNA52g9R_T6ouggunGc18erM8hM4JviFY0n4Fvj_7BCCEZweoS6hIYiZFcoi6mGIeE8yyDjoJ4QNjwghnx6hDaSIxx7yLxoPJ8PVlYnNTLW2xuo22mdWwtK6KXBFtiAi8h1WImmCrWbQAPbeViUoDvloXoK69a4un6KiAMpizbeyh98eHt8FzPBo_DQf3o1gzIpYx19qAlNoI0JgSnWOdJZixrCBTJkRuGE25FCKROk9lrjnFKQVG2hNEwqeS9tDdRrdupguT63ZpD6WqvV2AXykHVv3tVHauZu5bpZLQVNBW4Gor4N1XY8JSLWzQpiyhMq4JKmEiEzyjlLXo9QbV3oXgTbEfQ7Bae6BaD9TOgxa_-L3aHt49vQUuN4Br6v-lfgC8W5D-</recordid><startdate>20210226</startdate><enddate>20210226</enddate><creator>Mitrofanov, Alexander</creator><creator>Alkhnbashi, Omer S</creator><creator>Shmakov, Sergey A</creator><creator>Makarova, Kira S</creator><creator>Koonin, Eugene V</creator><creator>Backofen, Rolf</creator><general>Oxford University 
Press</general><scope>TOX</scope><scope>CGR</scope><scope>CUY</scope><scope>CVF</scope><scope>ECM</scope><scope>EIF</scope><scope>NPM</scope><scope>AAYXX</scope><scope>CITATION</scope><scope>7X8</scope><scope>5PM</scope><orcidid>https://orcid.org/0000-0001-8088-590X</orcidid><orcidid>https://orcid.org/0000-0003-3943-8299</orcidid><orcidid>https://orcid.org/0000-0001-8231-3323</orcidid><orcidid>https://orcid.org/0000-0002-8174-2844</orcidid></search><sort><creationdate>20210226</creationdate><title>CRISPRidentify: identification of CRISPR arrays using machine learning approach</title><author>Mitrofanov, Alexander ; Alkhnbashi, Omer S ; Shmakov, Sergey A ; Makarova, Kira S ; Koonin, Eugene V ; Backofen, Rolf</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-c416t-5ccea99ce6ac031cd0c820448f1b466de437596629cd79dc53073a41001625b93</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2021</creationdate><topic>Clustered Regularly Interspaced Short Palindromic Repeats</topic><topic>Genome, Archaeal</topic><topic>Genome, Bacterial</topic><topic>Machine Learning</topic><topic>Methods Online</topic><topic>Software</topic><toplevel>peer_reviewed</toplevel><toplevel>online_resources</toplevel><creatorcontrib>Mitrofanov, Alexander</creatorcontrib><creatorcontrib>Alkhnbashi, Omer S</creatorcontrib><creatorcontrib>Shmakov, Sergey A</creatorcontrib><creatorcontrib>Makarova, Kira S</creatorcontrib><creatorcontrib>Koonin, Eugene V</creatorcontrib><creatorcontrib>Backofen, Rolf</creatorcontrib><collection>Oxford University Press Open Access</collection><collection>Medline</collection><collection>MEDLINE</collection><collection>MEDLINE (Ovid)</collection><collection>MEDLINE</collection><collection>MEDLINE</collection><collection>PubMed</collection><collection>CrossRef</collection><collection>MEDLINE - Academic</collection><collection>PubMed Central (Full Participant titles)</collection><jtitle>Nucleic 
acids research</jtitle></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>Mitrofanov, Alexander</au><au>Alkhnbashi, Omer S</au><au>Shmakov, Sergey A</au><au>Makarova, Kira S</au><au>Koonin, Eugene V</au><au>Backofen, Rolf</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>CRISPRidentify: identification of CRISPR arrays using machine learning approach</atitle><jtitle>Nucleic acids research</jtitle><addtitle>Nucleic Acids Res</addtitle><date>2021-02-26</date><risdate>2021</risdate><volume>49</volume><issue>4</issue><spage>e20</spage><epage>e20</epage><pages>e20-e20</pages><issn>0305-1048</issn><eissn>1362-4962</eissn><abstract>Abstract CRISPR–Cas are adaptive immune systems that degrade foreign genetic elements in archaea and bacteria. In carrying out their immune functions, CRISPR–Cas systems heavily rely on RNA components. These CRISPR (cr) RNAs are repeat-spacer units that are produced by processing of pre-crRNA, the transcript of CRISPR arrays, and guide Cas protein(s) to the cognate invading nucleic acids, enabling their destruction. Several bioinformatics tools have been developed to detect CRISPR arrays based solely on DNA sequences, but all these tools employ the same strategy of looking for repetitive patterns, which might correspond to CRISPR array repeats. The identified patterns are evaluated using a fixed, built-in scoring function, and arrays exceeding a cut-off value are reported. Here, we instead introduce a data-driven approach that uses machine learning to detect and differentiate true CRISPR arrays from false ones based on several features. Our CRISPR detection tool, CRISPRidentify, performs three steps: detection, feature extraction and classification based on manually curated sets of positive and negative examples of CRISPR arrays. The identified CRISPR arrays are then reported to the user accompanied by detailed annotation. 
We demonstrate that our approach identifies not only previously detected CRISPR arrays, but also CRISPR array candidates not detected by other tools. Compared to other methods, our tool has a drastically reduced false positive rate. In contrast to the existing tools, our approach not only provides the user with the basic statistics on the identified CRISPR arrays but also produces a certainty score as a practical measure of the likelihood that a given genomic region is a CRISPR array.</abstract><cop>England</cop><pub>Oxford University Press</pub><pmid>33290505</pmid><doi>10.1093/nar/gkaa1158</doi><orcidid>https://orcid.org/0000-0001-8088-590X</orcidid><orcidid>https://orcid.org/0000-0003-3943-8299</orcidid><orcidid>https://orcid.org/0000-0001-8231-3323</orcidid><orcidid>https://orcid.org/0000-0002-8174-2844</orcidid><oa>free_for_read</oa></addata></record>
fulltext fulltext
identifier ISSN: 0305-1048
ispartof Nucleic acids research, 2021-02, Vol.49 (4), p.e20-e20
issn 0305-1048
1362-4962
language eng
recordid cdi_pubmedcentral_primary_oai_pubmedcentral_nih_gov_7913763
source MEDLINE; Full-Text Journals in Chemistry (Open access); DOAJ Directory of Open Access Journals; Oxford University Press Open Access; PubMed Central
subjects Clustered Regularly Interspaced Short Palindromic Repeats
Genome, Archaeal
Genome, Bacterial
Machine Learning
Methods Online
Software
title CRISPRidentify: identification of CRISPR arrays using machine learning approach
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-07T17%3A53%3A30IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_pubme&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=CRISPRidentify:%20identification%20of%20CRISPR%20arrays%20using%20machine%20learning%20approach&rft.jtitle=Nucleic%20acids%20research&rft.au=Mitrofanov,%20Alexander&rft.date=2021-02-26&rft.volume=49&rft.issue=4&rft.spage=e20&rft.epage=e20&rft.pages=e20-e20&rft.issn=0305-1048&rft.eissn=1362-4962&rft_id=info:doi/10.1093/nar/gkaa1158&rft_dat=%3Cproquest_pubme%3E2468658334%3C/proquest_pubme%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=2468658334&rft_id=info:pmid/33290505&rft_oup_id=10.1093/nar/gkaa1158&rfr_iscdi=true