Comparison of Statistical Methods to Classify Environmental Genomic Fragments

"Binning" (or taxonomic classification) of DNA sequence reads is an initial step to analyzing an environmental biological sample. Currently, a homology-based tool, BLAST, is one of the most commonly used tools to label DNA reads, but it is argued that BLAST will quickly lose its classifica...

Bibliographic details
Published in:	IEEE transactions on nanobioscience 2010-12, Vol.9 (4), p.310-316
Main authors:	Rosen, G L ; Essinger, S D
Format:	Magazinearticle
Language:	eng
Subjects:	Accuracy ; Bayes Theorem ; Bayesian classification ; Bioinformatics ; Databases, Genetic ; DNA ; Genome ; Genomics ; language models ; metagenomics ; Metagenomics - methods ; Models, Statistical ; Peptide Fragments - classification ; Sequence Analysis, DNA - methods ; Statistical learning ; Taxonomy ; Training ; Training data

container_end_page	316
container_issue	4
container_start_page	310
container_title	IEEE transactions on nanobioscience
container_volume	9
creator	Rosen, G L ; Essinger, S D
description	"Binning" (or taxonomic classification) of DNA sequence reads is an initial step to analyzing an environmental biological sample. Currently, a homology-based tool, BLAST, is one of the most commonly used tools to label DNA reads, but it is argued that BLAST will quickly lose its classification ability as the genome databases grow. In this paper, we compare the accuracies of a naïve Bayes classifier (NBC) and statistical language model to BLAST for binning reads and demonstrate that NBC obtains good performance for the low cost of computational complexity. On the other hand, the back-off n-gram language model can improve accuracy when only partial training data is available (such as in-progress sequencing projects). NBC demonstrates comparable performance to BLAST and can also be optimized on partial training datasets by adjusting the word feature size. A fivefold cross validation is conducted to compare each method's accuracy for determining novel genomes at different taxonomic levels, with NBC outperforming BLAST for species-level classification but BLAST outperforming NBC for genus-level and phyla-level classification. In conclusion, the NBC is a competitive taxonomic classifier, and language models can improve performance when only partial training data is available.
doi_str_mv	10.1109/TNB.2010.2081375
format	Magazinearticle
fullrecord	<record><control><sourceid>proquest_RIE</sourceid><recordid>TN_cdi_pubmed_primary_20876033</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><ieee_id>5586656</ieee_id><sourcerecordid>861564163</sourcerecordid><originalsourceid>FETCH-LOGICAL-c424t-f08f6e9171ba7ec35b54f82720f1c2ad2c29822c5908d1cef6cd97aaefd6a4b13</originalsourceid><addsrcrecordid>eNqFkctPxCAQh4nR-L6bmJjGi6cqAwXKUTe-Eh8H1zNhKSimLSt0Tfzvpdl1D148MTDf_CbkQ-gI8DkAlhfTp6tzgvON4BqoYBtoFxirS8Kp3BxryksgFeygvZQ-MAbBmdxGOxkXHFO6ix4noZvr6FPoi-CKl0EPPg3e6LZ4tMN7aFIxhGLS6pS8-y6u-y8fQ9_ZfsjEre1D501xE_Xb-JQO0JbTbbKHq3Mfvd5cTyd35cPz7f3k8qE0FamG0uHacStBwEwLayibscrVRBDswBDdEENkTYhhEtcNGOu4aaTQ2rqG62oGdB-dLXPnMXwubBpU55Oxbat7GxZJ1RwYr4DT_8lKMCkwY5k8_UN-hEXs8zdUzYBwAXJcjJeQiSGlaJ2aR9_p-K0Aq1GJykrUqEStlOSRk1XuYtbZZj3w6yADx0vAW2vX7ayRc8bpD_3Dj0Q</addsrcrecordid><sourcetype>Aggregation Database</sourcetype><iscdi>true</iscdi><recordtype>magazinearticle</recordtype><pqid>851267191</pqid></control><display><type>magazinearticle</type><title>Comparison of Statistical Methods to Classify Environmental Genomic Fragments</title><source>IEEE/IET Electronic Library</source><creator>Rosen, G L ; Essinger, S D</creator><creatorcontrib>Rosen, G L ; Essinger, S D</creatorcontrib><description>"Binning" (or taxonomic classification) of DNA sequence reads is an initial step to analyzing an environmental biological sample. Currently, a homology-based tool, BLAST, is one of the most commonly used tools to label DNA reads, but it is argued that BLAST will quickly lose its classification ability as the genome databases grow. In this paper, we compare the accuracies of a naïve Bayes classifier (NBC) and statistical language model to BLAST for binning reads and demonstrate that NBC obtains good performance for the low cost of computational complexity. 
On the other hand, the back-off n-gram language model can improve accuracy when only partial training data is available (such as in-progress sequencing projects). NBC demonstrates comparable performance to BLAST and can also be optimized on partial training datasets by adjusting the word feature size. A fivefold cross validation is conducted to compare each method's accuracy for determining novel genomes at different taxonomic levels, with NBC outperforming BLAST for species-level classification but BLAST outperforming NBC for genus-level and phyla-level classification. In conclusion, the NBC is a competitive taxonomic classifier, and language models can improve performance when only partial training data is available.</description><identifier>ISSN: 1536-1241</identifier><identifier>EISSN: 1558-2639</identifier><identifier>DOI: 10.1109/TNB.2010.2081375</identifier><identifier>PMID: 20876033</identifier><identifier>CODEN: ITMCEL</identifier><language>eng</language><publisher>United States: IEEE</publisher><subject>Accuracy ; Bayes Theorem ; Bayesian classification ; Bioinformatics ; Databases, Genetic ; DNA ; Genome ; Genomics ; language models ; metagenomics ; Metagenomics - methods ; Models, Statistical ; Peptide Fragments - classification ; Sequence Analysis, DNA - methods ; Statistical learning ; Taxonomy ; Training ; Training data</subject><ispartof>IEEE transactions on nanobioscience, 2010-12, Vol.9 (4), p.310-316</ispartof><rights>Copyright The Institute of Electrical and Electronics Engineers, Inc. 
(IEEE) Dec 2010</rights><woscitedreferencessubscribed>false</woscitedreferencessubscribed><citedby>FETCH-LOGICAL-c424t-f08f6e9171ba7ec35b54f82720f1c2ad2c29822c5908d1cef6cd97aaefd6a4b13</citedby><cites>FETCH-LOGICAL-c424t-f08f6e9171ba7ec35b54f82720f1c2ad2c29822c5908d1cef6cd97aaefd6a4b13</cites></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><linktohtml>$$Uhttps://ieeexplore.ieee.org/document/5586656$$EHTML$$P50$$Gieee$$H</linktohtml><link.rule.ids>780,784,796,27925,54758</link.rule.ids><linktorsrc>$$Uhttps://ieeexplore.ieee.org/document/5586656$$EView_record_in_IEEE$$FView_record_in_$$GIEEE</linktorsrc><backlink>$$Uhttps://www.ncbi.nlm.nih.gov/pubmed/20876033$$D View this record in MEDLINE/PubMed$$Hfree_for_read</backlink></links><search><creatorcontrib>Rosen, G L</creatorcontrib><creatorcontrib>Essinger, S D</creatorcontrib><title>Comparison of Statistical Methods to Classify Environmental Genomic Fragments</title><title>IEEE transactions on nanobioscience</title><addtitle>TNB</addtitle><addtitle>IEEE Trans Nanobioscience</addtitle><description>"Binning" (or taxonomic classification) of DNA sequence reads is an initial step to analyzing an environmental biological sample. Currently, a homology-based tool, BLAST, is one of the most commonly used tools to label DNA reads, but it is argued that BLAST will quickly lose its classification ability as the genome databases grow. In this paper, we compare the accuracies of a naïve Bayes classifier (NBC) and statistical language model to BLAST for binning reads and demonstrate that NBC obtains good performance for the low cost of computational complexity. On the other hand, the back-off n-gram language model can improve accuracy when only partial training data is available (such as in-progress sequencing projects). 
NBC demonstrates comparable performance to BLAST and can also be optimized on partial training datasets by adjusting the word feature size. A fivefold cross validation is conducted to compare each method's accuracy for determining novel genomes at different taxonomic levels, with NBC outperforming BLAST for species-level classification but BLAST outperforming NBC for genus-level and phyla-level classification. In conclusion, the NBC is a competitive taxonomic classifier, and language models can improve performance when only partial training data is available.</description><subject>Accuracy</subject><subject>Bayes Theorem</subject><subject>Bayesian classification</subject><subject>Bioinformatics</subject><subject>Databases, Genetic</subject><subject>DNA</subject><subject>Genome</subject><subject>Genomics</subject><subject>language models</subject><subject>metagenomics</subject><subject>Metagenomics - methods</subject><subject>Models, Statistical</subject><subject>Peptide Fragments - classification</subject><subject>Sequence Analysis, DNA - methods</subject><subject>Statistical learning</subject><subject>Taxonomy</subject><subject>Training</subject><subject>Training data</subject><issn>1536-1241</issn><issn>1558-2639</issn><fulltext>true</fulltext><rsrctype>magazinearticle</rsrctype><creationdate>2010</creationdate><recordtype>magazinearticle</recordtype><sourceid>RIE</sourceid><sourceid>EIF</sourceid><recordid>eNqFkctPxCAQh4nR-L6bmJjGi6cqAwXKUTe-Eh8H1zNhKSimLSt0Tfzvpdl1D148MTDf_CbkQ-gI8DkAlhfTp6tzgvON4BqoYBtoFxirS8Kp3BxryksgFeygvZQ-MAbBmdxGOxkXHFO6ix4noZvr6FPoi-CKl0EPPg3e6LZ4tMN7aFIxhGLS6pS8-y6u-y8fQ9_ZfsjEre1D501xE_Xb-JQO0JbTbbKHq3Mfvd5cTyd35cPz7f3k8qE0FamG0uHacStBwEwLayibscrVRBDswBDdEENkTYhhEtcNGOu4aaTQ2rqG62oGdB-dLXPnMXwubBpU55Oxbat7GxZJ1RwYr4DT_8lKMCkwY5k8_UN-hEXs8zdUzYBwAXJcjJeQiSGlaJ2aR9_p-K0Aq1GJykrUqEStlOSRk1XuYtbZZj3w6yADx0vAW2vX7ayRc8bpD_3Dj0Q</recordid><startdate>201012</startdate><enddate>201012</enddate><creator>Rosen, G L</creator><creator>Essinger, S 
D</creator><general>IEEE</general><general>The Institute of Electrical and Electronics Engineers, Inc. (IEEE)</general><scope>97E</scope><scope>RIA</scope><scope>RIE</scope><scope>CGR</scope><scope>CUY</scope><scope>CVF</scope><scope>ECM</scope><scope>EIF</scope><scope>NPM</scope><scope>AAYXX</scope><scope>CITATION</scope><scope>7QF</scope><scope>7QO</scope><scope>7QQ</scope><scope>7SC</scope><scope>7SE</scope><scope>7SP</scope><scope>7SR</scope><scope>7TA</scope><scope>7TB</scope><scope>7U5</scope><scope>8BQ</scope><scope>8FD</scope><scope>F28</scope><scope>FR3</scope><scope>H8D</scope><scope>JG9</scope><scope>JQ2</scope><scope>KR7</scope><scope>L7M</scope><scope>L~C</scope><scope>L~D</scope><scope>P64</scope><scope>7X8</scope><scope>RC3</scope></search><sort><creationdate>201012</creationdate><title>Comparison of Statistical Methods to Classify Environmental Genomic Fragments</title><author>Rosen, G L ; Essinger, S D</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-c424t-f08f6e9171ba7ec35b54f82720f1c2ad2c29822c5908d1cef6cd97aaefd6a4b13</frbrgroupid><rsrctype>magazinearticle</rsrctype><prefilter>magazinearticle</prefilter><language>eng</language><creationdate>2010</creationdate><topic>Accuracy</topic><topic>Bayes Theorem</topic><topic>Bayesian classification</topic><topic>Bioinformatics</topic><topic>Databases, Genetic</topic><topic>DNA</topic><topic>Genome</topic><topic>Genomics</topic><topic>language models</topic><topic>metagenomics</topic><topic>Metagenomics - methods</topic><topic>Models, Statistical</topic><topic>Peptide Fragments - classification</topic><topic>Sequence Analysis, DNA - methods</topic><topic>Statistical learning</topic><topic>Taxonomy</topic><topic>Training</topic><topic>Training data</topic><toplevel>online_resources</toplevel><creatorcontrib>Rosen, G L</creatorcontrib><creatorcontrib>Essinger, S D</creatorcontrib><collection>IEEE All-Society Periodicals Package (ASPP) 2005-present</collection><collection>IEEE 
All-Society Periodicals Package (ASPP) 1998-Present</collection><collection>IEEE/IET Electronic Library</collection><collection>Medline</collection><collection>MEDLINE</collection><collection>MEDLINE (Ovid)</collection><collection>MEDLINE</collection><collection>MEDLINE</collection><collection>PubMed</collection><collection>CrossRef</collection><collection>Aluminium Industry Abstracts</collection><collection>Biotechnology Research Abstracts</collection><collection>Ceramic Abstracts</collection><collection>Computer and Information Systems Abstracts</collection><collection>Corrosion Abstracts</collection><collection>Electronics & Communications Abstracts</collection><collection>Engineered Materials Abstracts</collection><collection>Materials Business File</collection><collection>Mechanical & Transportation Engineering Abstracts</collection><collection>Solid State and Superconductivity Abstracts</collection><collection>METADEX</collection><collection>Technology Research Database</collection><collection>ANTE: Abstracts in New Technology & Engineering</collection><collection>Engineering Research Database</collection><collection>Aerospace Database</collection><collection>Materials Research Database</collection><collection>ProQuest Computer Science Collection</collection><collection>Civil Engineering Abstracts</collection><collection>Advanced Technologies Database with Aerospace</collection><collection>Computer and Information Systems Abstracts Academic</collection><collection>Computer and Information Systems Abstracts Professional</collection><collection>Biotechnology and BioEngineering Abstracts</collection><collection>MEDLINE - Academic</collection><collection>Genetics Abstracts</collection><jtitle>IEEE transactions on nanobioscience</jtitle></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext_linktorsrc</fulltext></delivery><addata><au>Rosen, G L</au><au>Essinger, S 
D</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>Comparison of Statistical Methods to Classify Environmental Genomic Fragments</atitle><jtitle>IEEE transactions on nanobioscience</jtitle><stitle>TNB</stitle><addtitle>IEEE Trans Nanobioscience</addtitle><date>2010-12</date><risdate>2010</risdate><volume>9</volume><issue>4</issue><spage>310</spage><epage>316</epage><pages>310-316</pages><issn>1536-1241</issn><eissn>1558-2639</eissn><coden>ITMCEL</coden><abstract>"Binning" (or taxonomic classification) of DNA sequence reads is an initial step to analyzing an environmental biological sample. Currently, a homology-based tool, BLAST, is one of the most commonly used tools to label DNA reads, but it is argued that BLAST will quickly lose its classification ability as the genome databases grow. In this paper, we compare the accuracies of a naïve Bayes classifier (NBC) and statistical language model to BLAST for binning reads and demonstrate that NBC obtains good performance for the low cost of computational complexity. On the other hand, the back-off n-gram language model can improve accuracy when only partial training data is available (such as in-progress sequencing projects). NBC demonstrates comparable performance to BLAST and can also be optimized on partial training datasets by adjusting the word feature size. A fivefold cross validation is conducted to compare each method's accuracy for determining novel genomes at different taxonomic levels, with NBC outperforming BLAST for species-level classification but BLAST outperforming NBC for genus-level and phyla-level classification. In conclusion, the NBC is a competitive taxonomic classifier, and language models can improve performance when only partial training data is available.</abstract><cop>United States</cop><pub>IEEE</pub><pmid>20876033</pmid><doi>10.1109/TNB.2010.2081375</doi><tpages>7</tpages></addata></record>
fulltext	fulltext_linktorsrc
identifier	ISSN: 1536-1241
ispartof	IEEE transactions on nanobioscience, 2010-12, Vol.9 (4), p.310-316
issn	1536-1241 1558-2639
language	eng
recordid	cdi_pubmed_primary_20876033
source	IEEE/IET Electronic Library
subjects	Accuracy ; Bayes Theorem ; Bayesian classification ; Bioinformatics ; Databases, Genetic ; DNA ; Genome ; Genomics ; language models ; metagenomics ; Metagenomics - methods ; Models, Statistical ; Peptide Fragments - classification ; Sequence Analysis, DNA - methods ; Statistical learning ; Taxonomy ; Training ; Training data
title	Comparison of Statistical Methods to Classify Environmental Genomic Fragments
url	https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2024-12-25T08%3A08%3A56IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_RIE&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Comparison%20of%20Statistical%20Methods%20to%20Classify%20Environmental%20Genomic%20Fragments&rft.jtitle=IEEE%20transactions%20on%20nanobioscience&rft.au=Rosen,%20G%20L&rft.date=2010-12&rft.volume=9&rft.issue=4&rft.spage=310&rft.epage=316&rft.pages=310-316&rft.issn=1536-1241&rft.eissn=1558-2639&rft.coden=ITMCEL&rft_id=info:doi/10.1109/TNB.2010.2081375&rft_dat=%3Cproquest_RIE%3E861564163%3C/proquest_RIE%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=851267191&rft_id=info:pmid/20876033&rft_ieee_id=5586656&rfr_iscdi=true