Software for pre-processing Illumina next-generation sequencing short read sequences

When compared to Sanger sequencing technology, next-generation sequencing (NGS) technologies are hindered by shorter sequence read length, higher base-call error rate, non-uniform coverage, and platform-specific sequencing artifacts. These characteristics lower the quality of their downstream analys...

Detailed description

Bibliographic details
Published in: Source code for biology and medicine 2014-05, Vol.9 (1), p.8-8, Article 8
Main authors: Chen, Chuming; Khaleel, Sari S; Huang, Hongzhan; Wu, Cathy H
Format: Article
Language: eng
Subjects:
Online access: Full text
container_end_page 8
container_issue 1
container_start_page 8
container_title Source code for biology and medicine
container_volume 9
creator Chen, Chuming
Khaleel, Sari S
Huang, Hongzhan
Wu, Cathy H
description When compared to Sanger sequencing technology, next-generation sequencing (NGS) technologies are hindered by shorter sequence read length, higher base-call error rate, non-uniform coverage, and platform-specific sequencing artifacts. These characteristics lower the quality of their downstream analyses, e.g. de novo and reference-based assembly, by introducing sequencing artifacts and errors that may contribute to incorrect interpretation of data. Although many tools have been developed for quality control and pre-processing of NGS data, none of them provide flexible and comprehensive trimming options in conjunction with parallel processing to expedite pre-processing of large NGS datasets. We developed ngsShoRT (next-generation sequencing Short Reads Trimmer), a flexible and comprehensive open-source software package written in Perl that provides a set of algorithms commonly used for pre-processing NGS short read sequences. We compared the features and performance of ngsShoRT with existing tools: CutAdapt, NGS QC Toolkit and Trimmomatic. We also compared the effects of using pre-processed short read sequences generated by different algorithms on de novo and reference-based assembly for three different genomes: Caenorhabditis elegans, Saccharomyces cerevisiae S288c, and Escherichia coli O157 H7. Several combinations of ngsShoRT algorithms were tested on publicly available Illumina GA II, HiSeq 2000, and MiSeq eukaryotic and bacteria genomic short read sequences with the focus on removing sequencing artifacts and low-quality reads and/or bases. Our results show that across three organisms and three sequencing platforms, trimming improved the mean quality scores of trimmed sequences. Using trimmed sequences for de novo and reference-based assembly improved assembly quality as well as assembler performance. 
In general, ngsShoRT outperformed comparable trimming tools in terms of trimming speed and improvement of de novo and reference-based assembly as measured by assembly contiguity and correctness. Trimming of short read sequences can improve the quality of de novo and reference-based assembly and assembler performance. The parallel processing capability of ngsShoRT reduces trimming time and improves the memory efficiency when dealing with large datasets. We recommend combining sequencing artifacts removal, and quality score based read filtering and base trimming as the most consistent method for improving sequence quality and downstream assemblies. ngsShoRT so
doi_str_mv 10.1186/1751-0473-9-8
format Article
fullrecord <record><control><sourceid>gale_pubme</sourceid><recordid>TN_cdi_pubmedcentral_primary_oai_pubmedcentral_nih_gov_4064128</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><galeid>A540652366</galeid><sourcerecordid>A540652366</sourcerecordid><originalsourceid>FETCH-LOGICAL-c548t-8295246154af5a74d6c30a7082f77cbdcd3fe62ff7f2eaef4665c30fc0b3d7533</originalsourceid><addsrcrecordid>eNptks9vFSEQx4nR2B969Go28eKFym_Yi0nTqG3SxIP1THjs8EqzC0_Yrfrfy6btszWGA8zwme_MMCD0hpITSo36QLWkmAjNcY_NM3S4t58_Oh-go1pvCJGUK_YSHTDRS0lJf4iuvuUw_3QFupBLtyuAdyV7qDWmbXcxjssUk-sS_JrxFhIUN8ecugo_Fkh-Zep1LnNXwA0PXqiv0Ivgxgqv7_dj9P3zp6uzc3z59cvF2ekl9lKYGRvWSyYUlcIF6bQYlOfEaWJY0NpvBj_wAIqFoAMDB0EoJRsRPNnwQUvOj9HHO93dsplg8JDm4ka7K3Fy5bfNLtqnNyle222-tYIoQZlpAu_vBUputdfZTrF6GEeXIC_VUsl73UAjGvruH_QmLyW19hrFJDGEC_OX2roRbEwht7x-FbWnsmWVjCvVqJP_UG0NMEWfE4TY_E8C8F2AL7nWAmHfIyV2_QZ2HbVdR217u5bx9vHD7OmHufM_mzWtNA</addsrcrecordid><sourcetype>Open Access Repository</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>1525080348</pqid></control><display><type>article</type><title>Software for pre-processing Illumina next-generation sequencing short read sequences</title><source>PubMed Central Open Access</source><source>Access via BioMed Central</source><source>EZB-FREE-00999 freely available EZB journals</source><source>PubMed Central</source><source>Springer Nature OA/Free Journals</source><creator>Chen, Chuming ; Khaleel, Sari S ; Huang, Hongzhan ; Wu, Cathy H</creator><creatorcontrib>Chen, Chuming ; Khaleel, Sari S ; Huang, Hongzhan ; Wu, Cathy H</creatorcontrib><description>When compared to Sanger sequencing technology, next-generation sequencing (NGS) technologies are hindered by shorter sequence read length, higher base-call error rate, non-uniform coverage, and platform-specific sequencing artifacts. These characteristics lower the quality of their downstream analyses, e.g. 
de novo and reference-based assembly, by introducing sequencing artifacts and errors that may contribute to incorrect interpretation of data. Although many tools have been developed for quality control and pre-processing of NGS data, none of them provide flexible and comprehensive trimming options in conjunction with parallel processing to expedite pre-processing of large NGS datasets. We developed ngsShoRT (next-generation sequencing Short Reads Trimmer), a flexible and comprehensive open-source software package written in Perl that provides a set of algorithms commonly used for pre-processing NGS short read sequences. We compared the features and performance of ngsShoRT with existing tools: CutAdapt, NGS QC Toolkit and Trimmomatic. We also compared the effects of using pre-processed short read sequences generated by different algorithms on de novo and reference-based assembly for three different genomes: Caenorhabditis elegans, Saccharomyces cerevisiae S288c, and Escherichia coli O157 H7. Several combinations of ngsShoRT algorithms were tested on publicly available Illumina GA II, HiSeq 2000, and MiSeq eukaryotic and bacteria genomic short read sequences with the focus on removing sequencing artifacts and low-quality reads and/or bases. Our results show that across three organisms and three sequencing platforms, trimming improved the mean quality scores of trimmed sequences. Using trimmed sequences for de novo and reference-based assembly improved assembly quality as well as assembler performance. In general, ngsShoRT outperformed comparable trimming tools in terms of trimming speed and improvement of de novo and reference-based assembly as measured by assembly contiguity and correctness. Trimming of short read sequences can improve the quality of de novo and reference-based assembly and assembler performance. The parallel processing capability of ngsShoRT reduces trimming time and improves the memory efficiency when dealing with large datasets. 
We recommend combining sequencing artifacts removal, and quality score based read filtering and base trimming as the most consistent method for improving sequence quality and downstream assemblies. ngsShoRT source code, user guide and tutorial are available at http://research.bioinformatics.udel.edu/genomics/ngsShoRT/. ngsShoRT can be incorporated as a pre-processing step in genome and transcriptome assembly projects.</description><identifier>ISSN: 1751-0473</identifier><identifier>EISSN: 1751-0473</identifier><identifier>DOI: 10.1186/1751-0473-9-8</identifier><identifier>PMID: 24955109</identifier><language>eng</language><publisher>England: BioMed Central Ltd</publisher><subject>Algorithms ; Comparative analysis ; Data analysis ; DNA sequencing ; Escherichia coli ; Experiments ; Genomes ; Methods ; Multiprocessing ; Nucleotide sequencing ; Software packages ; Studies</subject><ispartof>Source code for biology and medicine, 2014-05, Vol.9 (1), p.8-8, Article 8</ispartof><rights>COPYRIGHT 2014 BioMed Central Ltd.</rights><rights>2014 Chen et al.; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited.</rights><rights>Copyright © 2014 Chen et al.; licensee BioMed Central Ltd. 
2014 Chen et al.; licensee BioMed Central Ltd.</rights><lds50>peer_reviewed</lds50><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed><citedby>FETCH-LOGICAL-c548t-8295246154af5a74d6c30a7082f77cbdcd3fe62ff7f2eaef4665c30fc0b3d7533</citedby><cites>FETCH-LOGICAL-c548t-8295246154af5a74d6c30a7082f77cbdcd3fe62ff7f2eaef4665c30fc0b3d7533</cites></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><linktopdf>$$Uhttps://www.ncbi.nlm.nih.gov/pmc/articles/PMC4064128/pdf/$$EPDF$$P50$$Gpubmedcentral$$Hfree_for_read</linktopdf><linktohtml>$$Uhttps://www.ncbi.nlm.nih.gov/pmc/articles/PMC4064128/$$EHTML$$P50$$Gpubmedcentral$$Hfree_for_read</linktohtml><link.rule.ids>230,314,727,780,784,885,27924,27925,53791,53793</link.rule.ids><backlink>$$Uhttps://www.ncbi.nlm.nih.gov/pubmed/24955109$$D View this record in MEDLINE/PubMed$$Hfree_for_read</backlink></links><search><creatorcontrib>Chen, Chuming</creatorcontrib><creatorcontrib>Khaleel, Sari S</creatorcontrib><creatorcontrib>Huang, Hongzhan</creatorcontrib><creatorcontrib>Wu, Cathy H</creatorcontrib><title>Software for pre-processing Illumina next-generation sequencing short read sequences</title><title>Source code for biology and medicine</title><addtitle>Source Code Biol Med</addtitle><description>When compared to Sanger sequencing technology, next-generation sequencing (NGS) technologies are hindered by shorter sequence read length, higher base-call error rate, non-uniform coverage, and platform-specific sequencing artifacts. These characteristics lower the quality of their downstream analyses, e.g. de novo and reference-based assembly, by introducing sequencing artifacts and errors that may contribute to incorrect interpretation of data. 
Although many tools have been developed for quality control and pre-processing of NGS data, none of them provide flexible and comprehensive trimming options in conjunction with parallel processing to expedite pre-processing of large NGS datasets. We developed ngsShoRT (next-generation sequencing Short Reads Trimmer), a flexible and comprehensive open-source software package written in Perl that provides a set of algorithms commonly used for pre-processing NGS short read sequences. We compared the features and performance of ngsShoRT with existing tools: CutAdapt, NGS QC Toolkit and Trimmomatic. We also compared the effects of using pre-processed short read sequences generated by different algorithms on de novo and reference-based assembly for three different genomes: Caenorhabditis elegans, Saccharomyces cerevisiae S288c, and Escherichia coli O157 H7. Several combinations of ngsShoRT algorithms were tested on publicly available Illumina GA II, HiSeq 2000, and MiSeq eukaryotic and bacteria genomic short read sequences with the focus on removing sequencing artifacts and low-quality reads and/or bases. Our results show that across three organisms and three sequencing platforms, trimming improved the mean quality scores of trimmed sequences. Using trimmed sequences for de novo and reference-based assembly improved assembly quality as well as assembler performance. In general, ngsShoRT outperformed comparable trimming tools in terms of trimming speed and improvement of de novo and reference-based assembly as measured by assembly contiguity and correctness. Trimming of short read sequences can improve the quality of de novo and reference-based assembly and assembler performance. The parallel processing capability of ngsShoRT reduces trimming time and improves the memory efficiency when dealing with large datasets. 
We recommend combining sequencing artifacts removal, and quality score based read filtering and base trimming as the most consistent method for improving sequence quality and downstream assemblies. ngsShoRT source code, user guide and tutorial are available at http://research.bioinformatics.udel.edu/genomics/ngsShoRT/. ngsShoRT can be incorporated as a pre-processing step in genome and transcriptome assembly projects.</description><subject>Algorithms</subject><subject>Comparative analysis</subject><subject>Data analysis</subject><subject>DNA sequencing</subject><subject>Escherichia coli</subject><subject>Experiments</subject><subject>Genomes</subject><subject>Methods</subject><subject>Multiprocessing</subject><subject>Nucleotide sequencing</subject><subject>Software packages</subject><subject>Studies</subject><issn>1751-0473</issn><issn>1751-0473</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2014</creationdate><recordtype>article</recordtype><sourceid>ABUWG</sourceid><sourceid>AFKRA</sourceid><sourceid>AZQEC</sourceid><sourceid>BENPR</sourceid><sourceid>CCPQU</sourceid><sourceid>DWQXO</sourceid><sourceid>GNUQQ</sourceid><recordid>eNptks9vFSEQx4nR2B969Go28eKFym_Yi0nTqG3SxIP1THjs8EqzC0_Yrfrfy6btszWGA8zwme_MMCD0hpITSo36QLWkmAjNcY_NM3S4t58_Oh-go1pvCJGUK_YSHTDRS0lJf4iuvuUw_3QFupBLtyuAdyV7qDWmbXcxjssUk-sS_JrxFhIUN8ecugo_Fkh-Zep1LnNXwA0PXqiv0Ivgxgqv7_dj9P3zp6uzc3z59cvF2ekl9lKYGRvWSyYUlcIF6bQYlOfEaWJY0NpvBj_wAIqFoAMDB0EoJRsRPNnwQUvOj9HHO93dsplg8JDm4ka7K3Fy5bfNLtqnNyle222-tYIoQZlpAu_vBUputdfZTrF6GEeXIC_VUsl73UAjGvruH_QmLyW19hrFJDGEC_OX2roRbEwht7x-FbWnsmWVjCvVqJP_UG0NMEWfE4TY_E8C8F2AL7nWAmHfIyV2_QZ2HbVdR217u5bx9vHD7OmHufM_mzWtNA</recordid><startdate>20140503</startdate><enddate>20140503</enddate><creator>Chen, Chuming</creator><creator>Khaleel, Sari S</creator><creator>Huang, Hongzhan</creator><creator>Wu, Cathy H</creator><general>BioMed Central Ltd</general><general>BioMed 
Central</general><scope>NPM</scope><scope>AAYXX</scope><scope>CITATION</scope><scope>3V.</scope><scope>7X7</scope><scope>7XB</scope><scope>8FD</scope><scope>8FE</scope><scope>8FG</scope><scope>8FH</scope><scope>8FI</scope><scope>8FJ</scope><scope>8FK</scope><scope>ABJCF</scope><scope>ABUWG</scope><scope>AFKRA</scope><scope>AZQEC</scope><scope>BBNVY</scope><scope>BENPR</scope><scope>BGLVJ</scope><scope>BHPHI</scope><scope>CCPQU</scope><scope>DWQXO</scope><scope>FR3</scope><scope>FYUFA</scope><scope>GHDGH</scope><scope>GNUQQ</scope><scope>HCIFZ</scope><scope>K9.</scope><scope>L6V</scope><scope>LK8</scope><scope>M0S</scope><scope>M7P</scope><scope>M7S</scope><scope>M7Z</scope><scope>P64</scope><scope>PIMPY</scope><scope>PQEST</scope><scope>PQQKQ</scope><scope>PQUKI</scope><scope>PRINS</scope><scope>PTHSS</scope><scope>7X8</scope><scope>5PM</scope></search><sort><creationdate>20140503</creationdate><title>Software for pre-processing Illumina next-generation sequencing short read sequences</title><author>Chen, Chuming ; Khaleel, Sari S ; Huang, Hongzhan ; Wu, Cathy H</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-c548t-8295246154af5a74d6c30a7082f77cbdcd3fe62ff7f2eaef4665c30fc0b3d7533</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2014</creationdate><topic>Algorithms</topic><topic>Comparative analysis</topic><topic>Data analysis</topic><topic>DNA sequencing</topic><topic>Escherichia coli</topic><topic>Experiments</topic><topic>Genomes</topic><topic>Methods</topic><topic>Multiprocessing</topic><topic>Nucleotide sequencing</topic><topic>Software packages</topic><topic>Studies</topic><toplevel>peer_reviewed</toplevel><toplevel>online_resources</toplevel><creatorcontrib>Chen, Chuming</creatorcontrib><creatorcontrib>Khaleel, Sari S</creatorcontrib><creatorcontrib>Huang, Hongzhan</creatorcontrib><creatorcontrib>Wu, Cathy 
H</creatorcontrib><collection>PubMed</collection><collection>CrossRef</collection><collection>ProQuest Central (Corporate)</collection><collection>Health &amp; Medical Collection</collection><collection>ProQuest Central (purchase pre-March 2016)</collection><collection>Technology Research Database</collection><collection>ProQuest SciTech Collection</collection><collection>ProQuest Technology Collection</collection><collection>ProQuest Natural Science Collection</collection><collection>Hospital Premium Collection</collection><collection>Hospital Premium Collection (Alumni Edition)</collection><collection>ProQuest Central (Alumni) (purchase pre-March 2016)</collection><collection>Materials Science &amp; Engineering Collection</collection><collection>ProQuest Central (Alumni Edition)</collection><collection>ProQuest Central UK/Ireland</collection><collection>ProQuest Central Essentials</collection><collection>Biological Science Collection</collection><collection>ProQuest Central</collection><collection>Technology Collection</collection><collection>Natural Science Collection</collection><collection>ProQuest One Community College</collection><collection>ProQuest Central Korea</collection><collection>Engineering Research Database</collection><collection>Health Research Premium Collection</collection><collection>Health Research Premium Collection (Alumni)</collection><collection>ProQuest Central Student</collection><collection>SciTech Premium Collection</collection><collection>ProQuest Health &amp; Medical Complete (Alumni)</collection><collection>ProQuest Engineering Collection</collection><collection>ProQuest Biological Science Collection</collection><collection>Health &amp; Medical Collection (Alumni Edition)</collection><collection>Biological Science Database</collection><collection>Engineering Database</collection><collection>Biochemistry Abstracts 1</collection><collection>Biotechnology and BioEngineering Abstracts</collection><collection>Access via ProQuest (Open 
Access)</collection><collection>ProQuest One Academic Eastern Edition (DO NOT USE)</collection><collection>ProQuest One Academic</collection><collection>ProQuest One Academic UKI Edition</collection><collection>ProQuest Central China</collection><collection>Engineering Collection</collection><collection>MEDLINE - Academic</collection><collection>PubMed Central (Full Participant titles)</collection><jtitle>Source code for biology and medicine</jtitle></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>Chen, Chuming</au><au>Khaleel, Sari S</au><au>Huang, Hongzhan</au><au>Wu, Cathy H</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>Software for pre-processing Illumina next-generation sequencing short read sequences</atitle><jtitle>Source code for biology and medicine</jtitle><addtitle>Source Code Biol Med</addtitle><date>2014-05-03</date><risdate>2014</risdate><volume>9</volume><issue>1</issue><spage>8</spage><epage>8</epage><pages>8-8</pages><artnum>8</artnum><issn>1751-0473</issn><eissn>1751-0473</eissn><abstract>When compared to Sanger sequencing technology, next-generation sequencing (NGS) technologies are hindered by shorter sequence read length, higher base-call error rate, non-uniform coverage, and platform-specific sequencing artifacts. These characteristics lower the quality of their downstream analyses, e.g. de novo and reference-based assembly, by introducing sequencing artifacts and errors that may contribute to incorrect interpretation of data. Although many tools have been developed for quality control and pre-processing of NGS data, none of them provide flexible and comprehensive trimming options in conjunction with parallel processing to expedite pre-processing of large NGS datasets. 
We developed ngsShoRT (next-generation sequencing Short Reads Trimmer), a flexible and comprehensive open-source software package written in Perl that provides a set of algorithms commonly used for pre-processing NGS short read sequences. We compared the features and performance of ngsShoRT with existing tools: CutAdapt, NGS QC Toolkit and Trimmomatic. We also compared the effects of using pre-processed short read sequences generated by different algorithms on de novo and reference-based assembly for three different genomes: Caenorhabditis elegans, Saccharomyces cerevisiae S288c, and Escherichia coli O157 H7. Several combinations of ngsShoRT algorithms were tested on publicly available Illumina GA II, HiSeq 2000, and MiSeq eukaryotic and bacteria genomic short read sequences with the focus on removing sequencing artifacts and low-quality reads and/or bases. Our results show that across three organisms and three sequencing platforms, trimming improved the mean quality scores of trimmed sequences. Using trimmed sequences for de novo and reference-based assembly improved assembly quality as well as assembler performance. In general, ngsShoRT outperformed comparable trimming tools in terms of trimming speed and improvement of de novo and reference-based assembly as measured by assembly contiguity and correctness. Trimming of short read sequences can improve the quality of de novo and reference-based assembly and assembler performance. The parallel processing capability of ngsShoRT reduces trimming time and improves the memory efficiency when dealing with large datasets. We recommend combining sequencing artifacts removal, and quality score based read filtering and base trimming as the most consistent method for improving sequence quality and downstream assemblies. ngsShoRT source code, user guide and tutorial are available at http://research.bioinformatics.udel.edu/genomics/ngsShoRT/. 
ngsShoRT can be incorporated as a pre-processing step in genome and transcriptome assembly projects.</abstract><cop>England</cop><pub>BioMed Central Ltd</pub><pmid>24955109</pmid><doi>10.1186/1751-0473-9-8</doi><tpages>1</tpages><oa>free_for_read</oa></addata></record>
fulltext fulltext
identifier ISSN: 1751-0473
ispartof Source code for biology and medicine, 2014-05, Vol.9 (1), p.8-8, Article 8
issn 1751-0473
1751-0473
language eng
recordid cdi_pubmedcentral_primary_oai_pubmedcentral_nih_gov_4064128
source PubMed Central Open Access; Access via BioMed Central; EZB-FREE-00999 freely available EZB journals; PubMed Central; Springer Nature OA/Free Journals
subjects Algorithms
Comparative analysis
Data analysis
DNA sequencing
Escherichia coli
Experiments
Genomes
Methods
Multiprocessing
Nucleotide sequencing
Software packages
Studies
title Software for pre-processing Illumina next-generation sequencing short read sequences
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2024-12-21T19%3A38%3A48IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-gale_pubme&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Software%20for%20pre-processing%20Illumina%20next-generation%20sequencing%20short%20read%20sequences&rft.jtitle=Source%20code%20for%20biology%20and%20medicine&rft.au=Chen,%20Chuming&rft.date=2014-05-03&rft.volume=9&rft.issue=1&rft.spage=8&rft.epage=8&rft.pages=8-8&rft.artnum=8&rft.issn=1751-0473&rft.eissn=1751-0473&rft_id=info:doi/10.1186/1751-0473-9-8&rft_dat=%3Cgale_pubme%3EA540652366%3C/gale_pubme%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=1525080348&rft_id=info:pmid/24955109&rft_galeid=A540652366&rfr_iscdi=true