Accuracy of de novo assembly of DNA sequences from double‐digest libraries varies substantially among software

Advances in DNA sequencing have made it feasible to gather genomic data for non‐model organisms and large sets of individuals, often using methods for sequencing subsets of the genome. Several of these methods sequence DNA associated with endonuclease restriction sites (various RAD and GBS methods)....

Bibliographic details
Published in: Molecular ecology resources, 2020-03, Vol.20 (2), p.360-370
Main authors: LaCava, Melanie E. F.; Aikens, Ellen O.; Megna, Libby C.; Randolph, Gregg; Hubbard, Charley; Buerkle, C. Alex
Format: Article
Language: eng
Subjects:
Online access: Full text
container_end_page 370
container_issue 2
container_start_page 360
container_title Molecular ecology resources
container_volume 20
creator LaCava, Melanie E. F.
Aikens, Ellen O.
Megna, Libby C.
Randolph, Gregg
Hubbard, Charley
Buerkle, C. Alex
description Advances in DNA sequencing have made it feasible to gather genomic data for non‐model organisms and large sets of individuals, often using methods for sequencing subsets of the genome. Several of these methods sequence DNA associated with endonuclease restriction sites (various RAD and GBS methods). For use in taxa without a reference genome, these methods rely on de novo assembly of fragments in the sequencing library. Many of the software options available for this application were originally developed for other assembly types and we do not know their accuracy for reduced representation libraries. To address this important knowledge gap, we simulated data from the Arabidopsis thaliana and Homo sapiens genomes and compared de novo assemblies by six software programs that are commonly used or promising for this purpose (ABySS, CD‐HIT, Stacks, Stacks2, Velvet and VSEARCH). We simulated different mutation rates and types of mutations, and then applied the six assemblers to the simulated data sets, varying assembly parameters. We found substantial variation in software performance across simulations and parameter settings. ABySS failed to recover any true genome fragments, and Velvet and VSEARCH performed poorly for most simulations. Stacks and Stacks2 produced accurate assemblies of simulations containing SNPs, but the addition of insertion and deletion mutations decreased their performance. CD‐HIT was the only assembler that consistently recovered a high proportion of true genome fragments. Here, we demonstrate the substantial difference in the accuracy of assemblies from different software programs and the importance of comparing assemblies that result from different parameter settings.
doi_str_mv 10.1111/1755-0998.13108
format Article
fullrecord <record><control><sourceid>proquest_cross</sourceid><recordid>TN_cdi_proquest_miscellaneous_2310716050</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><sourcerecordid>2366246014</sourcerecordid><originalsourceid>FETCH-LOGICAL-c4128-ca08f3fc19e49958ac8d631b9d41025825611fc525075e733664879b2976ef633</originalsourceid><addsrcrecordid>eNqFkbtOwzAUhi0EgnKZ2ZAlFpaCncSOPVblKnFZQGKzHOcYBSVxsZNW3XgEnpEnwSXQgQUvxzr-_OnoPwgdUnJK4zmjOWNjIqU4pSklYgON1p3N9V0876DdEF4J4UTm2TbaSSnnjGX5CM0mxvRemyV2FpeAWzd3WIcATVF_987vJzjAWw-tgYCtdw0uXV_U8Pn-UVYvEDpcV4XXvorP86GEvgidbrtK11GiG9e-4OBst9Ae9tGW1XWAg5-6h54uLx6n1-Pbh6ub6eR2bDKaiLHRRNjUGiohk5IJbUTJU1rIMqMkYSJhnFJrWMJIziBPU84zkcsikTkHy9N0D50M3pl3cfrQqaYKBupat-D6oJKYV045YSSix3_QV9f7Nk4XKc6TjBOaRepsoIx3IXiwauarRvulokStlqFWcatV9Op7GfHH0Y-3Lxoo1_xv-hFgA7Coalj-51N3F_eD-AvTvpPw</addsrcrecordid><sourcetype>Aggregation Database</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>2366246014</pqid></control><display><type>article</type><title>Accuracy of de novo assembly of DNA sequences from double‐digest libraries varies substantially among software</title><source>Wiley Online Library Journals Frontfile Complete</source><creator>LaCava, Melanie E. F. ; Aikens, Ellen O. ; Megna, Libby C. ; Randolph, Gregg ; Hubbard, Charley ; Buerkle, C. Alex</creator><creatorcontrib>LaCava, Melanie E. F. ; Aikens, Ellen O. ; Megna, Libby C. ; Randolph, Gregg ; Hubbard, Charley ; Buerkle, C. Alex</creatorcontrib><description>Advances in DNA sequencing have made it feasible to gather genomic data for non‐model organisms and large sets of individuals, often using methods for sequencing subsets of the genome. Several of these methods sequence DNA associated with endonuclease restriction sites (various RAD and GBS methods). For use in taxa without a reference genome, these methods rely on de novo assembly of fragments in the sequencing library. 
Many of the software options available for this application were originally developed for other assembly types and we do not know their accuracy for reduced representation libraries. To address this important knowledge gap, we simulated data from the Arabidopsis thaliana and Homo sapiens genomes and compared de novo assemblies by six software programs that are commonly used or promising for this purpose (ABySS, CD‐HIT, Stacks, Stacks2, Velvet and VSEARCH). We simulated different mutation rates and types of mutations, and then applied the six assemblers to the simulated data sets, varying assembly parameters. We found substantial variation in software performance across simulations and parameter settings. ABySS failed to recover any true genome fragments, and Velvet and VSEARCH performed poorly for most simulations. Stacks and Stacks2 produced accurate assemblies of simulations containing SNPs, but the addition of insertion and deletion mutations decreased their performance. CD‐HIT was the only assembler that consistently recovered a high proportion of true genome fragments. 
Here, we demonstrate the substantial difference in the accuracy of assemblies from different software programs and the importance of comparing assemblies that result from different parameter settings.</description><identifier>ISSN: 1755-098X</identifier><identifier>EISSN: 1755-0998</identifier><identifier>DOI: 10.1111/1755-0998.13108</identifier><identifier>PMID: 31665547</identifier><language>eng</language><publisher>England: Wiley Subscription Services, Inc</publisher><subject>Accuracy ; Assemblies ; Assembly ; Computer programs ; Computer simulation ; Deoxyribonucleic acid ; DNA ; DNA sequencing ; Endonuclease ; Fragments ; GBS ; Gene deletion ; Genomes ; genomics ; indels ; Insertion ; Mutation ; Mutation rates ; Nucleotide sequence ; paralogs ; Parameters ; population ; RAD ; reference genome ; Simulation ; Single-nucleotide polymorphism ; Software ; Stacks</subject><ispartof>Molecular ecology resources, 2020-03, Vol.20 (2), p.360-370</ispartof><rights>2019 John Wiley &amp; Sons Ltd</rights><rights>2019 John Wiley &amp; Sons Ltd.</rights><rights>Copyright © 2020 John Wiley &amp; Sons Ltd</rights><lds50>peer_reviewed</lds50><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed><citedby>FETCH-LOGICAL-c4128-ca08f3fc19e49958ac8d631b9d41025825611fc525075e733664879b2976ef633</citedby><cites>FETCH-LOGICAL-c4128-ca08f3fc19e49958ac8d631b9d41025825611fc525075e733664879b2976ef633</cites><orcidid>0000-0003-4222-8858 ; 0000-0003-0827-3006 ; 0000-0001-7921-9184 ; 
0000-0003-3887-5729</orcidid></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><linktopdf>$$Uhttps://onlinelibrary.wiley.com/doi/pdf/10.1111%2F1755-0998.13108$$EPDF$$P50$$Gwiley$$H</linktopdf><linktohtml>$$Uhttps://onlinelibrary.wiley.com/doi/full/10.1111%2F1755-0998.13108$$EHTML$$P50$$Gwiley$$H</linktohtml><link.rule.ids>314,776,780,1411,27901,27902,45550,45551</link.rule.ids><backlink>$$Uhttps://www.ncbi.nlm.nih.gov/pubmed/31665547$$D View this record in MEDLINE/PubMed$$Hfree_for_read</backlink></links><search><creatorcontrib>LaCava, Melanie E. F.</creatorcontrib><creatorcontrib>Aikens, Ellen O.</creatorcontrib><creatorcontrib>Megna, Libby C.</creatorcontrib><creatorcontrib>Randolph, Gregg</creatorcontrib><creatorcontrib>Hubbard, Charley</creatorcontrib><creatorcontrib>Buerkle, C. Alex</creatorcontrib><title>Accuracy of de novo assembly of DNA sequences from double‐digest libraries varies substantially among software</title><title>Molecular ecology resources</title><addtitle>Mol Ecol Resour</addtitle><description>Advances in DNA sequencing have made it feasible to gather genomic data for non‐model organisms and large sets of individuals, often using methods for sequencing subsets of the genome. Several of these methods sequence DNA associated with endonuclease restriction sites (various RAD and GBS methods). For use in taxa without a reference genome, these methods rely on de novo assembly of fragments in the sequencing library. Many of the software options available for this application were originally developed for other assembly types and we do not know their accuracy for reduced representation libraries. 
To address this important knowledge gap, we simulated data from the Arabidopsis thaliana and Homo sapiens genomes and compared de novo assemblies by six software programs that are commonly used or promising for this purpose (ABySS, CD‐HIT, Stacks, Stacks2, Velvet and VSEARCH). We simulated different mutation rates and types of mutations, and then applied the six assemblers to the simulated data sets, varying assembly parameters. We found substantial variation in software performance across simulations and parameter settings. ABySS failed to recover any true genome fragments, and Velvet and VSEARCH performed poorly for most simulations. Stacks and Stacks2 produced accurate assemblies of simulations containing SNPs, but the addition of insertion and deletion mutations decreased their performance. CD‐HIT was the only assembler that consistently recovered a high proportion of true genome fragments. Here, we demonstrate the substantial difference in the accuracy of assemblies from different software programs and the importance of comparing assemblies that result from different parameter settings.</description><subject>Accuracy</subject><subject>Assemblies</subject><subject>Assembly</subject><subject>Computer programs</subject><subject>Computer simulation</subject><subject>Deoxyribonucleic acid</subject><subject>DNA</subject><subject>DNA sequencing</subject><subject>Endonuclease</subject><subject>Fragments</subject><subject>GBS</subject><subject>Gene deletion</subject><subject>Genomes</subject><subject>genomics</subject><subject>indels</subject><subject>Insertion</subject><subject>Mutation</subject><subject>Mutation rates</subject><subject>Nucleotide sequence</subject><subject>paralogs</subject><subject>Parameters</subject><subject>population</subject><subject>RAD</subject><subject>reference genome</subject><subject>Simulation</subject><subject>Single-nucleotide 
polymorphism</subject><subject>Software</subject><subject>Stacks</subject><issn>1755-098X</issn><issn>1755-0998</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2020</creationdate><recordtype>article</recordtype><recordid>eNqFkbtOwzAUhi0EgnKZ2ZAlFpaCncSOPVblKnFZQGKzHOcYBSVxsZNW3XgEnpEnwSXQgQUvxzr-_OnoPwgdUnJK4zmjOWNjIqU4pSklYgON1p3N9V0876DdEF4J4UTm2TbaSSnnjGX5CM0mxvRemyV2FpeAWzd3WIcATVF_987vJzjAWw-tgYCtdw0uXV_U8Pn-UVYvEDpcV4XXvorP86GEvgidbrtK11GiG9e-4OBst9Ae9tGW1XWAg5-6h54uLx6n1-Pbh6ub6eR2bDKaiLHRRNjUGiohk5IJbUTJU1rIMqMkYSJhnFJrWMJIziBPU84zkcsikTkHy9N0D50M3pl3cfrQqaYKBupat-D6oJKYV045YSSix3_QV9f7Nk4XKc6TjBOaRepsoIx3IXiwauarRvulokStlqFWcatV9Op7GfHH0Y-3Lxoo1_xv-hFgA7Coalj-51N3F_eD-AvTvpPw</recordid><startdate>202003</startdate><enddate>202003</enddate><creator>LaCava, Melanie E. F.</creator><creator>Aikens, Ellen O.</creator><creator>Megna, Libby C.</creator><creator>Randolph, Gregg</creator><creator>Hubbard, Charley</creator><creator>Buerkle, C. Alex</creator><general>Wiley Subscription Services, Inc</general><scope>NPM</scope><scope>AAYXX</scope><scope>CITATION</scope><scope>7SN</scope><scope>7SS</scope><scope>8FD</scope><scope>C1K</scope><scope>FR3</scope><scope>M7N</scope><scope>P64</scope><scope>RC3</scope><scope>7X8</scope><orcidid>https://orcid.org/0000-0003-4222-8858</orcidid><orcidid>https://orcid.org/0000-0003-0827-3006</orcidid><orcidid>https://orcid.org/0000-0001-7921-9184</orcidid><orcidid>https://orcid.org/0000-0003-3887-5729</orcidid></search><sort><creationdate>202003</creationdate><title>Accuracy of de novo assembly of DNA sequences from double‐digest libraries varies substantially among software</title><author>LaCava, Melanie E. F. ; Aikens, Ellen O. ; Megna, Libby C. ; Randolph, Gregg ; Hubbard, Charley ; Buerkle, C. 
Alex</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-c4128-ca08f3fc19e49958ac8d631b9d41025825611fc525075e733664879b2976ef633</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2020</creationdate><topic>Accuracy</topic><topic>Assemblies</topic><topic>Assembly</topic><topic>Computer programs</topic><topic>Computer simulation</topic><topic>Deoxyribonucleic acid</topic><topic>DNA</topic><topic>DNA sequencing</topic><topic>Endonuclease</topic><topic>Fragments</topic><topic>GBS</topic><topic>Gene deletion</topic><topic>Genomes</topic><topic>genomics</topic><topic>indels</topic><topic>Insertion</topic><topic>Mutation</topic><topic>Mutation rates</topic><topic>Nucleotide sequence</topic><topic>paralogs</topic><topic>Parameters</topic><topic>population</topic><topic>RAD</topic><topic>reference genome</topic><topic>Simulation</topic><topic>Single-nucleotide polymorphism</topic><topic>Software</topic><topic>Stacks</topic><toplevel>peer_reviewed</toplevel><toplevel>online_resources</toplevel><creatorcontrib>LaCava, Melanie E. F.</creatorcontrib><creatorcontrib>Aikens, Ellen O.</creatorcontrib><creatorcontrib>Megna, Libby C.</creatorcontrib><creatorcontrib>Randolph, Gregg</creatorcontrib><creatorcontrib>Hubbard, Charley</creatorcontrib><creatorcontrib>Buerkle, C. 
Alex</creatorcontrib><collection>PubMed</collection><collection>CrossRef</collection><collection>Ecology Abstracts</collection><collection>Entomology Abstracts (Full archive)</collection><collection>Technology Research Database</collection><collection>Environmental Sciences and Pollution Management</collection><collection>Engineering Research Database</collection><collection>Algology Mycology and Protozoology Abstracts (Microbiology C)</collection><collection>Biotechnology and BioEngineering Abstracts</collection><collection>Genetics Abstracts</collection><collection>MEDLINE - Academic</collection><jtitle>Molecular ecology resources</jtitle></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>LaCava, Melanie E. F.</au><au>Aikens, Ellen O.</au><au>Megna, Libby C.</au><au>Randolph, Gregg</au><au>Hubbard, Charley</au><au>Buerkle, C. Alex</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>Accuracy of de novo assembly of DNA sequences from double‐digest libraries varies substantially among software</atitle><jtitle>Molecular ecology resources</jtitle><addtitle>Mol Ecol Resour</addtitle><date>2020-03</date><risdate>2020</risdate><volume>20</volume><issue>2</issue><spage>360</spage><epage>370</epage><pages>360-370</pages><issn>1755-098X</issn><eissn>1755-0998</eissn><abstract>Advances in DNA sequencing have made it feasible to gather genomic data for non‐model organisms and large sets of individuals, often using methods for sequencing subsets of the genome. Several of these methods sequence DNA associated with endonuclease restriction sites (various RAD and GBS methods). For use in taxa without a reference genome, these methods rely on de novo assembly of fragments in the sequencing library. Many of the software options available for this application were originally developed for other assembly types and we do not know their accuracy for reduced representation libraries. 
To address this important knowledge gap, we simulated data from the Arabidopsis thaliana and Homo sapiens genomes and compared de novo assemblies by six software programs that are commonly used or promising for this purpose (ABySS, CD‐HIT, Stacks, Stacks2, Velvet and VSEARCH). We simulated different mutation rates and types of mutations, and then applied the six assemblers to the simulated data sets, varying assembly parameters. We found substantial variation in software performance across simulations and parameter settings. ABySS failed to recover any true genome fragments, and Velvet and VSEARCH performed poorly for most simulations. Stacks and Stacks2 produced accurate assemblies of simulations containing SNPs, but the addition of insertion and deletion mutations decreased their performance. CD‐HIT was the only assembler that consistently recovered a high proportion of true genome fragments. Here, we demonstrate the substantial difference in the accuracy of assemblies from different software programs and the importance of comparing assemblies that result from different parameter settings.</abstract><cop>England</cop><pub>Wiley Subscription Services, Inc</pub><pmid>31665547</pmid><doi>10.1111/1755-0998.13108</doi><tpages>11</tpages><orcidid>https://orcid.org/0000-0003-4222-8858</orcidid><orcidid>https://orcid.org/0000-0003-0827-3006</orcidid><orcidid>https://orcid.org/0000-0001-7921-9184</orcidid><orcidid>https://orcid.org/0000-0003-3887-5729</orcidid><oa>free_for_read</oa></addata></record>
fulltext fulltext
identifier ISSN: 1755-098X
ispartof Molecular ecology resources, 2020-03, Vol.20 (2), p.360-370
issn 1755-098X
1755-0998
language eng
recordid cdi_proquest_miscellaneous_2310716050
source Wiley Online Library Journals Frontfile Complete
subjects Accuracy
Assemblies
Assembly
Computer programs
Computer simulation
Deoxyribonucleic acid
DNA
DNA sequencing
Endonuclease
Fragments
GBS
Gene deletion
Genomes
genomics
indels
Insertion
Mutation
Mutation rates
Nucleotide sequence
paralogs
Parameters
population
RAD
reference genome
Simulation
Single-nucleotide polymorphism
Software
Stacks
title Accuracy of de novo assembly of DNA sequences from double‐digest libraries varies substantially among software
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-02-02T05%3A59%3A17IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_cross&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Accuracy%20of%20de%20novo%20assembly%20of%20DNA%20sequences%20from%20double%E2%80%90digest%20libraries%20varies%20substantially%20among%20software&rft.jtitle=Molecular%20ecology%20resources&rft.au=LaCava,%20Melanie%20E.%20F.&rft.date=2020-03&rft.volume=20&rft.issue=2&rft.spage=360&rft.epage=370&rft.pages=360-370&rft.issn=1755-098X&rft.eissn=1755-0998&rft_id=info:doi/10.1111/1755-0998.13108&rft_dat=%3Cproquest_cross%3E2366246014%3C/proquest_cross%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=2366246014&rft_id=info:pmid/31665547&rfr_iscdi=true