The SAMBA tool uses long reads to improve the contiguity of genome assemblies
Third-generation sequencing technologies can generate very long reads with relatively high error rates. The lengths of the reads, which sometimes exceed one million bases, make them invaluable for resolving complex repeats that cannot be assembled using shorter reads. Many high-quality genome assemb...
Published in: | PLoS computational biology 2022-02, Vol.18 (2), p.e1009860-e1009860 |
---|---|
Main authors: | Zimin, Aleksey V; Salzberg, Steven L |
Format: | Article |
Language: | eng |
Subjects: | |
Online access: | Full text |
container_end_page | e1009860 |
---|---|
container_issue | 2 |
container_start_page | e1009860 |
container_title | PLoS computational biology |
container_volume | 18 |
creator | Zimin, Aleksey V; Salzberg, Steven L |
description | Third-generation sequencing technologies can generate very long reads with relatively high error rates. The lengths of the reads, which sometimes exceed one million bases, make them invaluable for resolving complex repeats that cannot be assembled using shorter reads. Many high-quality genome assemblies have already been produced, curated, and annotated using the previous generation of sequencing data, and full re-assembly of these genomes with long reads is not always practical or cost-effective. One strategy to upgrade existing assemblies is to generate additional coverage using long-read data, and add that to the previously assembled contigs. SAMBA is a tool that is designed to scaffold and gap-fill existing genome assemblies with additional long-read data, resulting in substantially greater contiguity. SAMBA is the only tool of its kind that also computes and fills in the sequence for all spanned gaps in the scaffolds, yielding much longer contigs. Here we compare SAMBA to several similar tools capable of re-scaffolding assemblies using long-read data, and we show that SAMBA yields better contiguity and introduces fewer errors than competing methods. SAMBA is open-source software that is distributed at https://github.com/alekseyzimin/masurca. |
doi_str_mv | 10.1371/journal.pcbi.1009860 |
format | Article |
fullrecord | <record><control><sourceid>gale_plos_</sourceid><recordid>TN_cdi_plos_journals_2640120387</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><galeid>A695461094</galeid><doaj_id>oai_doaj_org_article_e59d3dc3f38340868488f1b80981ac2d</doaj_id><sourcerecordid>A695461094</sourcerecordid><originalsourceid>FETCH-LOGICAL-c633t-948bb8e87b7939a7fea9f80e94efb22397ad6570fa377c0aa102dfb977cbf18a3</originalsourceid><addsrcrecordid>eNqVkl1vFCEUhidGY2v1Hxgl8UYvdoVhZoAbk7XxY5NWE1uvCTCHKRtm2MJMY_-9rDttusYbwwVweM57OB9F8ZLgJaGMvN-EKQ7KL7dGuyXBWPAGPyqOSV3TBaM1f_zgfFQ8S2mDcT6K5mlxRGtSYkLEcXF-eQXoYnX-cYXGEDyaEiTkw9ChCKpN2Yhcv43hBtCYSROG0XWTG29RsKiDIfSAVErQa-8gPS-eWOUTvJj3k-Ln50-Xp18XZ9-_rE9XZwvTUDouRMW15sCZZoIKxSwoYTkGUYHVZUkFU21TM2wVZcxgpQguW6tFvmhLuKInxeu97taHJOdKJFk2Fc6JUc4ysd4TbVAbuY2uV_FWBuXkH0OInVRxdMaDhFq0tDXUUk4rzBtecW6J5rmiRJmyzVof5miT7qE1MIxR-QPRw5fBXcku3EjOK1FjngXezgIxXE-QRtm7ZMB7NUCYdv8um9zASpQZffMX-u_slnuqUzkBN9iQ45q8Wuhd7hFYl-2rRtRVQ7Jwdnh34LDrI_waOzWlJNcXP_6D_XbIVnvWxJBSBHtfFYLlbkzvvi93YyrnMc1urx5W9N7pbi7pb11H4xc</addsrcrecordid><sourcetype>Open Website</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>2640120387</pqid></control><display><type>article</type><title>The SAMBA tool uses long reads to improve the contiguity of genome assemblies</title><source>Public Library of Science (PLoS) Journals Open Access</source><source>MEDLINE</source><source>DOAJ Directory of Open Access Journals</source><source>Elektronische Zeitschriftenbibliothek - Frei zugängliche E-Journals</source><source>PubMed Central</source><creator>Zimin, Aleksey V ; Salzberg, Steven L</creator><contributor>Shao, Mingfu</contributor><creatorcontrib>Zimin, Aleksey V ; Salzberg, Steven L ; Shao, Mingfu</creatorcontrib><description>Third-generation sequencing technologies can generate very long reads with relatively high error rates. 
The lengths of the reads, which sometimes exceed one million bases, make them invaluable for resolving complex repeats that cannot be assembled using shorter reads. Many high-quality genome assemblies have already been produced, curated, and annotated using the previous generation of sequencing data, and full re-assembly of these genomes with long reads is not always practical or cost-effective. One strategy to upgrade existing assemblies is to generate additional coverage using long-read data, and add that to the previously assembled contigs. SAMBA is a tool that is designed to scaffold and gap-fill existing genome assemblies with additional long-read data, resulting in substantially greater contiguity. SAMBA is the only tool of its kind that also computes and fills in the sequence for all spanned gaps in the scaffolds, yielding much longer contigs. Here we compare SAMBA to several similar tools capable of re-scaffolding assemblies using long-read data, and we show that SAMBA yields better contiguity and introduces fewer errors than competing methods. 
SAMBA is open-source software that is distributed at https://github.com/alekseyzimin/masurca.</description><identifier>ISSN: 1553-7358</identifier><identifier>ISSN: 1553-734X</identifier><identifier>EISSN: 1553-7358</identifier><identifier>DOI: 10.1371/journal.pcbi.1009860</identifier><identifier>PMID: 35120119</identifier><language>eng</language><publisher>United States: Public Library of Science</publisher><subject>Analysis ; Assemblies ; Biology and Life Sciences ; Computer and Information Sciences ; Engineering and Technology ; Gene sequencing ; Genomes ; Genomics ; High-Throughput Nucleotide Sequencing - methods ; Open source software ; Public software ; Research and Analysis Methods ; Scaffolding ; Scaffolds ; Science Policy ; Software</subject><ispartof>PLoS computational biology, 2022-02, Vol.18 (2), p.e1009860-e1009860</ispartof><rights>COPYRIGHT 2022 Public Library of Science</rights><rights>2022 Zimin, Salzberg. This is an open access article distributed under the terms of the Creative Commons Attribution License: http://creativecommons.org/licenses/by/4.0/ (the “License”), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. 
Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.</rights><rights>2022 Zimin, Salzberg 2022 Zimin, Salzberg</rights><lds50>peer_reviewed</lds50><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed><citedby>FETCH-LOGICAL-c633t-948bb8e87b7939a7fea9f80e94efb22397ad6570fa377c0aa102dfb977cbf18a3</citedby><cites>FETCH-LOGICAL-c633t-948bb8e87b7939a7fea9f80e94efb22397ad6570fa377c0aa102dfb977cbf18a3</cites><orcidid>0000-0002-8859-7432 ; 0000-0001-5091-3092</orcidid></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><linktopdf>$$Uhttps://www.ncbi.nlm.nih.gov/pmc/articles/PMC8849508/pdf/$$EPDF$$P50$$Gpubmedcentral$$Hfree_for_read</linktopdf><linktohtml>$$Uhttps://www.ncbi.nlm.nih.gov/pmc/articles/PMC8849508/$$EHTML$$P50$$Gpubmedcentral$$Hfree_for_read</linktohtml><link.rule.ids>230,315,729,782,786,866,887,2106,2932,23875,27933,27934,53800,53802</link.rule.ids><backlink>$$Uhttps://www.ncbi.nlm.nih.gov/pubmed/35120119$$D View this record in MEDLINE/PubMed$$Hfree_for_read</backlink></links><search><contributor>Shao, Mingfu</contributor><creatorcontrib>Zimin, Aleksey V</creatorcontrib><creatorcontrib>Salzberg, Steven L</creatorcontrib><title>The SAMBA tool uses long reads to improve the contiguity of genome assemblies</title><title>PLoS computational biology</title><addtitle>PLoS Comput Biol</addtitle><description>Third-generation sequencing technologies can generate very long reads with relatively high error rates. The lengths of the reads, which sometimes exceed one million bases, make them invaluable for resolving complex repeats that cannot be assembled using shorter reads. 
Many high-quality genome assemblies have already been produced, curated, and annotated using the previous generation of sequencing data, and full re-assembly of these genomes with long reads is not always practical or cost-effective. One strategy to upgrade existing assemblies is to generate additional coverage using long-read data, and add that to the previously assembled contigs. SAMBA is a tool that is designed to scaffold and gap-fill existing genome assemblies with additional long-read data, resulting in substantially greater contiguity. SAMBA is the only tool of its kind that also computes and fills in the sequence for all spanned gaps in the scaffolds, yielding much longer contigs. Here we compare SAMBA to several similar tools capable of re-scaffolding assemblies using long-read data, and we show that SAMBA yields better contiguity and introduces fewer errors than competing methods. SAMBA is open-source software that is distributed at https://github.com/alekseyzimin/masurca.</description><subject>Analysis</subject><subject>Assemblies</subject><subject>Biology and Life Sciences</subject><subject>Computer and Information Sciences</subject><subject>Engineering and Technology</subject><subject>Gene sequencing</subject><subject>Genomes</subject><subject>Genomics</subject><subject>High-Throughput Nucleotide Sequencing - methods</subject><subject>Open source software</subject><subject>Public software</subject><subject>Research and Analysis Methods</subject><subject>Scaffolding</subject><subject>Scaffolds</subject><subject>Science 
Policy</subject><subject>Software</subject><issn>1553-7358</issn><issn>1553-734X</issn><issn>1553-7358</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2022</creationdate><recordtype>article</recordtype><sourceid>EIF</sourceid><sourceid>ABUWG</sourceid><sourceid>AFKRA</sourceid><sourceid>AZQEC</sourceid><sourceid>BENPR</sourceid><sourceid>CCPQU</sourceid><sourceid>DWQXO</sourceid><sourceid>GNUQQ</sourceid><sourceid>DOA</sourceid><recordid>eNqVkl1vFCEUhidGY2v1Hxgl8UYvdoVhZoAbk7XxY5NWE1uvCTCHKRtm2MJMY_-9rDttusYbwwVweM57OB9F8ZLgJaGMvN-EKQ7KL7dGuyXBWPAGPyqOSV3TBaM1f_zgfFQ8S2mDcT6K5mlxRGtSYkLEcXF-eQXoYnX-cYXGEDyaEiTkw9ChCKpN2Yhcv43hBtCYSROG0XWTG29RsKiDIfSAVErQa-8gPS-eWOUTvJj3k-Ln50-Xp18XZ9-_rE9XZwvTUDouRMW15sCZZoIKxSwoYTkGUYHVZUkFU21TM2wVZcxgpQguW6tFvmhLuKInxeu97taHJOdKJFk2Fc6JUc4ysd4TbVAbuY2uV_FWBuXkH0OInVRxdMaDhFq0tDXUUk4rzBtecW6J5rmiRJmyzVof5miT7qE1MIxR-QPRw5fBXcku3EjOK1FjngXezgIxXE-QRtm7ZMB7NUCYdv8um9zASpQZffMX-u_slnuqUzkBN9iQ45q8Wuhd7hFYl-2rRtRVQ7Jwdnh34LDrI_waOzWlJNcXP_6D_XbIVnvWxJBSBHtfFYLlbkzvvi93YyrnMc1urx5W9N7pbi7pb11H4xc</recordid><startdate>20220201</startdate><enddate>20220201</enddate><creator>Zimin, Aleksey V</creator><creator>Salzberg, Steven L</creator><general>Public Library of Science</general><general>Public Library of Science 
(PLoS)</general><scope>CGR</scope><scope>CUY</scope><scope>CVF</scope><scope>ECM</scope><scope>EIF</scope><scope>NPM</scope><scope>AAYXX</scope><scope>CITATION</scope><scope>ISN</scope><scope>ISR</scope><scope>3V.</scope><scope>7QO</scope><scope>7QP</scope><scope>7TK</scope><scope>7TM</scope><scope>7X7</scope><scope>7XB</scope><scope>88E</scope><scope>8AL</scope><scope>8FD</scope><scope>8FE</scope><scope>8FG</scope><scope>8FH</scope><scope>8FI</scope><scope>8FJ</scope><scope>8FK</scope><scope>ABUWG</scope><scope>AFKRA</scope><scope>ARAPS</scope><scope>AZQEC</scope><scope>BBNVY</scope><scope>BENPR</scope><scope>BGLVJ</scope><scope>BHPHI</scope><scope>CCPQU</scope><scope>DWQXO</scope><scope>FR3</scope><scope>FYUFA</scope><scope>GHDGH</scope><scope>GNUQQ</scope><scope>HCIFZ</scope><scope>JQ2</scope><scope>K7-</scope><scope>K9.</scope><scope>LK8</scope><scope>M0N</scope><scope>M0S</scope><scope>M1P</scope><scope>M7P</scope><scope>P5Z</scope><scope>P62</scope><scope>P64</scope><scope>PIMPY</scope><scope>PQEST</scope><scope>PQQKQ</scope><scope>PQUKI</scope><scope>PRINS</scope><scope>Q9U</scope><scope>RC3</scope><scope>7X8</scope><scope>5PM</scope><scope>DOA</scope><orcidid>https://orcid.org/0000-0002-8859-7432</orcidid><orcidid>https://orcid.org/0000-0001-5091-3092</orcidid></search><sort><creationdate>20220201</creationdate><title>The SAMBA tool uses long reads to improve the contiguity of genome assemblies</title><author>Zimin, Aleksey V ; Salzberg, Steven L</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-c633t-948bb8e87b7939a7fea9f80e94efb22397ad6570fa377c0aa102dfb977cbf18a3</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2022</creationdate><topic>Analysis</topic><topic>Assemblies</topic><topic>Biology and Life Sciences</topic><topic>Computer and Information Sciences</topic><topic>Engineering and Technology</topic><topic>Gene 
sequencing</topic><topic>Genomes</topic><topic>Genomics</topic><topic>High-Throughput Nucleotide Sequencing - methods</topic><topic>Open source software</topic><topic>Public software</topic><topic>Research and Analysis Methods</topic><topic>Scaffolding</topic><topic>Scaffolds</topic><topic>Science Policy</topic><topic>Software</topic><toplevel>peer_reviewed</toplevel><toplevel>online_resources</toplevel><creatorcontrib>Zimin, Aleksey V</creatorcontrib><creatorcontrib>Salzberg, Steven L</creatorcontrib><collection>Medline</collection><collection>MEDLINE</collection><collection>MEDLINE (Ovid)</collection><collection>MEDLINE</collection><collection>MEDLINE</collection><collection>PubMed</collection><collection>CrossRef</collection><collection>Gale In Context: Canada</collection><collection>Gale In Context: Science</collection><collection>ProQuest Central (Corporate)</collection><collection>Biotechnology Research Abstracts</collection><collection>Calcium & Calcified Tissue Abstracts</collection><collection>Neurosciences Abstracts</collection><collection>Nucleic Acids Abstracts</collection><collection>Health & Medical Collection</collection><collection>ProQuest Central (purchase pre-March 2016)</collection><collection>Medical Database (Alumni Edition)</collection><collection>Computing Database (Alumni Edition)</collection><collection>Technology Research Database</collection><collection>ProQuest SciTech Collection</collection><collection>ProQuest Technology Collection</collection><collection>ProQuest Natural Science Collection</collection><collection>Hospital Premium Collection</collection><collection>Hospital Premium Collection (Alumni Edition)</collection><collection>ProQuest Central (Alumni) (purchase pre-March 2016)</collection><collection>ProQuest Central (Alumni Edition)</collection><collection>ProQuest Central UK/Ireland</collection><collection>Advanced Technologies & Aerospace Collection</collection><collection>ProQuest Central 
Essentials</collection><collection>Biological Science Collection</collection><collection>ProQuest Central</collection><collection>Technology Collection</collection><collection>Natural Science Collection</collection><collection>ProQuest One Community College</collection><collection>ProQuest Central Korea</collection><collection>Engineering Research Database</collection><collection>Health Research Premium Collection</collection><collection>Health Research Premium Collection (Alumni)</collection><collection>ProQuest Central Student</collection><collection>SciTech Premium Collection</collection><collection>ProQuest Computer Science Collection</collection><collection>Computer Science Database</collection><collection>ProQuest Health & Medical Complete (Alumni)</collection><collection>ProQuest Biological Science Collection</collection><collection>Computing Database</collection><collection>Health & Medical Collection (Alumni Edition)</collection><collection>Medical Database</collection><collection>Biological Science Database</collection><collection>Advanced Technologies & Aerospace Database</collection><collection>ProQuest Advanced Technologies & Aerospace Collection</collection><collection>Biotechnology and BioEngineering Abstracts</collection><collection>Access via ProQuest (Open Access)</collection><collection>ProQuest One Academic Eastern Edition (DO NOT USE)</collection><collection>ProQuest One Academic</collection><collection>ProQuest One Academic UKI Edition</collection><collection>ProQuest Central China</collection><collection>ProQuest Central Basic</collection><collection>Genetics Abstracts</collection><collection>MEDLINE - Academic</collection><collection>PubMed Central (Full Participant titles)</collection><collection>DOAJ Directory of Open Access Journals</collection><jtitle>PLoS computational biology</jtitle></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>Zimin, Aleksey V</au><au>Salzberg, 
Steven L</au><au>Shao, Mingfu</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>The SAMBA tool uses long reads to improve the contiguity of genome assemblies</atitle><jtitle>PLoS computational biology</jtitle><addtitle>PLoS Comput Biol</addtitle><date>2022-02-01</date><risdate>2022</risdate><volume>18</volume><issue>2</issue><spage>e1009860</spage><epage>e1009860</epage><pages>e1009860-e1009860</pages><issn>1553-7358</issn><issn>1553-734X</issn><eissn>1553-7358</eissn><abstract>Third-generation sequencing technologies can generate very long reads with relatively high error rates. The lengths of the reads, which sometimes exceed one million bases, make them invaluable for resolving complex repeats that cannot be assembled using shorter reads. Many high-quality genome assemblies have already been produced, curated, and annotated using the previous generation of sequencing data, and full re-assembly of these genomes with long reads is not always practical or cost-effective. One strategy to upgrade existing assemblies is to generate additional coverage using long-read data, and add that to the previously assembled contigs. SAMBA is a tool that is designed to scaffold and gap-fill existing genome assemblies with additional long-read data, resulting in substantially greater contiguity. SAMBA is the only tool of its kind that also computes and fills in the sequence for all spanned gaps in the scaffolds, yielding much longer contigs. Here we compare SAMBA to several similar tools capable of re-scaffolding assemblies using long-read data, and we show that SAMBA yields better contiguity and introduces fewer errors than competing methods. 
SAMBA is open-source software that is distributed at https://github.com/alekseyzimin/masurca.</abstract><cop>United States</cop><pub>Public Library of Science</pub><pmid>35120119</pmid><doi>10.1371/journal.pcbi.1009860</doi><orcidid>https://orcid.org/0000-0002-8859-7432</orcidid><orcidid>https://orcid.org/0000-0001-5091-3092</orcidid><oa>free_for_read</oa></addata></record> |
fulltext | fulltext |
identifier | ISSN: 1553-7358 |
ispartof | PLoS computational biology, 2022-02, Vol.18 (2), p.e1009860-e1009860 |
issn | 1553-7358; 1553-734X |
language | eng |
recordid | cdi_plos_journals_2640120387 |
source | Public Library of Science (PLoS) Journals Open Access; MEDLINE; DOAJ Directory of Open Access Journals; Elektronische Zeitschriftenbibliothek - Frei zugängliche E-Journals; PubMed Central |
subjects | Analysis; Assemblies; Biology and Life Sciences; Computer and Information Sciences; Engineering and Technology; Gene sequencing; Genomes; Genomics; High-Throughput Nucleotide Sequencing - methods; Open source software; Public software; Research and Analysis Methods; Scaffolding; Scaffolds; Science Policy; Software |
title | The SAMBA tool uses long reads to improve the contiguity of genome assemblies |
url | https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2024-12-01T10%3A02%3A31IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-gale_plos_&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=The%20SAMBA%20tool%20uses%20long%20reads%20to%20improve%20the%20contiguity%20of%20genome%20assemblies&rft.jtitle=PLoS%20computational%20biology&rft.au=Zimin,%20Aleksey%20V&rft.date=2022-02-01&rft.volume=18&rft.issue=2&rft.spage=e1009860&rft.epage=e1009860&rft.pages=e1009860-e1009860&rft.issn=1553-7358&rft.eissn=1553-7358&rft_id=info:doi/10.1371/journal.pcbi.1009860&rft_dat=%3Cgale_plos_%3EA695461094%3C/gale_plos_%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=2640120387&rft_id=info:pmid/35120119&rft_galeid=A695461094&rft_doaj_id=oai_doaj_org_article_e59d3dc3f38340868488f1b80981ac2d&rfr_iscdi=true |