Estimating repeat spectra and genome length from low-coverage genome skims with RESPECT

The cost of sequencing the genome is dropping at a much faster rate compared to assembling and finishing the genome. The use of lightly sampled genomes (genome-skims) could be transformative for genomic ecology, and results using k-mers have shown the advantage of this approach in identification and...

Bibliographic details
Published in:	PLoS computational biology 2021-11, Vol.17 (11), p.e1009449-e1009449
Main authors:	Sarmashghi, Shahab; Balaban, Metin; Rachtman, Eleonora; Touri, Behrouz; Mirarab, Siavash; Bafna, Vineet
Format:	Article
Language:	eng
Subjects:	Algorithms; Analysis; Animals; Biodiversity; Biology and Life Sciences; Computational Biology; Computer Simulation; Constraint modelling; Contaminants; Databases, Genetic - statistics & numerical data; DNA sequencing; Ecological effects; Empirical analysis; Engineering and Technology; Errors; Estimates; Estimation; Gene sequencing; Genome; Genomes; Genomics; Genomics - statistics & numerical data; Humans; Identification and classification; Invertebrates; Invertebrates - classification; Invertebrates - genetics; Least-Squares Analysis; Linear Models; Linear programming; Mammals - classification; Mammals - genetics; Methods; Models, Genetic; Nucleotide sequencing; Optimization; Parameters; Phylogenetics; Phylogeny; Physical Sciences; Plants - classification; Plants - genetics; Repetitive Sequences, Nucleic Acid; Research and Analysis Methods; Software; Spectra; Taxonomy; Vertebrates - classification; Vertebrates - genetics
Online access:	Full text

container_end_page	e1009449
container_issue	11
container_start_page	e1009449
container_title	PLoS computational biology
container_volume	17
creator	Sarmashghi, Shahab; Balaban, Metin; Rachtman, Eleonora; Touri, Behrouz; Mirarab, Siavash; Bafna, Vineet
description	The cost of sequencing the genome is dropping at a much faster rate compared to assembling and finishing the genome. The use of lightly sampled genomes (genome-skims) could be transformative for genomic ecology, and results using k-mers have shown the advantage of this approach in identification and phylogenetic placement of eukaryotic species. Here, we revisit the basic question of estimating genomic parameters such as genome length, coverage, and repeat structure, focusing specifically on estimating the k-mer repeat spectrum. We show using a mix of theoretical and empirical analysis that there are fundamental limitations to estimating the k-mer spectra due to ill-conditioned systems, and that has implications for other genomic parameters. We get around this problem using a novel constrained optimization approach (Spline Linear Programming), where the constraints are learned empirically. On reads simulated at 1X coverage from 66 genomes, our method, REPeat SPECTra Estimation (RESPECT), had 2.2% error in length estimation compared to 27% error previously achieved. In shotgun sequenced read samples with contaminants, RESPECT length estimates had median error 4%, in contrast to other methods that had median error 80%. Together, the results suggest that low-pass genomic sequencing can yield reliable estimates of the length and repeat content of the genome. The RESPECT software will be publicly available at https://github.com/shahab-sarmashghi/RESPECT.git.
doi_str_mv	10.1371/journal.pcbi.1009449
format	Article
fullrecord	<record><control><sourceid>gale_plos_</sourceid><recordid>TN_cdi_plos_journals_2610945737</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><galeid>A684564392</galeid><doaj_id>oai_doaj_org_article_1b2807756750402fbe091bd2503aa6e9</doaj_id><sourcerecordid>A684564392</sourcerecordid><originalsourceid>FETCH-LOGICAL-c633t-245f8fe82bbf950710300d1568ff999fba948d96634c84e08ff747492bd188253</originalsourceid><addsrcrecordid>eNqVkk1vEzEQhlcIREvhHyBYiQscEuz19wWpilKIVAFqizhaXu9467C7Dvampf8ehyRVg7ggHzyaeebr1RTFS4ymmAj8fhnWcTDddGVrP8UIKUrVo-IYM0YmgjD5-IF9VDxLaYlQNhV_WhwRKiSiXB4X3-dp9L0Z_dCWEVZgxjKtwI7RlGZoyhaG0EPZwdCO16WLoS-7cDux4QaiaWEfTz98n8pbn5mL-eXX-ezqefHEmS7Bi91_Unw7m1_NPk3Ov3xczE7PJ5YTMk4qypx0IKu6doohgRFBqMGMS-eUUq42ispGcU6olRRQdgsqqKrqBktZMXJSvN7WXXUh6Z0mSVccZ0GYICITiy3RBLPUq5i3jXc6GK__OEJstYmjtx1oXFcSCcG4YIiiytWAFK6biiFiDAeVa33YdVvXPTQWhixUd1D0MDL4a92GGy15pYjaDPN2VyCGn2tIo-59stB1ZoCwznMzlSdgUm56vfkL_fd20y3VmryAH1zIfW1-DfTehgGcz_5TLinjlKgqJ7w7SMjMCL_G1qxT0ovLi_9gPx-ydMvaGFKK4O5VwUhvDnY_vt4crN4dbE579VDR-6T9hZLfkWrloQ</addsrcrecordid><sourcetype>Open Website</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>2610945737</pqid></control><display><type>article</type><title>Estimating repeat spectra and genome length from low-coverage genome skims with RESPECT</title><source>MEDLINE</source><source>DOAJ Directory of Open Access Journals</source><source>Elektronische Zeitschriftenbibliothek - Frei zugängliche E-Journals</source><source>PubMed Central</source><source>Public Library of Science (PLoS)</source><creator>Sarmashghi, Shahab ; Balaban, Metin ; Rachtman, Eleonora ; Touri, Behrouz ; Mirarab, Siavash ; Bafna, Vineet</creator><creatorcontrib>Sarmashghi, Shahab ; Balaban, Metin ; Rachtman, Eleonora ; Touri, Behrouz ; Mirarab, Siavash ; Bafna, Vineet</creatorcontrib><description><![CDATA[The cost of sequencing the genome is dropping at a much faster rate compared to assembling and finishing the genome. 
The use of lightly sampled genomes (genome-skims) could be transformative for genomic ecology, and results using k-mers have shown the advantage of this approach in identification and phylogenetic placement of eukaryotic species. Here, we revisit the basic question of estimating genomic parameters such as genome length, coverage, and repeat structure, focusing specifically on estimating the k-mer repeat spectrum. We show using a mix of theoretical and empirical analysis that there are fundamental limitations to estimating the k-mer spectra due to ill-conditioned systems, and that has implications for other genomic parameters. We get around this problem using a novel constrained optimization approach (Spline Linear Programming), where the constraints are learned empirically. On reads simulated at 1X coverage from 66 genomes, our method, REPeat SPECTra Estimation (RESPECT), had 2.2% error in length estimation compared to 27% error previously achieved. In shotgun sequenced read samples with contaminants, RESPECT length estimates had median error 4%, in contrast to other methods that had median error 80%. Together, the results suggest that low-pass genomic sequencing can yield reliable estimates of the length and repeat content of the genome. 
The RESPECT software will be publicly available at https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_shahab-2Dsarmashghi_RESPECT.git&d=DwIGAw&c=-35OiAkTchMrZOngvJPOeA&r=ZozViWvD1E8PorCkfwYKYQMVKFoEcqLFm4Tg49XnPcA&m=f-xS8GMHKckknkc7Xpp8FJYw_ltUwz5frOw1a5pJ81EpdTOK8xhbYmrN4ZxniM96&s=717o8hLR1JmHFpRPSWG6xdUQTikyUjicjkipjFsKG4w&e=.]]></description><identifier>ISSN: 1553-7358</identifier><identifier>ISSN: 1553-734X</identifier><identifier>EISSN: 1553-7358</identifier><identifier>DOI: 10.1371/journal.pcbi.1009449</identifier><identifier>PMID: 34780468</identifier><language>eng</language><publisher>United States: Public Library of Science</publisher><subject>Algorithms ; Analysis ; Animals ; Biodiversity ; Biology and Life Sciences ; Computational Biology ; Computer Simulation ; Constraint modelling ; Contaminants ; Databases, Genetic - statistics & numerical data ; DNA sequencing ; Ecological effects ; Empirical analysis ; Engineering and Technology ; Errors ; Estimates ; Estimation ; Gene sequencing ; Genome ; Genomes ; Genomics ; Genomics - statistics & numerical data ; Humans ; Identification and classification ; Invertebrates ; Invertebrates - classification ; Invertebrates - genetics ; Least-Squares Analysis ; Linear Models ; Linear programming ; Mammals - classification ; Mammals - genetics ; Methods ; Models, Genetic ; Nucleotide sequencing ; Optimization ; Parameters ; Phylogenetics ; Phylogeny ; Physical Sciences ; Plants - classification ; Plants - genetics ; Repetitive Sequences, Nucleic Acid ; Research and Analysis Methods ; Software ; Spectra ; Taxonomy ; Vertebrates - classification ; Vertebrates - genetics</subject><ispartof>PLoS computational biology, 2021-11, Vol.17 (11), p.e1009449-e1009449</ispartof><rights>COPYRIGHT 2021 Public Library of Science</rights><rights>2021 Sarmashghi et al. 
This is an open access article distributed under the terms of the Creative Commons Attribution License: http://creativecommons.org/licenses/by/4.0/ (the “License”), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.</rights><rights>2021 Sarmashghi et al 2021 Sarmashghi et al</rights><lds50>peer_reviewed</lds50><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed><citedby>FETCH-LOGICAL-c633t-245f8fe82bbf950710300d1568ff999fba948d96634c84e08ff747492bd188253</citedby><cites>FETCH-LOGICAL-c633t-245f8fe82bbf950710300d1568ff999fba948d96634c84e08ff747492bd188253</cites><orcidid>0000-0003-0564-1643 ; 0000-0002-6947-5915 ; 0000-0003-4724-7329 ; 0000-0002-6104-5750 ; 0000-0001-5410-1518 ; 0000-0002-5810-6241</orcidid></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><linktopdf>$$Uhttps://www.ncbi.nlm.nih.gov/pmc/articles/PMC8629397/pdf/$$EPDF$$P50$$Gpubmedcentral$$Hfree_for_read</linktopdf><linktohtml>$$Uhttps://www.ncbi.nlm.nih.gov/pmc/articles/PMC8629397/$$EHTML$$P50$$Gpubmedcentral$$Hfree_for_read</linktohtml><link.rule.ids>230,314,723,776,780,860,881,2096,2915,23845,27901,27902,53766,53768,79343,79344</link.rule.ids><backlink>$$Uhttps://www.ncbi.nlm.nih.gov/pubmed/34780468$$D View this record in MEDLINE/PubMed$$Hfree_for_read</backlink></links><search><creatorcontrib>Sarmashghi, Shahab</creatorcontrib><creatorcontrib>Balaban, Metin</creatorcontrib><creatorcontrib>Rachtman, Eleonora</creatorcontrib><creatorcontrib>Touri, Behrouz</creatorcontrib><creatorcontrib>Mirarab, Siavash</creatorcontrib><creatorcontrib>Bafna, Vineet</creatorcontrib><title>Estimating repeat spectra and genome length from low-coverage genome skims with 
RESPECT</title><title>PLoS computational biology</title><addtitle>PLoS Comput Biol</addtitle><description><![CDATA[The cost of sequencing the genome is dropping at a much faster rate compared to assembling and finishing the genome. The use of lightly sampled genomes (genome-skims) could be transformative for genomic ecology, and results using k-mers have shown the advantage of this approach in identification and phylogenetic placement of eukaryotic species. Here, we revisit the basic question of estimating genomic parameters such as genome length, coverage, and repeat structure, focusing specifically on estimating the k-mer repeat spectrum. We show using a mix of theoretical and empirical analysis that there are fundamental limitations to estimating the k-mer spectra due to ill-conditioned systems, and that has implications for other genomic parameters. We get around this problem using a novel constrained optimization approach (Spline Linear Programming), where the constraints are learned empirically. On reads simulated at 1X coverage from 66 genomes, our method, REPeat SPECTra Estimation (RESPECT), had 2.2% error in length estimation compared to 27% error previously achieved. In shotgun sequenced read samples with contaminants, RESPECT length estimates had median error 4%, in contrast to other methods that had median error 80%. Together, the results suggest that low-pass genomic sequencing can yield reliable estimates of the length and repeat content of the genome. 
The RESPECT software will be publicly available at https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_shahab-2Dsarmashghi_RESPECT.git&d=DwIGAw&c=-35OiAkTchMrZOngvJPOeA&r=ZozViWvD1E8PorCkfwYKYQMVKFoEcqLFm4Tg49XnPcA&m=f-xS8GMHKckknkc7Xpp8FJYw_ltUwz5frOw1a5pJ81EpdTOK8xhbYmrN4ZxniM96&s=717o8hLR1JmHFpRPSWG6xdUQTikyUjicjkipjFsKG4w&e=.]]></description><subject>Algorithms</subject><subject>Analysis</subject><subject>Animals</subject><subject>Biodiversity</subject><subject>Biology and Life Sciences</subject><subject>Computational Biology</subject><subject>Computer Simulation</subject><subject>Constraint modelling</subject><subject>Contaminants</subject><subject>Databases, Genetic - statistics & numerical data</subject><subject>DNA sequencing</subject><subject>Ecological effects</subject><subject>Empirical analysis</subject><subject>Engineering and Technology</subject><subject>Errors</subject><subject>Estimates</subject><subject>Estimation</subject><subject>Gene sequencing</subject><subject>Genome</subject><subject>Genomes</subject><subject>Genomics</subject><subject>Genomics - statistics & numerical data</subject><subject>Humans</subject><subject>Identification and classification</subject><subject>Invertebrates</subject><subject>Invertebrates - classification</subject><subject>Invertebrates - genetics</subject><subject>Least-Squares Analysis</subject><subject>Linear Models</subject><subject>Linear programming</subject><subject>Mammals - classification</subject><subject>Mammals - genetics</subject><subject>Methods</subject><subject>Models, Genetic</subject><subject>Nucleotide sequencing</subject><subject>Optimization</subject><subject>Parameters</subject><subject>Phylogenetics</subject><subject>Phylogeny</subject><subject>Physical Sciences</subject><subject>Plants - classification</subject><subject>Plants - genetics</subject><subject>Repetitive Sequences, Nucleic Acid</subject><subject>Research and Analysis 
Methods</subject><subject>Software</subject><subject>Spectra</subject><subject>Taxonomy</subject><subject>Vertebrates - classification</subject><subject>Vertebrates - genetics</subject><issn>1553-7358</issn><issn>1553-734X</issn><issn>1553-7358</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2021</creationdate><recordtype>article</recordtype><sourceid>EIF</sourceid><sourceid>BENPR</sourceid><sourceid>DOA</sourceid><recordid>eNqVkk1vEzEQhlcIREvhHyBYiQscEuz19wWpilKIVAFqizhaXu9467C7Dvampf8ehyRVg7ggHzyaeebr1RTFS4ymmAj8fhnWcTDddGVrP8UIKUrVo-IYM0YmgjD5-IF9VDxLaYlQNhV_WhwRKiSiXB4X3-dp9L0Z_dCWEVZgxjKtwI7RlGZoyhaG0EPZwdCO16WLoS-7cDux4QaiaWEfTz98n8pbn5mL-eXX-ezqefHEmS7Bi91_Unw7m1_NPk3Ov3xczE7PJ5YTMk4qypx0IKu6doohgRFBqMGMS-eUUq42ispGcU6olRRQdgsqqKrqBktZMXJSvN7WXXUh6Z0mSVccZ0GYICITiy3RBLPUq5i3jXc6GK__OEJstYmjtx1oXFcSCcG4YIiiytWAFK6biiFiDAeVa33YdVvXPTQWhixUd1D0MDL4a92GGy15pYjaDPN2VyCGn2tIo-59stB1ZoCwznMzlSdgUm56vfkL_fd20y3VmryAH1zIfW1-DfTehgGcz_5TLinjlKgqJ7w7SMjMCL_G1qxT0ovLi_9gPx-ydMvaGFKK4O5VwUhvDnY_vt4crN4dbE579VDR-6T9hZLfkWrloQ</recordid><startdate>20211101</startdate><enddate>20211101</enddate><creator>Sarmashghi, Shahab</creator><creator>Balaban, Metin</creator><creator>Rachtman, Eleonora</creator><creator>Touri, Behrouz</creator><creator>Mirarab, Siavash</creator><creator>Bafna, Vineet</creator><general>Public Library of Science</general><general>Public Library of Science 
(PLoS)</general><scope>CGR</scope><scope>CUY</scope><scope>CVF</scope><scope>ECM</scope><scope>EIF</scope><scope>NPM</scope><scope>AAYXX</scope><scope>CITATION</scope><scope>ISN</scope><scope>ISR</scope><scope>3V.</scope><scope>7QO</scope><scope>7QP</scope><scope>7TK</scope><scope>7TM</scope><scope>7X7</scope><scope>7XB</scope><scope>88E</scope><scope>8AL</scope><scope>8FD</scope><scope>8FE</scope><scope>8FG</scope><scope>8FH</scope><scope>8FI</scope><scope>8FJ</scope><scope>8FK</scope><scope>ABUWG</scope><scope>AEUYN</scope><scope>AFKRA</scope><scope>ARAPS</scope><scope>AZQEC</scope><scope>BBNVY</scope><scope>BENPR</scope><scope>BGLVJ</scope><scope>BHPHI</scope><scope>CCPQU</scope><scope>DWQXO</scope><scope>FR3</scope><scope>FYUFA</scope><scope>GHDGH</scope><scope>GNUQQ</scope><scope>HCIFZ</scope><scope>JQ2</scope><scope>K7-</scope><scope>K9.</scope><scope>LK8</scope><scope>M0N</scope><scope>M0S</scope><scope>M1P</scope><scope>M7P</scope><scope>P5Z</scope><scope>P62</scope><scope>P64</scope><scope>PIMPY</scope><scope>PQEST</scope><scope>PQQKQ</scope><scope>PQUKI</scope><scope>PRINS</scope><scope>Q9U</scope><scope>RC3</scope><scope>7X8</scope><scope>5PM</scope><scope>DOA</scope><orcidid>https://orcid.org/0000-0003-0564-1643</orcidid><orcidid>https://orcid.org/0000-0002-6947-5915</orcidid><orcidid>https://orcid.org/0000-0003-4724-7329</orcidid><orcidid>https://orcid.org/0000-0002-6104-5750</orcidid><orcidid>https://orcid.org/0000-0001-5410-1518</orcidid><orcidid>https://orcid.org/0000-0002-5810-6241</orcidid></search><sort><creationdate>20211101</creationdate><title>Estimating repeat spectra and genome length from low-coverage genome skims with RESPECT</title><author>Sarmashghi, Shahab ; Balaban, Metin ; Rachtman, Eleonora ; Touri, Behrouz ; Mirarab, Siavash ; Bafna, 
Vineet</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-c633t-245f8fe82bbf950710300d1568ff999fba948d96634c84e08ff747492bd188253</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2021</creationdate><topic>Algorithms</topic><topic>Analysis</topic><topic>Animals</topic><topic>Biodiversity</topic><topic>Biology and Life Sciences</topic><topic>Computational Biology</topic><topic>Computer Simulation</topic><topic>Constraint modelling</topic><topic>Contaminants</topic><topic>Databases, Genetic - statistics & numerical data</topic><topic>DNA sequencing</topic><topic>Ecological effects</topic><topic>Empirical analysis</topic><topic>Engineering and Technology</topic><topic>Errors</topic><topic>Estimates</topic><topic>Estimation</topic><topic>Gene sequencing</topic><topic>Genome</topic><topic>Genomes</topic><topic>Genomics</topic><topic>Genomics - statistics & numerical data</topic><topic>Humans</topic><topic>Identification and classification</topic><topic>Invertebrates</topic><topic>Invertebrates - classification</topic><topic>Invertebrates - genetics</topic><topic>Least-Squares Analysis</topic><topic>Linear Models</topic><topic>Linear programming</topic><topic>Mammals - classification</topic><topic>Mammals - genetics</topic><topic>Methods</topic><topic>Models, Genetic</topic><topic>Nucleotide sequencing</topic><topic>Optimization</topic><topic>Parameters</topic><topic>Phylogenetics</topic><topic>Phylogeny</topic><topic>Physical Sciences</topic><topic>Plants - classification</topic><topic>Plants - genetics</topic><topic>Repetitive Sequences, Nucleic Acid</topic><topic>Research and Analysis Methods</topic><topic>Software</topic><topic>Spectra</topic><topic>Taxonomy</topic><topic>Vertebrates - classification</topic><topic>Vertebrates - genetics</topic><toplevel>peer_reviewed</toplevel><toplevel>online_resources</toplevel><creatorcontrib>Sarmashghi, 
Shahab</creatorcontrib><creatorcontrib>Balaban, Metin</creatorcontrib><creatorcontrib>Rachtman, Eleonora</creatorcontrib><creatorcontrib>Touri, Behrouz</creatorcontrib><creatorcontrib>Mirarab, Siavash</creatorcontrib><creatorcontrib>Bafna, Vineet</creatorcontrib><collection>Medline</collection><collection>MEDLINE</collection><collection>MEDLINE (Ovid)</collection><collection>MEDLINE</collection><collection>MEDLINE</collection><collection>PubMed</collection><collection>CrossRef</collection><collection>Gale In Context: Canada</collection><collection>Gale In Context: Science</collection><collection>ProQuest Central (Corporate)</collection><collection>Biotechnology Research Abstracts</collection><collection>Calcium & Calcified Tissue Abstracts</collection><collection>Neurosciences Abstracts</collection><collection>Nucleic Acids Abstracts</collection><collection>Health & Medical Collection</collection><collection>ProQuest Central (purchase pre-March 2016)</collection><collection>Medical Database (Alumni Edition)</collection><collection>Computing Database (Alumni Edition)</collection><collection>Technology Research Database</collection><collection>ProQuest SciTech Collection</collection><collection>ProQuest Technology Collection</collection><collection>ProQuest Natural Science Collection</collection><collection>Hospital Premium Collection</collection><collection>Hospital Premium Collection (Alumni Edition)</collection><collection>ProQuest Central (Alumni) (purchase pre-March 2016)</collection><collection>ProQuest Central (Alumni Edition)</collection><collection>ProQuest One Sustainability</collection><collection>ProQuest Central UK/Ireland</collection><collection>Advanced Technologies & Aerospace Collection</collection><collection>ProQuest Central Essentials</collection><collection>Biological Science Collection</collection><collection>ProQuest Central</collection><collection>Technology Collection</collection><collection>Natural Science 
Collection</collection><collection>ProQuest One Community College</collection><collection>ProQuest Central Korea</collection><collection>Engineering Research Database</collection><collection>Health Research Premium Collection</collection><collection>Health Research Premium Collection (Alumni)</collection><collection>ProQuest Central Student</collection><collection>SciTech Premium Collection</collection><collection>ProQuest Computer Science Collection</collection><collection>Computer Science Database</collection><collection>ProQuest Health & Medical Complete (Alumni)</collection><collection>ProQuest Biological Science Collection</collection><collection>Computing Database</collection><collection>Health & Medical Collection (Alumni Edition)</collection><collection>Medical Database</collection><collection>Biological Science Database</collection><collection>Advanced Technologies & Aerospace Database</collection><collection>ProQuest Advanced Technologies & Aerospace Collection</collection><collection>Biotechnology and BioEngineering Abstracts</collection><collection>Publicly Available Content Database</collection><collection>ProQuest One Academic Eastern Edition (DO NOT USE)</collection><collection>ProQuest One Academic</collection><collection>ProQuest One Academic UKI Edition</collection><collection>ProQuest Central China</collection><collection>ProQuest Central Basic</collection><collection>Genetics Abstracts</collection><collection>MEDLINE - Academic</collection><collection>PubMed Central (Full Participant titles)</collection><collection>DOAJ Directory of Open Access Journals</collection><jtitle>PLoS computational biology</jtitle></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>Sarmashghi, Shahab</au><au>Balaban, Metin</au><au>Rachtman, Eleonora</au><au>Touri, Behrouz</au><au>Mirarab, Siavash</au><au>Bafna, 
Vineet</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>Estimating repeat spectra and genome length from low-coverage genome skims with RESPECT</atitle><jtitle>PLoS computational biology</jtitle><addtitle>PLoS Comput Biol</addtitle><date>2021-11-01</date><risdate>2021</risdate><volume>17</volume><issue>11</issue><spage>e1009449</spage><epage>e1009449</epage><pages>e1009449-e1009449</pages><issn>1553-7358</issn><issn>1553-734X</issn><eissn>1553-7358</eissn><abstract><![CDATA[The cost of sequencing the genome is dropping at a much faster rate compared to assembling and finishing the genome. The use of lightly sampled genomes (genome-skims) could be transformative for genomic ecology, and results using k-mers have shown the advantage of this approach in identification and phylogenetic placement of eukaryotic species. Here, we revisit the basic question of estimating genomic parameters such as genome length, coverage, and repeat structure, focusing specifically on estimating the k-mer repeat spectrum. We show using a mix of theoretical and empirical analysis that there are fundamental limitations to estimating the k-mer spectra due to ill-conditioned systems, and that has implications for other genomic parameters. We get around this problem using a novel constrained optimization approach (Spline Linear Programming), where the constraints are learned empirically. On reads simulated at 1X coverage from 66 genomes, our method, REPeat SPECTra Estimation (RESPECT), had 2.2% error in length estimation compared to 27% error previously achieved. In shotgun sequenced read samples with contaminants, RESPECT length estimates had median error 4%, in contrast to other methods that had median error 80%. Together, the results suggest that low-pass genomic sequencing can yield reliable estimates of the length and repeat content of the genome. 
The RESPECT software will be publicly available at https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_shahab-2Dsarmashghi_RESPECT.git&d=DwIGAw&c=-35OiAkTchMrZOngvJPOeA&r=ZozViWvD1E8PorCkfwYKYQMVKFoEcqLFm4Tg49XnPcA&m=f-xS8GMHKckknkc7Xpp8FJYw_ltUwz5frOw1a5pJ81EpdTOK8xhbYmrN4ZxniM96&s=717o8hLR1JmHFpRPSWG6xdUQTikyUjicjkipjFsKG4w&e=.]]></abstract><cop>United States</cop><pub>Public Library of Science</pub><pmid>34780468</pmid><doi>10.1371/journal.pcbi.1009449</doi><orcidid>https://orcid.org/0000-0003-0564-1643</orcidid><orcidid>https://orcid.org/0000-0002-6947-5915</orcidid><orcidid>https://orcid.org/0000-0003-4724-7329</orcidid><orcidid>https://orcid.org/0000-0002-6104-5750</orcidid><orcidid>https://orcid.org/0000-0001-5410-1518</orcidid><orcidid>https://orcid.org/0000-0002-5810-6241</orcidid><oa>free_for_read</oa></addata></record>
fulltext	fulltext
identifier	ISSN: 1553-7358
ispartof	PLoS computational biology, 2021-11, Vol.17 (11), p.e1009449-e1009449
issn	1553-7358 1553-734X 1553-7358
language	eng
recordid	cdi_plos_journals_2610945737
source	MEDLINE; DOAJ Directory of Open Access Journals; Elektronische Zeitschriftenbibliothek - Frei zugängliche E-Journals; PubMed Central; Public Library of Science (PLoS)
subjects	Algorithms; Analysis; Animals; Biodiversity; Biology and Life Sciences; Computational Biology; Computer Simulation; Constraint modelling; Contaminants; Databases, Genetic - statistics & numerical data; DNA sequencing; Ecological effects; Empirical analysis; Engineering and Technology; Errors; Estimates; Estimation; Gene sequencing; Genome; Genomes; Genomics; Genomics - statistics & numerical data; Humans; Identification and classification; Invertebrates; Invertebrates - classification; Invertebrates - genetics; Least-Squares Analysis; Linear Models; Linear programming; Mammals - classification; Mammals - genetics; Methods; Models, Genetic; Nucleotide sequencing; Optimization; Parameters; Phylogenetics; Phylogeny; Physical Sciences; Plants - classification; Plants - genetics; Repetitive Sequences, Nucleic Acid; Research and Analysis Methods; Software; Spectra; Taxonomy; Vertebrates - classification; Vertebrates - genetics
title	Estimating repeat spectra and genome length from low-coverage genome skims with RESPECT
url	https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-02-02T06%3A09%3A51IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-gale_plos_&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Estimating%20repeat%20spectra%20and%20genome%20length%20from%20low-coverage%20genome%20skims%20with%20RESPECT&rft.jtitle=PLoS%20computational%20biology&rft.au=Sarmashghi,%20Shahab&rft.date=2021-11-01&rft.volume=17&rft.issue=11&rft.spage=e1009449&rft.epage=e1009449&rft.pages=e1009449-e1009449&rft.issn=1553-7358&rft.eissn=1553-7358&rft_id=info:doi/10.1371/journal.pcbi.1009449&rft_dat=%3Cgale_plos_%3EA684564392%3C/gale_plos_%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=2610945737&rft_id=info:pmid/34780468&rft_galeid=A684564392&rft_doaj_id=oai_doaj_org_article_1b2807756750402fbe091bd2503aa6e9&rfr_iscdi=true