Characterizing and measuring bias in sequence data
BACKGROUND: DNA sequencing technologies deviate from the ideal uniform distribution of reads. These biases impair scientific and medical applications. Accordingly, we have developed computational methods for discovering, describing and measuring bias. RESULTS: We applied these methods to the Illumin...
Published in: | Genome Biology (Online Edition) 2013-05, Vol.14 (5), p.R51-R51, Article R51 |
---|---|
Main authors: | Ross, Michael G; Russ, Carsten; Costello, Maura; Hollinger, Andrew; Lennon, Niall J; Hegarty, Ryan; Nusbaum, Chad; Jaffe, David B |
Format: | Article |
Language: | eng |
Subjects: | |
Online access: | Full text |
container_end_page | R51 |
---|---|
container_issue | 5 |
container_start_page | R51 |
container_title | Genome Biology (Online Edition) |
container_volume | 14 |
creator | Ross, Michael G; Russ, Carsten; Costello, Maura; Hollinger, Andrew; Lennon, Niall J; Hegarty, Ryan; Nusbaum, Chad; Jaffe, David B |
description | BACKGROUND: DNA sequencing technologies deviate from the ideal uniform distribution of reads. These biases impair scientific and medical applications. Accordingly, we have developed computational methods for discovering, describing and measuring bias. RESULTS: We applied these methods to the Illumina, Ion Torrent, Pacific Biosciences and Complete Genomics sequencing platforms, using data from human and from a set of microbes with diverse base compositions. As in previous work, library construction conditions significantly influence sequencing bias. Pacific Biosciences coverage levels are the least biased, followed by Illumina, although all technologies exhibit error-rate biases in high- and low-GC regions and at long homopolymer runs. The GC-rich regions prone to low coverage include a number of human promoters, so we therefore catalog 1,000 that were exceptionally resistant to sequencing. Our results indicate that combining data from two technologies can reduce coverage bias if the biases in the component technologies are complementary and of similar magnitude. Analysis of Illumina data representing 120-fold coverage of a well-studied human sample reveals that 0.20% of the autosomal genome was covered at less than 10% of the genome-wide average. Excluding locations that were similar to known bias motifs or likely due to sample-reference variations left only 0.045% of the autosomal genome with unexplained poor coverage. CONCLUSIONS: The assays presented in this paper provide a comprehensive view of sequencing bias, which can be used to drive laboratory improvements and to monitor production processes. Development guided by these assays should result in improved genome assemblies and better coverage of biologically important loci. |
doi_str_mv | 10.1186/gb-2013-14-5-r51 |
format | Article |
fullrecord | <record><control><sourceid>gale_pubme</sourceid><recordid>TN_cdi_pubmedcentral_primary_oai_pubmedcentral_nih_gov_4053816</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><galeid>A534778677</galeid><sourcerecordid>A534778677</sourcerecordid><originalsourceid>FETCH-LOGICAL-b747t-45a4dc4f327c9ee13a89d8398b49f664ba236caca6d55da38cacdadb3043ab73</originalsourceid><addsrcrecordid>eNqNks1rFjEQxhdRbK3ePeke9bA12XxfhPJStVhQsJ7D5GO3kd1NTXZF-9ebZeuLLyg0l0wyv3kYnpmqeo7RKcaSv-lN0yJMGkwb1iSGH1THmHLWcIXpw32M-FH1JOdvCGEhOX9cHbVEYCkEOa7a3TUksLNP4TZMfQ2Tq0cPeUnrywTIdZjq7L8vfrK-djDD0-pRB0P2z-7uk-rq3fnV7kNz-en9xe7ssjGCirmhDKiztCOtsMp7TEAqJ4mShqqOc2qgJdyCBe4Yc0BkiR04QxAlYAQ5qd5usjeLGb2zfpoTDPomhRHSLx0h6MPMFK51H39oihiRmBeB3SZgQvyPwGHGxlH3Rq-Oakw108XRovLqro0Uiwl51mPI1g8DTD4uWWNVDlVM3gNlVHJCWnUPlGJFuGiRLOjphvYweB2mLpZmV9-cH4ONk-9C-T9jhIoyXbEa9_qgoDCz_zn3sOSsP36-OGTRxtoUc06-29uDkV5X7F-GvPh7LvuCPztVgJcb0EHU0KeQ9dcvRYMihBRiVJHfSPzWYA</addsrcrecordid><sourcetype>Open Access Repository</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>1419367208</pqid></control><display><type>article</type><title>Characterizing and measuring bias in sequence data</title><source>MEDLINE</source><source>DOAJ Directory of Open Access Journals</source><source>Elektronische Zeitschriftenbibliothek - Frei zugängliche E-Journals</source><source>Springer Nature OA Free Journals</source><source>Springer Nature - Complete Springer Journals</source><source>PubMed Central</source><creator>Ross, Michael G ; Russ, Carsten ; Costello, Maura ; Hollinger, Andrew ; Lennon, Niall J ; Hegarty, Ryan ; Nusbaum, Chad ; Jaffe, David B</creator><creatorcontrib>Ross, Michael G ; Russ, Carsten ; Costello, Maura ; Hollinger, Andrew ; Lennon, Niall J ; Hegarty, Ryan ; Nusbaum, Chad ; Jaffe, David B</creatorcontrib><description>BACKGROUND: DNA sequencing technologies deviate from the ideal uniform distribution of reads. These biases impair scientific and medical applications. 
Accordingly, we have developed computational methods for discovering, describing and measuring bias. RESULTS: We applied these methods to the Illumina, Ion Torrent, Pacific Biosciences and Complete Genomics sequencing platforms, using data from human and from a set of microbes with diverse base compositions. As in previous work, library construction conditions significantly influence sequencing bias. Pacific Biosciences coverage levels are the least biased, followed by Illumina, although all technologies exhibit error-rate biases in high- and low-GC regions and at long homopolymer runs. The GC-rich regions prone to low coverage include a number of human promoters, so we therefore catalog 1,000 that were exceptionally resistant to sequencing. Our results indicate that combining data from two technologies can reduce coverage bias if the biases in the component technologies are complementary and of similar magnitude. Analysis of Illumina data representing 120-fold coverage of a well-studied human sample reveals that 0.20% of the autosomal genome was covered at less than 10% of the genome-wide average. Excluding locations that were similar to known bias motifs or likely due to sample-reference variations left only 0.045% of the autosomal genome with unexplained poor coverage. CONCLUSIONS: The assays presented in this paper provide a comprehensive view of sequencing bias, which can be used to drive laboratory improvements and to monitor production processes. 
Development guided by these assays should result in improved genome assemblies and better coverage of biologically important loci.</description><identifier>ISSN: 1465-6906</identifier><identifier>ISSN: 1474-760X</identifier><identifier>EISSN: 1465-6914</identifier><identifier>EISSN: 1474-760X</identifier><identifier>DOI: 10.1186/gb-2013-14-5-r51</identifier><identifier>PMID: 23718773</identifier><language>eng</language><publisher>England: Springer-Verlag</publisher><subject>Algorithms ; Base Composition ; DNA sequencing ; genome ; Genome, Bacterial ; Genome, Human ; Genome, Protozoan ; Genomes ; Genomics ; Genomics - methods ; Humans ; instrumentation ; loci ; Methods ; microorganisms ; Nucleotide sequencing ; Promoter Regions, Genetic ; sequence analysis ; Sequence Analysis, DNA ; Sequence Analysis, DNA - instrumentation ; Sequence Analysis, DNA - methods</subject><ispartof>Genome Biology (Online Edition), 2013-05, Vol.14 (5), p.R51-R51, Article R51</ispartof><rights>COPYRIGHT 2013 BioMed Central Ltd.</rights><rights>Copyright © 2013 Ross et al.; licensee BioMed Central Ltd. 
2013 Ross et al.; licensee BioMed Central Ltd.</rights><lds50>peer_reviewed</lds50><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed><citedby>FETCH-LOGICAL-b747t-45a4dc4f327c9ee13a89d8398b49f664ba236caca6d55da38cacdadb3043ab73</citedby><cites>FETCH-LOGICAL-b747t-45a4dc4f327c9ee13a89d8398b49f664ba236caca6d55da38cacdadb3043ab73</cites></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><linktopdf>$$Uhttps://www.ncbi.nlm.nih.gov/pmc/articles/PMC4053816/pdf/$$EPDF$$P50$$Gpubmedcentral$$Hfree_for_read</linktopdf><linktohtml>$$Uhttps://www.ncbi.nlm.nih.gov/pmc/articles/PMC4053816/$$EHTML$$P50$$Gpubmedcentral$$Hfree_for_read</linktohtml><link.rule.ids>230,314,723,776,780,860,881,27903,27904,53770,53772</link.rule.ids><backlink>$$Uhttps://www.ncbi.nlm.nih.gov/pubmed/23718773$$D View this record in MEDLINE/PubMed$$Hfree_for_read</backlink></links><search><creatorcontrib>Ross, Michael G</creatorcontrib><creatorcontrib>Russ, Carsten</creatorcontrib><creatorcontrib>Costello, Maura</creatorcontrib><creatorcontrib>Hollinger, Andrew</creatorcontrib><creatorcontrib>Lennon, Niall J</creatorcontrib><creatorcontrib>Hegarty, Ryan</creatorcontrib><creatorcontrib>Nusbaum, Chad</creatorcontrib><creatorcontrib>Jaffe, David B</creatorcontrib><title>Characterizing and measuring bias in sequence data</title><title>Genome Biology (Online Edition)</title><addtitle>Genome Biol</addtitle><description>BACKGROUND: DNA sequencing technologies deviate from the ideal uniform distribution of reads. These biases impair scientific and medical applications. Accordingly, we have developed computational methods for discovering, describing and measuring bias. RESULTS: We applied these methods to the Illumina, Ion Torrent, Pacific Biosciences and Complete Genomics sequencing platforms, using data from human and from a set of microbes with diverse base compositions. 
As in previous work, library construction conditions significantly influence sequencing bias. Pacific Biosciences coverage levels are the least biased, followed by Illumina, although all technologies exhibit error-rate biases in high- and low-GC regions and at long homopolymer runs. The GC-rich regions prone to low coverage include a number of human promoters, so we therefore catalog 1,000 that were exceptionally resistant to sequencing. Our results indicate that combining data from two technologies can reduce coverage bias if the biases in the component technologies are complementary and of similar magnitude. Analysis of Illumina data representing 120-fold coverage of a well-studied human sample reveals that 0.20% of the autosomal genome was covered at less than 10% of the genome-wide average. Excluding locations that were similar to known bias motifs or likely due to sample-reference variations left only 0.045% of the autosomal genome with unexplained poor coverage. CONCLUSIONS: The assays presented in this paper provide a comprehensive view of sequencing bias, which can be used to drive laboratory improvements and to monitor production processes. 
Development guided by these assays should result in improved genome assemblies and better coverage of biologically important loci.</description><subject>Algorithms</subject><subject>Base Composition</subject><subject>DNA sequencing</subject><subject>genome</subject><subject>Genome, Bacterial</subject><subject>Genome, Human</subject><subject>Genome, Protozoan</subject><subject>Genomes</subject><subject>Genomics</subject><subject>Genomics - methods</subject><subject>Humans</subject><subject>instrumentation</subject><subject>loci</subject><subject>Methods</subject><subject>microorganisms</subject><subject>Nucleotide sequencing</subject><subject>Promoter Regions, Genetic</subject><subject>sequence analysis</subject><subject>Sequence Analysis, DNA</subject><subject>Sequence Analysis, DNA - instrumentation</subject><subject>Sequence Analysis, DNA - methods</subject><issn>1465-6906</issn><issn>1474-760X</issn><issn>1465-6914</issn><issn>1474-760X</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2013</creationdate><recordtype>article</recordtype><sourceid>EIF</sourceid><sourceid>KPI</sourceid><recordid>eNqNks1rFjEQxhdRbK3ePeke9bA12XxfhPJStVhQsJ7D5GO3kd1NTXZF-9ebZeuLLyg0l0wyv3kYnpmqeo7RKcaSv-lN0yJMGkwb1iSGH1THmHLWcIXpw32M-FH1JOdvCGEhOX9cHbVEYCkEOa7a3TUksLNP4TZMfQ2Tq0cPeUnrywTIdZjq7L8vfrK-djDD0-pRB0P2z-7uk-rq3fnV7kNz-en9xe7ssjGCirmhDKiztCOtsMp7TEAqJ4mShqqOc2qgJdyCBe4Yc0BkiR04QxAlYAQ5qd5usjeLGb2zfpoTDPomhRHSLx0h6MPMFK51H39oihiRmBeB3SZgQvyPwGHGxlH3Rq-Oakw108XRovLqro0Uiwl51mPI1g8DTD4uWWNVDlVM3gNlVHJCWnUPlGJFuGiRLOjphvYweB2mLpZmV9-cH4ONk-9C-T9jhIoyXbEa9_qgoDCz_zn3sOSsP36-OGTRxtoUc06-29uDkV5X7F-GvPh7LvuCPztVgJcb0EHU0KeQ9dcvRYMihBRiVJHfSPzWYA</recordid><startdate>20130529</startdate><enddate>20130529</enddate><creator>Ross, Michael G</creator><creator>Russ, Carsten</creator><creator>Costello, Maura</creator><creator>Hollinger, Andrew</creator><creator>Lennon, Niall J</creator><creator>Hegarty, Ryan</creator><creator>Nusbaum, Chad</creator><creator>Jaffe, 
David B</creator><general>Springer-Verlag</general><general>BioMed Central Ltd</general><general>BioMed Central</general><scope>FBQ</scope><scope>CGR</scope><scope>CUY</scope><scope>CVF</scope><scope>ECM</scope><scope>EIF</scope><scope>NPM</scope><scope>AAYXX</scope><scope>CITATION</scope><scope>KPI</scope><scope>IAO</scope><scope>7TM</scope><scope>8FD</scope><scope>FR3</scope><scope>P64</scope><scope>RC3</scope><scope>7X8</scope><scope>7S9</scope><scope>L.6</scope><scope>5PM</scope></search><sort><creationdate>20130529</creationdate><title>Characterizing and measuring bias in sequence data</title><author>Ross, Michael G ; Russ, Carsten ; Costello, Maura ; Hollinger, Andrew ; Lennon, Niall J ; Hegarty, Ryan ; Nusbaum, Chad ; Jaffe, David B</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-b747t-45a4dc4f327c9ee13a89d8398b49f664ba236caca6d55da38cacdadb3043ab73</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2013</creationdate><topic>Algorithms</topic><topic>Base Composition</topic><topic>DNA sequencing</topic><topic>genome</topic><topic>Genome, Bacterial</topic><topic>Genome, Human</topic><topic>Genome, Protozoan</topic><topic>Genomes</topic><topic>Genomics</topic><topic>Genomics - methods</topic><topic>Humans</topic><topic>instrumentation</topic><topic>loci</topic><topic>Methods</topic><topic>microorganisms</topic><topic>Nucleotide sequencing</topic><topic>Promoter Regions, Genetic</topic><topic>sequence analysis</topic><topic>Sequence Analysis, DNA</topic><topic>Sequence Analysis, DNA - instrumentation</topic><topic>Sequence Analysis, DNA - methods</topic><toplevel>peer_reviewed</toplevel><toplevel>online_resources</toplevel><creatorcontrib>Ross, Michael G</creatorcontrib><creatorcontrib>Russ, Carsten</creatorcontrib><creatorcontrib>Costello, Maura</creatorcontrib><creatorcontrib>Hollinger, Andrew</creatorcontrib><creatorcontrib>Lennon, Niall 
J</creatorcontrib><creatorcontrib>Hegarty, Ryan</creatorcontrib><creatorcontrib>Nusbaum, Chad</creatorcontrib><creatorcontrib>Jaffe, David B</creatorcontrib><collection>AGRIS</collection><collection>Medline</collection><collection>MEDLINE</collection><collection>MEDLINE (Ovid)</collection><collection>MEDLINE</collection><collection>MEDLINE</collection><collection>PubMed</collection><collection>CrossRef</collection><collection>Gale In Context: Global Issues</collection><collection>Gale Academic OneFile</collection><collection>Nucleic Acids Abstracts</collection><collection>Technology Research Database</collection><collection>Engineering Research Database</collection><collection>Biotechnology and BioEngineering Abstracts</collection><collection>Genetics Abstracts</collection><collection>MEDLINE - Academic</collection><collection>AGRICOLA</collection><collection>AGRICOLA - Academic</collection><collection>PubMed Central (Full Participant titles)</collection><jtitle>Genome Biology (Online Edition)</jtitle></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>Ross, Michael G</au><au>Russ, Carsten</au><au>Costello, Maura</au><au>Hollinger, Andrew</au><au>Lennon, Niall J</au><au>Hegarty, Ryan</au><au>Nusbaum, Chad</au><au>Jaffe, David B</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>Characterizing and measuring bias in sequence data</atitle><jtitle>Genome Biology (Online Edition)</jtitle><addtitle>Genome Biol</addtitle><date>2013-05-29</date><risdate>2013</risdate><volume>14</volume><issue>5</issue><spage>R51</spage><epage>R51</epage><pages>R51-R51</pages><artnum>R51</artnum><issn>1465-6906</issn><issn>1474-760X</issn><eissn>1465-6914</eissn><eissn>1474-760X</eissn><abstract>BACKGROUND: DNA sequencing technologies deviate from the ideal uniform distribution of reads. These biases impair scientific and medical applications. 
Accordingly, we have developed computational methods for discovering, describing and measuring bias. RESULTS: We applied these methods to the Illumina, Ion Torrent, Pacific Biosciences and Complete Genomics sequencing platforms, using data from human and from a set of microbes with diverse base compositions. As in previous work, library construction conditions significantly influence sequencing bias. Pacific Biosciences coverage levels are the least biased, followed by Illumina, although all technologies exhibit error-rate biases in high- and low-GC regions and at long homopolymer runs. The GC-rich regions prone to low coverage include a number of human promoters, so we therefore catalog 1,000 that were exceptionally resistant to sequencing. Our results indicate that combining data from two technologies can reduce coverage bias if the biases in the component technologies are complementary and of similar magnitude. Analysis of Illumina data representing 120-fold coverage of a well-studied human sample reveals that 0.20% of the autosomal genome was covered at less than 10% of the genome-wide average. Excluding locations that were similar to known bias motifs or likely due to sample-reference variations left only 0.045% of the autosomal genome with unexplained poor coverage. CONCLUSIONS: The assays presented in this paper provide a comprehensive view of sequencing bias, which can be used to drive laboratory improvements and to monitor production processes. Development guided by these assays should result in improved genome assemblies and better coverage of biologically important loci.</abstract><cop>England</cop><pub>Springer-Verlag</pub><pmid>23718773</pmid><doi>10.1186/gb-2013-14-5-r51</doi><tpages>1</tpages><oa>free_for_read</oa></addata></record> |
fulltext | fulltext |
identifier | ISSN: 1465-6906 |
ispartof | Genome Biology (Online Edition), 2013-05, Vol.14 (5), p.R51-R51, Article R51 |
issn | 1465-6906; 1474-760X; 1465-6914; 1474-760X |
language | eng |
recordid | cdi_pubmedcentral_primary_oai_pubmedcentral_nih_gov_4053816 |
source | MEDLINE; DOAJ Directory of Open Access Journals; Elektronische Zeitschriftenbibliothek - Frei zugängliche E-Journals; Springer Nature OA Free Journals; Springer Nature - Complete Springer Journals; PubMed Central |
subjects | Algorithms; Base Composition; DNA sequencing; genome; Genome, Bacterial; Genome, Human; Genome, Protozoan; Genomes; Genomics; Genomics - methods; Humans; instrumentation; loci; Methods; microorganisms; Nucleotide sequencing; Promoter Regions, Genetic; sequence analysis; Sequence Analysis, DNA; Sequence Analysis, DNA - instrumentation; Sequence Analysis, DNA - methods |
title | Characterizing and measuring bias in sequence data |
url | https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-23T10%3A29%3A24IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-gale_pubme&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Characterizing%20and%20measuring%20bias%20in%20sequence%20data&rft.jtitle=Genome%20Biology%20(Online%20Edition)&rft.au=Ross,%20Michael%20G&rft.date=2013-05-29&rft.volume=14&rft.issue=5&rft.spage=R51&rft.epage=R51&rft.pages=R51-R51&rft.artnum=R51&rft.issn=1465-6906&rft.eissn=1465-6914&rft_id=info:doi/10.1186/gb-2013-14-5-r51&rft_dat=%3Cgale_pubme%3EA534778677%3C/gale_pubme%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=1419367208&rft_id=info:pmid/23718773&rft_galeid=A534778677&rfr_iscdi=true |
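
The abstract reports the share of the autosomal genome covered at less than 10% of the genome-wide average. As a rough illustration only, that kind of statistic could be computed from per-base read depths along the lines below. This is a minimal sketch, not the authors' assay: the function name `low_coverage_fraction`, the parameter `threshold_ratio`, and the toy depth values are all assumptions made for the example, and the paper's actual procedure (reference handling, autosome selection, exclusion of known bias motifs) is not reproduced here.

```python
from statistics import mean


def low_coverage_fraction(depths, threshold_ratio=0.10):
    """Return the fraction of positions whose depth is below
    threshold_ratio times the mean depth over all positions.

    Hypothetical helper for illustration; not taken from the paper.
    """
    if not depths:
        raise ValueError("depths must be non-empty")
    # Cutoff is relative to the overall average, as in the abstract's
    # "less than 10% of the genome-wide average".
    cutoff = threshold_ratio * mean(depths)
    low = sum(1 for d in depths if d < cutoff)
    return low / len(depths)


# Toy per-base depths (made-up numbers, not data from the study).
toy_depths = [120, 118, 125, 0, 130, 5, 122, 119, 121, 140]
fraction = low_coverage_fraction(toy_depths)
print(f"{fraction:.2%} of positions fall below the cutoff")
```

On real data the depths would typically come from an alignment-derived coverage track rather than an in-memory list, and a streaming computation would be preferable for a whole genome.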