Characterizing and measuring bias in sequence data

BACKGROUND: DNA sequencing technologies deviate from the ideal uniform distribution of reads. These biases impair scientific and medical applications. Accordingly, we have developed computational methods for discovering, describing and measuring bias. RESULTS: We applied these methods to the Illumin...

Bibliographic details
Published in: Genome Biology (Online Edition) 2013-05, Vol.14 (5), p.R51-R51, Article R51
Main authors: Ross, Michael G, Russ, Carsten, Costello, Maura, Hollinger, Andrew, Lennon, Niall J, Hegarty, Ryan, Nusbaum, Chad, Jaffe, David B
Format: Article
Language: eng
Subjects:
Online access: Full text
container_end_page R51
container_issue 5
container_start_page R51
container_title Genome Biology (Online Edition)
container_volume 14
creator Ross, Michael G
Russ, Carsten
Costello, Maura
Hollinger, Andrew
Lennon, Niall J
Hegarty, Ryan
Nusbaum, Chad
Jaffe, David B
description BACKGROUND: DNA sequencing technologies deviate from the ideal uniform distribution of reads. These biases impair scientific and medical applications. Accordingly, we have developed computational methods for discovering, describing and measuring bias. RESULTS: We applied these methods to the Illumina, Ion Torrent, Pacific Biosciences and Complete Genomics sequencing platforms, using data from human and from a set of microbes with diverse base compositions. As in previous work, library construction conditions significantly influence sequencing bias. Pacific Biosciences coverage levels are the least biased, followed by Illumina, although all technologies exhibit error-rate biases in high- and low-GC regions and at long homopolymer runs. The GC-rich regions prone to low coverage include a number of human promoters, so we therefore catalog 1,000 that were exceptionally resistant to sequencing. Our results indicate that combining data from two technologies can reduce coverage bias if the biases in the component technologies are complementary and of similar magnitude. Analysis of Illumina data representing 120-fold coverage of a well-studied human sample reveals that 0.20% of the autosomal genome was covered at less than 10% of the genome-wide average. Excluding locations that were similar to known bias motifs or likely due to sample-reference variations left only 0.045% of the autosomal genome with unexplained poor coverage. CONCLUSIONS: The assays presented in this paper provide a comprehensive view of sequencing bias, which can be used to drive laboratory improvements and to monitor production processes. Development guided by these assays should result in improved genome assemblies and better coverage of biologically important loci.
doi_str_mv 10.1186/gb-2013-14-5-r51
format Article
fullrecord <record><control><sourceid>gale_pubme</sourceid><recordid>TN_cdi_pubmedcentral_primary_oai_pubmedcentral_nih_gov_4053816</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><galeid>A534778677</galeid><sourcerecordid>A534778677</sourcerecordid><originalsourceid>FETCH-LOGICAL-b747t-45a4dc4f327c9ee13a89d8398b49f664ba236caca6d55da38cacdadb3043ab73</originalsourceid><addsrcrecordid>eNqNks1rFjEQxhdRbK3ePeke9bA12XxfhPJStVhQsJ7D5GO3kd1NTXZF-9ebZeuLLyg0l0wyv3kYnpmqeo7RKcaSv-lN0yJMGkwb1iSGH1THmHLWcIXpw32M-FH1JOdvCGEhOX9cHbVEYCkEOa7a3TUksLNP4TZMfQ2Tq0cPeUnrywTIdZjq7L8vfrK-djDD0-pRB0P2z-7uk-rq3fnV7kNz-en9xe7ssjGCirmhDKiztCOtsMp7TEAqJ4mShqqOc2qgJdyCBe4Yc0BkiR04QxAlYAQ5qd5usjeLGb2zfpoTDPomhRHSLx0h6MPMFK51H39oihiRmBeB3SZgQvyPwGHGxlH3Rq-Oakw108XRovLqro0Uiwl51mPI1g8DTD4uWWNVDlVM3gNlVHJCWnUPlGJFuGiRLOjphvYweB2mLpZmV9-cH4ONk-9C-T9jhIoyXbEa9_qgoDCz_zn3sOSsP36-OGTRxtoUc06-29uDkV5X7F-GvPh7LvuCPztVgJcb0EHU0KeQ9dcvRYMihBRiVJHfSPzWYA</addsrcrecordid><sourcetype>Open Access Repository</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>1419367208</pqid></control><display><type>article</type><title>Characterizing and measuring bias in sequence data</title><source>MEDLINE</source><source>DOAJ Directory of Open Access Journals</source><source>Elektronische Zeitschriftenbibliothek - Frei zugängliche E-Journals</source><source>Springer Nature OA Free Journals</source><source>Springer Nature - Complete Springer Journals</source><source>PubMed Central</source><creator>Ross, Michael G ; Russ, Carsten ; Costello, Maura ; Hollinger, Andrew ; Lennon, Niall J ; Hegarty, Ryan ; Nusbaum, Chad ; Jaffe, David B</creator><creatorcontrib>Ross, Michael G ; Russ, Carsten ; Costello, Maura ; Hollinger, Andrew ; Lennon, Niall J ; Hegarty, Ryan ; Nusbaum, Chad ; Jaffe, David B</creatorcontrib><description>BACKGROUND: DNA sequencing technologies deviate from the ideal uniform distribution of reads. These biases impair scientific and medical applications. 
Accordingly, we have developed computational methods for discovering, describing and measuring bias. RESULTS: We applied these methods to the Illumina, Ion Torrent, Pacific Biosciences and Complete Genomics sequencing platforms, using data from human and from a set of microbes with diverse base compositions. As in previous work, library construction conditions significantly influence sequencing bias. Pacific Biosciences coverage levels are the least biased, followed by Illumina, although all technologies exhibit error-rate biases in high- and low-GC regions and at long homopolymer runs. The GC-rich regions prone to low coverage include a number of human promoters, so we therefore catalog 1,000 that were exceptionally resistant to sequencing. Our results indicate that combining data from two technologies can reduce coverage bias if the biases in the component technologies are complementary and of similar magnitude. Analysis of Illumina data representing 120-fold coverage of a well-studied human sample reveals that 0.20% of the autosomal genome was covered at less than 10% of the genome-wide average. Excluding locations that were similar to known bias motifs or likely due to sample-reference variations left only 0.045% of the autosomal genome with unexplained poor coverage. CONCLUSIONS: The assays presented in this paper provide a comprehensive view of sequencing bias, which can be used to drive laboratory improvements and to monitor production processes. 
Development guided by these assays should result in improved genome assemblies and better coverage of biologically important loci.</description><identifier>ISSN: 1465-6906</identifier><identifier>ISSN: 1474-760X</identifier><identifier>EISSN: 1465-6914</identifier><identifier>EISSN: 1474-760X</identifier><identifier>DOI: 10.1186/gb-2013-14-5-r51</identifier><identifier>PMID: 23718773</identifier><language>eng</language><publisher>England: Springer-Verlag</publisher><subject>Algorithms ; Base Composition ; DNA sequencing ; genome ; Genome, Bacterial ; Genome, Human ; Genome, Protozoan ; Genomes ; Genomics ; Genomics - methods ; Humans ; instrumentation ; loci ; Methods ; microorganisms ; Nucleotide sequencing ; Promoter Regions, Genetic ; sequence analysis ; Sequence Analysis, DNA ; Sequence Analysis, DNA - instrumentation ; Sequence Analysis, DNA - methods</subject><ispartof>Genome Biology (Online Edition), 2013-05, Vol.14 (5), p.R51-R51, Article R51</ispartof><rights>COPYRIGHT 2013 BioMed Central Ltd.</rights><rights>Copyright © 2013 Ross et al.; licensee BioMed Central Ltd. 
2013 Ross et al.; licensee BioMed Central Ltd.</rights><lds50>peer_reviewed</lds50><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed><citedby>FETCH-LOGICAL-b747t-45a4dc4f327c9ee13a89d8398b49f664ba236caca6d55da38cacdadb3043ab73</citedby><cites>FETCH-LOGICAL-b747t-45a4dc4f327c9ee13a89d8398b49f664ba236caca6d55da38cacdadb3043ab73</cites></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><linktopdf>$$Uhttps://www.ncbi.nlm.nih.gov/pmc/articles/PMC4053816/pdf/$$EPDF$$P50$$Gpubmedcentral$$Hfree_for_read</linktopdf><linktohtml>$$Uhttps://www.ncbi.nlm.nih.gov/pmc/articles/PMC4053816/$$EHTML$$P50$$Gpubmedcentral$$Hfree_for_read</linktohtml><link.rule.ids>230,314,723,776,780,860,881,27903,27904,53770,53772</link.rule.ids><backlink>$$Uhttps://www.ncbi.nlm.nih.gov/pubmed/23718773$$D View this record in MEDLINE/PubMed$$Hfree_for_read</backlink></links><search><creatorcontrib>Ross, Michael G</creatorcontrib><creatorcontrib>Russ, Carsten</creatorcontrib><creatorcontrib>Costello, Maura</creatorcontrib><creatorcontrib>Hollinger, Andrew</creatorcontrib><creatorcontrib>Lennon, Niall J</creatorcontrib><creatorcontrib>Hegarty, Ryan</creatorcontrib><creatorcontrib>Nusbaum, Chad</creatorcontrib><creatorcontrib>Jaffe, David B</creatorcontrib><title>Characterizing and measuring bias in sequence data</title><title>Genome Biology (Online Edition)</title><addtitle>Genome Biol</addtitle><description>BACKGROUND: DNA sequencing technologies deviate from the ideal uniform distribution of reads. These biases impair scientific and medical applications. Accordingly, we have developed computational methods for discovering, describing and measuring bias. RESULTS: We applied these methods to the Illumina, Ion Torrent, Pacific Biosciences and Complete Genomics sequencing platforms, using data from human and from a set of microbes with diverse base compositions. 
As in previous work, library construction conditions significantly influence sequencing bias. Pacific Biosciences coverage levels are the least biased, followed by Illumina, although all technologies exhibit error-rate biases in high- and low-GC regions and at long homopolymer runs. The GC-rich regions prone to low coverage include a number of human promoters, so we therefore catalog 1,000 that were exceptionally resistant to sequencing. Our results indicate that combining data from two technologies can reduce coverage bias if the biases in the component technologies are complementary and of similar magnitude. Analysis of Illumina data representing 120-fold coverage of a well-studied human sample reveals that 0.20% of the autosomal genome was covered at less than 10% of the genome-wide average. Excluding locations that were similar to known bias motifs or likely due to sample-reference variations left only 0.045% of the autosomal genome with unexplained poor coverage. CONCLUSIONS: The assays presented in this paper provide a comprehensive view of sequencing bias, which can be used to drive laboratory improvements and to monitor production processes. 
Development guided by these assays should result in improved genome assemblies and better coverage of biologically important loci.</description><subject>Algorithms</subject><subject>Base Composition</subject><subject>DNA sequencing</subject><subject>genome</subject><subject>Genome, Bacterial</subject><subject>Genome, Human</subject><subject>Genome, Protozoan</subject><subject>Genomes</subject><subject>Genomics</subject><subject>Genomics - methods</subject><subject>Humans</subject><subject>instrumentation</subject><subject>loci</subject><subject>Methods</subject><subject>microorganisms</subject><subject>Nucleotide sequencing</subject><subject>Promoter Regions, Genetic</subject><subject>sequence analysis</subject><subject>Sequence Analysis, DNA</subject><subject>Sequence Analysis, DNA - instrumentation</subject><subject>Sequence Analysis, DNA - methods</subject><issn>1465-6906</issn><issn>1474-760X</issn><issn>1465-6914</issn><issn>1474-760X</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2013</creationdate><recordtype>article</recordtype><sourceid>EIF</sourceid><sourceid>KPI</sourceid><recordid>eNqNks1rFjEQxhdRbK3ePeke9bA12XxfhPJStVhQsJ7D5GO3kd1NTXZF-9ebZeuLLyg0l0wyv3kYnpmqeo7RKcaSv-lN0yJMGkwb1iSGH1THmHLWcIXpw32M-FH1JOdvCGEhOX9cHbVEYCkEOa7a3TUksLNP4TZMfQ2Tq0cPeUnrywTIdZjq7L8vfrK-djDD0-pRB0P2z-7uk-rq3fnV7kNz-en9xe7ssjGCirmhDKiztCOtsMp7TEAqJ4mShqqOc2qgJdyCBe4Yc0BkiR04QxAlYAQ5qd5usjeLGb2zfpoTDPomhRHSLx0h6MPMFK51H39oihiRmBeB3SZgQvyPwGHGxlH3Rq-Oakw108XRovLqro0Uiwl51mPI1g8DTD4uWWNVDlVM3gNlVHJCWnUPlGJFuGiRLOjphvYweB2mLpZmV9-cH4ONk-9C-T9jhIoyXbEa9_qgoDCz_zn3sOSsP36-OGTRxtoUc06-29uDkV5X7F-GvPh7LvuCPztVgJcb0EHU0KeQ9dcvRYMihBRiVJHfSPzWYA</recordid><startdate>20130529</startdate><enddate>20130529</enddate><creator>Ross, Michael G</creator><creator>Russ, Carsten</creator><creator>Costello, Maura</creator><creator>Hollinger, Andrew</creator><creator>Lennon, Niall J</creator><creator>Hegarty, Ryan</creator><creator>Nusbaum, Chad</creator><creator>Jaffe, 
David B</creator><general>Springer-Verlag</general><general>BioMed Central Ltd</general><general>BioMed Central</general><scope>FBQ</scope><scope>CGR</scope><scope>CUY</scope><scope>CVF</scope><scope>ECM</scope><scope>EIF</scope><scope>NPM</scope><scope>AAYXX</scope><scope>CITATION</scope><scope>KPI</scope><scope>IAO</scope><scope>7TM</scope><scope>8FD</scope><scope>FR3</scope><scope>P64</scope><scope>RC3</scope><scope>7X8</scope><scope>7S9</scope><scope>L.6</scope><scope>5PM</scope></search><sort><creationdate>20130529</creationdate><title>Characterizing and measuring bias in sequence data</title><author>Ross, Michael G ; Russ, Carsten ; Costello, Maura ; Hollinger, Andrew ; Lennon, Niall J ; Hegarty, Ryan ; Nusbaum, Chad ; Jaffe, David B</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-b747t-45a4dc4f327c9ee13a89d8398b49f664ba236caca6d55da38cacdadb3043ab73</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2013</creationdate><topic>Algorithms</topic><topic>Base Composition</topic><topic>DNA sequencing</topic><topic>genome</topic><topic>Genome, Bacterial</topic><topic>Genome, Human</topic><topic>Genome, Protozoan</topic><topic>Genomes</topic><topic>Genomics</topic><topic>Genomics - methods</topic><topic>Humans</topic><topic>instrumentation</topic><topic>loci</topic><topic>Methods</topic><topic>microorganisms</topic><topic>Nucleotide sequencing</topic><topic>Promoter Regions, Genetic</topic><topic>sequence analysis</topic><topic>Sequence Analysis, DNA</topic><topic>Sequence Analysis, DNA - instrumentation</topic><topic>Sequence Analysis, DNA - methods</topic><toplevel>peer_reviewed</toplevel><toplevel>online_resources</toplevel><creatorcontrib>Ross, Michael G</creatorcontrib><creatorcontrib>Russ, Carsten</creatorcontrib><creatorcontrib>Costello, Maura</creatorcontrib><creatorcontrib>Hollinger, Andrew</creatorcontrib><creatorcontrib>Lennon, Niall 
J</creatorcontrib><creatorcontrib>Hegarty, Ryan</creatorcontrib><creatorcontrib>Nusbaum, Chad</creatorcontrib><creatorcontrib>Jaffe, David B</creatorcontrib><collection>AGRIS</collection><collection>Medline</collection><collection>MEDLINE</collection><collection>MEDLINE (Ovid)</collection><collection>MEDLINE</collection><collection>MEDLINE</collection><collection>PubMed</collection><collection>CrossRef</collection><collection>Gale In Context: Global Issues</collection><collection>Gale Academic OneFile</collection><collection>Nucleic Acids Abstracts</collection><collection>Technology Research Database</collection><collection>Engineering Research Database</collection><collection>Biotechnology and BioEngineering Abstracts</collection><collection>Genetics Abstracts</collection><collection>MEDLINE - Academic</collection><collection>AGRICOLA</collection><collection>AGRICOLA - Academic</collection><collection>PubMed Central (Full Participant titles)</collection><jtitle>Genome Biology (Online Edition)</jtitle></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>Ross, Michael G</au><au>Russ, Carsten</au><au>Costello, Maura</au><au>Hollinger, Andrew</au><au>Lennon, Niall J</au><au>Hegarty, Ryan</au><au>Nusbaum, Chad</au><au>Jaffe, David B</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>Characterizing and measuring bias in sequence data</atitle><jtitle>Genome Biology (Online Edition)</jtitle><addtitle>Genome Biol</addtitle><date>2013-05-29</date><risdate>2013</risdate><volume>14</volume><issue>5</issue><spage>R51</spage><epage>R51</epage><pages>R51-R51</pages><artnum>R51</artnum><issn>1465-6906</issn><issn>1474-760X</issn><eissn>1465-6914</eissn><eissn>1474-760X</eissn><abstract>BACKGROUND: DNA sequencing technologies deviate from the ideal uniform distribution of reads. These biases impair scientific and medical applications. 
Accordingly, we have developed computational methods for discovering, describing and measuring bias. RESULTS: We applied these methods to the Illumina, Ion Torrent, Pacific Biosciences and Complete Genomics sequencing platforms, using data from human and from a set of microbes with diverse base compositions. As in previous work, library construction conditions significantly influence sequencing bias. Pacific Biosciences coverage levels are the least biased, followed by Illumina, although all technologies exhibit error-rate biases in high- and low-GC regions and at long homopolymer runs. The GC-rich regions prone to low coverage include a number of human promoters, so we therefore catalog 1,000 that were exceptionally resistant to sequencing. Our results indicate that combining data from two technologies can reduce coverage bias if the biases in the component technologies are complementary and of similar magnitude. Analysis of Illumina data representing 120-fold coverage of a well-studied human sample reveals that 0.20% of the autosomal genome was covered at less than 10% of the genome-wide average. Excluding locations that were similar to known bias motifs or likely due to sample-reference variations left only 0.045% of the autosomal genome with unexplained poor coverage. CONCLUSIONS: The assays presented in this paper provide a comprehensive view of sequencing bias, which can be used to drive laboratory improvements and to monitor production processes. Development guided by these assays should result in improved genome assemblies and better coverage of biologically important loci.</abstract><cop>England</cop><pub>Springer-Verlag</pub><pmid>23718773</pmid><doi>10.1186/gb-2013-14-5-r51</doi><tpages>1</tpages><oa>free_for_read</oa></addata></record>
fulltext fulltext
identifier ISSN: 1465-6906
ispartof Genome Biology (Online Edition), 2013-05, Vol.14 (5), p.R51-R51, Article R51
issn 1465-6906
1474-760X
1465-6914
1474-760X
language eng
recordid cdi_pubmedcentral_primary_oai_pubmedcentral_nih_gov_4053816
source MEDLINE; DOAJ Directory of Open Access Journals; Elektronische Zeitschriftenbibliothek - Frei zugängliche E-Journals; Springer Nature OA Free Journals; Springer Nature - Complete Springer Journals; PubMed Central
subjects Algorithms
Base Composition
DNA sequencing
genome
Genome, Bacterial
Genome, Human
Genome, Protozoan
Genomes
Genomics
Genomics - methods
Humans
instrumentation
loci
Methods
microorganisms
Nucleotide sequencing
Promoter Regions, Genetic
sequence analysis
Sequence Analysis, DNA
Sequence Analysis, DNA - instrumentation
Sequence Analysis, DNA - methods
title Characterizing and measuring bias in sequence data
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-23T10%3A29%3A24IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-gale_pubme&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Characterizing%20and%20measuring%20bias%20in%20sequence%20data&rft.jtitle=Genome%20Biology%20(Online%20Edition)&rft.au=Ross,%20Michael%20G&rft.date=2013-05-29&rft.volume=14&rft.issue=5&rft.spage=R51&rft.epage=R51&rft.pages=R51-R51&rft.artnum=R51&rft.issn=1465-6906&rft.eissn=1465-6914&rft_id=info:doi/10.1186/gb-2013-14-5-r51&rft_dat=%3Cgale_pubme%3EA534778677%3C/gale_pubme%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=1419367208&rft_id=info:pmid/23718773&rft_galeid=A534778677&rfr_iscdi=true
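
The url field above is an OpenURL (ctx_ver=Z39.88-2004) that carries the citation as query parameters: the rft.* keys describe the article and the rft_id keys hold identifiers such as the DOI and PMID. As a minimal sketch, assuming only Python's standard urllib.parse module, these fields can be read back out of the link. The function name citation_fields and the shortened OPENURL constant are illustrative and not part of the record; the constant keeps only a subset of the original parameters.

```python
from urllib.parse import urlsplit, parse_qs

# Shortened copy of the record's url field: only a subset of its parameters is kept.
OPENURL = (
    "https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004"
    "&rft.genre=article"
    "&rft.atitle=Characterizing%20and%20measuring%20bias%20in%20sequence%20data"
    "&rft.jtitle=Genome%20Biology%20(Online%20Edition)"
    "&rft.au=Ross,%20Michael%20G"
    "&rft.date=2013-05-29&rft.volume=14&rft.issue=5&rft.spage=R51"
    "&rft.issn=1465-6906"
    "&rft_id=info:doi/10.1186/gb-2013-14-5-r51"
    "&rft_id=info:pmid/23718773"
)


def citation_fields(url):
    """Return (fields, ids): the rft.* citation fields and the rft_id identifiers."""
    # parse_qs percent-decodes values and collects repeated keys into lists.
    params = parse_qs(urlsplit(url).query)

    # Keep the first value of every rft.* key, with the "rft." prefix removed.
    fields = {
        key[len("rft."):]: values[0]
        for key, values in params.items()
        if key.startswith("rft.")
    }

    # rft_id repeats; each value looks like "info:<scheme>/<identifier>".
    ids = {}
    for value in params.get("rft_id", []):
        if value.startswith("info:") and "/" in value:
            scheme, _, identifier = value[len("info:"):].partition("/")
            ids[scheme] = identifier
    return fields, ids


fields, ids = citation_fields(OPENURL)
print(fields["atitle"])
print(ids["doi"], ids["pmid"])
```

Splitting on the first "/" only is deliberate, because a DOI itself contains a slash. The full link in the record carries further keys (for example rfr_id and ctx_enc) that this sketch simply ignores.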