Characterizing and measuring bias in sequence data

BACKGROUND: DNA sequencing technologies deviate from the ideal uniform distribution of reads. These biases impair scientific and medical applications. Accordingly, we have developed computational methods for discovering, describing and measuring bias. RESULTS: We applied these methods to the Illumin...

Bibliographic details
Published in:	Genome Biology (Online Edition) 2013-05, Vol.14 (5), p.R51-R51, Article R51
Main authors:	Ross, Michael G; Russ, Carsten; Costello, Maura; Hollinger, Andrew; Lennon, Niall J; Hegarty, Ryan; Nusbaum, Chad; Jaffe, David B
Format:	Article
Language:	eng
Subjects:	Algorithms; Base Composition; DNA sequencing; genome; Genome, Bacterial; Genome, Human; Genome, Protozoan; Genomes; Genomics; Genomics - methods; Humans; instrumentation; loci; Methods; microorganisms; Nucleotide sequencing; Promoter Regions, Genetic; sequence analysis; Sequence Analysis, DNA; Sequence Analysis, DNA - instrumentation; Sequence Analysis, DNA - methods
Online access:	Full text

container_end_page	R51
container_issue	5
container_start_page	R51
container_title	Genome Biology (Online Edition)
container_volume	14
creator	Ross, Michael G; Russ, Carsten; Costello, Maura; Hollinger, Andrew; Lennon, Niall J; Hegarty, Ryan; Nusbaum, Chad; Jaffe, David B
description	BACKGROUND: DNA sequencing technologies deviate from the ideal uniform distribution of reads. These biases impair scientific and medical applications. Accordingly, we have developed computational methods for discovering, describing and measuring bias. RESULTS: We applied these methods to the Illumina, Ion Torrent, Pacific Biosciences and Complete Genomics sequencing platforms, using data from human and from a set of microbes with diverse base compositions. As in previous work, library construction conditions significantly influence sequencing bias. Pacific Biosciences coverage levels are the least biased, followed by Illumina, although all technologies exhibit error-rate biases in high- and low-GC regions and at long homopolymer runs. The GC-rich regions prone to low coverage include a number of human promoters, so we therefore catalog 1,000 that were exceptionally resistant to sequencing. Our results indicate that combining data from two technologies can reduce coverage bias if the biases in the component technologies are complementary and of similar magnitude. Analysis of Illumina data representing 120-fold coverage of a well-studied human sample reveals that 0.20% of the autosomal genome was covered at less than 10% of the genome-wide average. Excluding locations that were similar to known bias motifs or likely due to sample-reference variations left only 0.045% of the autosomal genome with unexplained poor coverage. CONCLUSIONS: The assays presented in this paper provide a comprehensive view of sequencing bias, which can be used to drive laboratory improvements and to monitor production processes. Development guided by these assays should result in improved genome assemblies and better coverage of biologically important loci.
doi_str_mv	10.1186/gb-2013-14-5-r51
format	Article
fullrecord	<record><control><sourceid>gale_pubme</sourceid><recordid>TN_cdi_pubmedcentral_primary_oai_pubmedcentral_nih_gov_4053816</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><galeid>A534778677</galeid><sourcerecordid>A534778677</sourcerecordid><originalsourceid>FETCH-LOGICAL-b747t-45a4dc4f327c9ee13a89d8398b49f664ba236caca6d55da38cacdadb3043ab73</originalsourceid><addsrcrecordid>eNqNks1rFjEQxhdRbK3ePeke9bA12XxfhPJStVhQsJ7D5GO3kd1NTXZF-9ebZeuLLyg0l0wyv3kYnpmqeo7RKcaSv-lN0yJMGkwb1iSGH1THmHLWcIXpw32M-FH1JOdvCGEhOX9cHbVEYCkEOa7a3TUksLNP4TZMfQ2Tq0cPeUnrywTIdZjq7L8vfrK-djDD0-pRB0P2z-7uk-rq3fnV7kNz-en9xe7ssjGCirmhDKiztCOtsMp7TEAqJ4mShqqOc2qgJdyCBe4Yc0BkiR04QxAlYAQ5qd5usjeLGb2zfpoTDPomhRHSLx0h6MPMFK51H39oihiRmBeB3SZgQvyPwGHGxlH3Rq-Oakw108XRovLqro0Uiwl51mPI1g8DTD4uWWNVDlVM3gNlVHJCWnUPlGJFuGiRLOjphvYweB2mLpZmV9-cH4ONk-9C-T9jhIoyXbEa9_qgoDCz_zn3sOSsP36-OGTRxtoUc06-29uDkV5X7F-GvPh7LvuCPztVgJcb0EHU0KeQ9dcvRYMihBRiVJHfSPzWYA</addsrcrecordid><sourcetype>Open Access Repository</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>1419367208</pqid></control><display><type>article</type><title>Characterizing and measuring bias in sequence data</title><source>MEDLINE</source><source>DOAJ Directory of Open Access Journals</source><source>Elektronische Zeitschriftenbibliothek - Frei zugängliche E-Journals</source><source>Springer Nature OA Free Journals</source><source>Springer Nature - Complete Springer Journals</source><source>PubMed Central</source><creator>Ross, Michael G ; Russ, Carsten ; Costello, Maura ; Hollinger, Andrew ; Lennon, Niall J ; Hegarty, Ryan ; Nusbaum, Chad ; Jaffe, David B</creator><creatorcontrib>Ross, Michael G ; Russ, Carsten ; Costello, Maura ; Hollinger, Andrew ; Lennon, Niall J ; Hegarty, Ryan ; Nusbaum, Chad ; Jaffe, David B</creatorcontrib><description>BACKGROUND: DNA sequencing technologies deviate from the ideal uniform distribution of reads. These biases impair scientific and medical applications. 
Accordingly, we have developed computational methods for discovering, describing and measuring bias. RESULTS: We applied these methods to the Illumina, Ion Torrent, Pacific Biosciences and Complete Genomics sequencing platforms, using data from human and from a set of microbes with diverse base compositions. As in previous work, library construction conditions significantly influence sequencing bias. Pacific Biosciences coverage levels are the least biased, followed by Illumina, although all technologies exhibit error-rate biases in high- and low-GC regions and at long homopolymer runs. The GC-rich regions prone to low coverage include a number of human promoters, so we therefore catalog 1,000 that were exceptionally resistant to sequencing. Our results indicate that combining data from two technologies can reduce coverage bias if the biases in the component technologies are complementary and of similar magnitude. Analysis of Illumina data representing 120-fold coverage of a well-studied human sample reveals that 0.20% of the autosomal genome was covered at less than 10% of the genome-wide average. Excluding locations that were similar to known bias motifs or likely due to sample-reference variations left only 0.045% of the autosomal genome with unexplained poor coverage. CONCLUSIONS: The assays presented in this paper provide a comprehensive view of sequencing bias, which can be used to drive laboratory improvements and to monitor production processes. 
Development guided by these assays should result in improved genome assemblies and better coverage of biologically important loci.</description><identifier>ISSN: 1465-6906</identifier><identifier>ISSN: 1474-760X</identifier><identifier>EISSN: 1465-6914</identifier><identifier>EISSN: 1474-760X</identifier><identifier>DOI: 10.1186/gb-2013-14-5-r51</identifier><identifier>PMID: 23718773</identifier><language>eng</language><publisher>England: Springer-Verlag</publisher><subject>Algorithms ; Base Composition ; DNA sequencing ; genome ; Genome, Bacterial ; Genome, Human ; Genome, Protozoan ; Genomes ; Genomics ; Genomics - methods ; Humans ; instrumentation ; loci ; Methods ; microorganisms ; Nucleotide sequencing ; Promoter Regions, Genetic ; sequence analysis ; Sequence Analysis, DNA ; Sequence Analysis, DNA - instrumentation ; Sequence Analysis, DNA - methods</subject><ispartof>Genome Biology (Online Edition), 2013-05, Vol.14 (5), p.R51-R51, Article R51</ispartof><rights>COPYRIGHT 2013 BioMed Central Ltd.</rights><rights>Copyright © 2013 Ross et al.; licensee BioMed Central Ltd. 
2013 Ross et al.; licensee BioMed Central Ltd.</rights><lds50>peer_reviewed</lds50><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed><citedby>FETCH-LOGICAL-b747t-45a4dc4f327c9ee13a89d8398b49f664ba236caca6d55da38cacdadb3043ab73</citedby><cites>FETCH-LOGICAL-b747t-45a4dc4f327c9ee13a89d8398b49f664ba236caca6d55da38cacdadb3043ab73</cites></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><linktopdf>$$Uhttps://www.ncbi.nlm.nih.gov/pmc/articles/PMC4053816/pdf/$$EPDF$$P50$$Gpubmedcentral$$Hfree_for_read</linktopdf><linktohtml>$$Uhttps://www.ncbi.nlm.nih.gov/pmc/articles/PMC4053816/$$EHTML$$P50$$Gpubmedcentral$$Hfree_for_read</linktohtml><link.rule.ids>230,314,723,776,780,860,881,27903,27904,53770,53772</link.rule.ids><backlink>$$Uhttps://www.ncbi.nlm.nih.gov/pubmed/23718773$$D View this record in MEDLINE/PubMed$$Hfree_for_read</backlink></links><search><creatorcontrib>Ross, Michael G</creatorcontrib><creatorcontrib>Russ, Carsten</creatorcontrib><creatorcontrib>Costello, Maura</creatorcontrib><creatorcontrib>Hollinger, Andrew</creatorcontrib><creatorcontrib>Lennon, Niall J</creatorcontrib><creatorcontrib>Hegarty, Ryan</creatorcontrib><creatorcontrib>Nusbaum, Chad</creatorcontrib><creatorcontrib>Jaffe, David B</creatorcontrib><title>Characterizing and measuring bias in sequence data</title><title>Genome Biology (Online Edition)</title><addtitle>Genome Biol</addtitle><description>BACKGROUND: DNA sequencing technologies deviate from the ideal uniform distribution of reads. These biases impair scientific and medical applications. Accordingly, we have developed computational methods for discovering, describing and measuring bias. RESULTS: We applied these methods to the Illumina, Ion Torrent, Pacific Biosciences and Complete Genomics sequencing platforms, using data from human and from a set of microbes with diverse base compositions. 
As in previous work, library construction conditions significantly influence sequencing bias. Pacific Biosciences coverage levels are the least biased, followed by Illumina, although all technologies exhibit error-rate biases in high- and low-GC regions and at long homopolymer runs. The GC-rich regions prone to low coverage include a number of human promoters, so we therefore catalog 1,000 that were exceptionally resistant to sequencing. Our results indicate that combining data from two technologies can reduce coverage bias if the biases in the component technologies are complementary and of similar magnitude. Analysis of Illumina data representing 120-fold coverage of a well-studied human sample reveals that 0.20% of the autosomal genome was covered at less than 10% of the genome-wide average. Excluding locations that were similar to known bias motifs or likely due to sample-reference variations left only 0.045% of the autosomal genome with unexplained poor coverage. CONCLUSIONS: The assays presented in this paper provide a comprehensive view of sequencing bias, which can be used to drive laboratory improvements and to monitor production processes. 
Development guided by these assays should result in improved genome assemblies and better coverage of biologically important loci.</description><subject>Algorithms</subject><subject>Base Composition</subject><subject>DNA sequencing</subject><subject>genome</subject><subject>Genome, Bacterial</subject><subject>Genome, Human</subject><subject>Genome, Protozoan</subject><subject>Genomes</subject><subject>Genomics</subject><subject>Genomics - methods</subject><subject>Humans</subject><subject>instrumentation</subject><subject>loci</subject><subject>Methods</subject><subject>microorganisms</subject><subject>Nucleotide sequencing</subject><subject>Promoter Regions, Genetic</subject><subject>sequence analysis</subject><subject>Sequence Analysis, DNA</subject><subject>Sequence Analysis, DNA - instrumentation</subject><subject>Sequence Analysis, DNA - methods</subject><issn>1465-6906</issn><issn>1474-760X</issn><issn>1465-6914</issn><issn>1474-760X</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2013</creationdate><recordtype>article</recordtype><sourceid>EIF</sourceid><sourceid>KPI</sourceid><recordid>eNqNks1rFjEQxhdRbK3ePeke9bA12XxfhPJStVhQsJ7D5GO3kd1NTXZF-9ebZeuLLyg0l0wyv3kYnpmqeo7RKcaSv-lN0yJMGkwb1iSGH1THmHLWcIXpw32M-FH1JOdvCGEhOX9cHbVEYCkEOa7a3TUksLNP4TZMfQ2Tq0cPeUnrywTIdZjq7L8vfrK-djDD0-pRB0P2z-7uk-rq3fnV7kNz-en9xe7ssjGCirmhDKiztCOtsMp7TEAqJ4mShqqOc2qgJdyCBe4Yc0BkiR04QxAlYAQ5qd5usjeLGb2zfpoTDPomhRHSLx0h6MPMFK51H39oihiRmBeB3SZgQvyPwGHGxlH3Rq-Oakw108XRovLqro0Uiwl51mPI1g8DTD4uWWNVDlVM3gNlVHJCWnUPlGJFuGiRLOjphvYweB2mLpZmV9-cH4ONk-9C-T9jhIoyXbEa9_qgoDCz_zn3sOSsP36-OGTRxtoUc06-29uDkV5X7F-GvPh7LvuCPztVgJcb0EHU0KeQ9dcvRYMihBRiVJHfSPzWYA</recordid><startdate>20130529</startdate><enddate>20130529</enddate><creator>Ross, Michael G</creator><creator>Russ, Carsten</creator><creator>Costello, Maura</creator><creator>Hollinger, Andrew</creator><creator>Lennon, Niall J</creator><creator>Hegarty, Ryan</creator><creator>Nusbaum, Chad</creator><creator>Jaffe, 
David B</creator><general>Springer-Verlag</general><general>BioMed Central Ltd</general><general>BioMed Central</general><scope>FBQ</scope><scope>CGR</scope><scope>CUY</scope><scope>CVF</scope><scope>ECM</scope><scope>EIF</scope><scope>NPM</scope><scope>AAYXX</scope><scope>CITATION</scope><scope>KPI</scope><scope>IAO</scope><scope>7TM</scope><scope>8FD</scope><scope>FR3</scope><scope>P64</scope><scope>RC3</scope><scope>7X8</scope><scope>7S9</scope><scope>L.6</scope><scope>5PM</scope></search><sort><creationdate>20130529</creationdate><title>Characterizing and measuring bias in sequence data</title><author>Ross, Michael G ; Russ, Carsten ; Costello, Maura ; Hollinger, Andrew ; Lennon, Niall J ; Hegarty, Ryan ; Nusbaum, Chad ; Jaffe, David B</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-b747t-45a4dc4f327c9ee13a89d8398b49f664ba236caca6d55da38cacdadb3043ab73</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2013</creationdate><topic>Algorithms</topic><topic>Base Composition</topic><topic>DNA sequencing</topic><topic>genome</topic><topic>Genome, Bacterial</topic><topic>Genome, Human</topic><topic>Genome, Protozoan</topic><topic>Genomes</topic><topic>Genomics</topic><topic>Genomics - methods</topic><topic>Humans</topic><topic>instrumentation</topic><topic>loci</topic><topic>Methods</topic><topic>microorganisms</topic><topic>Nucleotide sequencing</topic><topic>Promoter Regions, Genetic</topic><topic>sequence analysis</topic><topic>Sequence Analysis, DNA</topic><topic>Sequence Analysis, DNA - instrumentation</topic><topic>Sequence Analysis, DNA - methods</topic><toplevel>peer_reviewed</toplevel><toplevel>online_resources</toplevel><creatorcontrib>Ross, Michael G</creatorcontrib><creatorcontrib>Russ, Carsten</creatorcontrib><creatorcontrib>Costello, Maura</creatorcontrib><creatorcontrib>Hollinger, Andrew</creatorcontrib><creatorcontrib>Lennon, Niall 
J</creatorcontrib><creatorcontrib>Hegarty, Ryan</creatorcontrib><creatorcontrib>Nusbaum, Chad</creatorcontrib><creatorcontrib>Jaffe, David B</creatorcontrib><collection>AGRIS</collection><collection>Medline</collection><collection>MEDLINE</collection><collection>MEDLINE (Ovid)</collection><collection>MEDLINE</collection><collection>MEDLINE</collection><collection>PubMed</collection><collection>CrossRef</collection><collection>Gale In Context: Global Issues</collection><collection>Gale Academic OneFile</collection><collection>Nucleic Acids Abstracts</collection><collection>Technology Research Database</collection><collection>Engineering Research Database</collection><collection>Biotechnology and BioEngineering Abstracts</collection><collection>Genetics Abstracts</collection><collection>MEDLINE - Academic</collection><collection>AGRICOLA</collection><collection>AGRICOLA - Academic</collection><collection>PubMed Central (Full Participant titles)</collection><jtitle>Genome Biology (Online Edition)</jtitle></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>Ross, Michael G</au><au>Russ, Carsten</au><au>Costello, Maura</au><au>Hollinger, Andrew</au><au>Lennon, Niall J</au><au>Hegarty, Ryan</au><au>Nusbaum, Chad</au><au>Jaffe, David B</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>Characterizing and measuring bias in sequence data</atitle><jtitle>Genome Biology (Online Edition)</jtitle><addtitle>Genome Biol</addtitle><date>2013-05-29</date><risdate>2013</risdate><volume>14</volume><issue>5</issue><spage>R51</spage><epage>R51</epage><pages>R51-R51</pages><artnum>R51</artnum><issn>1465-6906</issn><issn>1474-760X</issn><eissn>1465-6914</eissn><eissn>1474-760X</eissn><abstract>BACKGROUND: DNA sequencing technologies deviate from the ideal uniform distribution of reads. These biases impair scientific and medical applications. 
Accordingly, we have developed computational methods for discovering, describing and measuring bias. RESULTS: We applied these methods to the Illumina, Ion Torrent, Pacific Biosciences and Complete Genomics sequencing platforms, using data from human and from a set of microbes with diverse base compositions. As in previous work, library construction conditions significantly influence sequencing bias. Pacific Biosciences coverage levels are the least biased, followed by Illumina, although all technologies exhibit error-rate biases in high- and low-GC regions and at long homopolymer runs. The GC-rich regions prone to low coverage include a number of human promoters, so we therefore catalog 1,000 that were exceptionally resistant to sequencing. Our results indicate that combining data from two technologies can reduce coverage bias if the biases in the component technologies are complementary and of similar magnitude. Analysis of Illumina data representing 120-fold coverage of a well-studied human sample reveals that 0.20% of the autosomal genome was covered at less than 10% of the genome-wide average. Excluding locations that were similar to known bias motifs or likely due to sample-reference variations left only 0.045% of the autosomal genome with unexplained poor coverage. CONCLUSIONS: The assays presented in this paper provide a comprehensive view of sequencing bias, which can be used to drive laboratory improvements and to monitor production processes. Development guided by these assays should result in improved genome assemblies and better coverage of biologically important loci.</abstract><cop>England</cop><pub>Springer-Verlag</pub><pmid>23718773</pmid><doi>10.1186/gb-2013-14-5-r51</doi><tpages>1</tpages><oa>free_for_read</oa></addata></record>
fulltext	fulltext
identifier	ISSN: 1465-6906
ispartof	Genome Biology (Online Edition), 2013-05, Vol.14 (5), p.R51-R51, Article R51
issn	1465-6906 1474-760X 1465-6914 1474-760X
language	eng
recordid	cdi_pubmedcentral_primary_oai_pubmedcentral_nih_gov_4053816
source	MEDLINE; DOAJ Directory of Open Access Journals; Elektronische Zeitschriftenbibliothek - Frei zugängliche E-Journals; Springer Nature OA Free Journals; Springer Nature - Complete Springer Journals; PubMed Central
subjects	Algorithms; Base Composition; DNA sequencing; genome; Genome, Bacterial; Genome, Human; Genome, Protozoan; Genomes; Genomics; Genomics - methods; Humans; instrumentation; loci; Methods; microorganisms; Nucleotide sequencing; Promoter Regions, Genetic; sequence analysis; Sequence Analysis, DNA; Sequence Analysis, DNA - instrumentation; Sequence Analysis, DNA - methods
title	Characterizing and measuring bias in sequence data
url	https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-23T10%3A29%3A24IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-gale_pubme&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Characterizing%20and%20measuring%20bias%20in%20sequence%20data&rft.jtitle=Genome%20Biology%20(Online%20Edition)&rft.au=Ross,%20Michael%20G&rft.date=2013-05-29&rft.volume=14&rft.issue=5&rft.spage=R51&rft.epage=R51&rft.pages=R51-R51&rft.artnum=R51&rft.issn=1465-6906&rft.eissn=1465-6914&rft_id=info:doi/10.1186/gb-2013-14-5-r51&rft_dat=%3Cgale_pubme%3EA534778677%3C/gale_pubme%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=1419367208&rft_id=info:pmid/23718773&rft_galeid=A534778677&rfr_iscdi=true
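
The abstract in this record reports the share of the autosomal genome covered at less than 10% of the genome-wide average. As a rough illustration of that kind of statistic, the sketch below computes the fraction of positions whose depth falls below a chosen fraction of the mean depth. It is not the authors' pipeline: the function name `low_coverage_fraction`, the `threshold` parameter, and the toy depth values are all hypothetical and chosen only for this example.

```python
from typing import Sequence


def low_coverage_fraction(depths: Sequence[float], threshold: float = 0.10) -> float:
    """Return the fraction of positions with depth below threshold * mean depth.

    Hypothetical helper for illustration; per-position depths are assumed
    to be available as a plain sequence of numbers.
    """
    if not depths:
        raise ValueError("depths must not be empty")
    mean_depth = sum(depths) / len(depths)
    cutoff = threshold * mean_depth
    # Count positions strictly below the cutoff.
    poorly_covered = sum(1 for depth in depths if depth < cutoff)
    return poorly_covered / len(depths)


# Toy data (made up): 98 well-covered positions and 2 poorly covered ones.
toy_depths = [100] * 98 + [5, 0]
fraction = low_coverage_fraction(toy_depths)
print(f"{fraction:.2%} of positions fall below 10% of the mean depth")
```

On real data the depths would typically come from an alignment-derived per-base coverage track, and positions would likely need filtering (for example, to autosomes only) before the fraction is comparable to the figures quoted in the abstract.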