Consensus Genotyper for Exome Sequencing (CGES): improving the quality of exome variant genotypes

The development of cost-effective next-generation sequencing methods has spurred the development of high-throughput bioinformatics tools for detection of sequence variation. With many disparate variant-calling algorithms available, investigators must ask, 'Which method is best for my data?'...

Bibliographic details
Published in:	Bioinformatics (Oxford, England), 2015-01, Vol.31 (2), p.187-193
Main authors:	Trubetskoy, Vassily; Rodriguez, Alex; Dave, Uptal; Campbell, Nicholas; Crawford, Emily L; Cook, Edwin H; Sutcliffe, James S; Foster, Ian; Madduri, Ravi; Cox, Nancy J; Davis, Lea K
Format:	Article
Language:	eng
Subjects:	Algorithms; Autistic Disorder - genetics; Consensus Sequence; Data Interpretation, Statistical; Exome - genetics; Genetic Testing; Genotype; High-Throughput Nucleotide Sequencing - methods; Humans; Original Papers; Polymorphism, Single Nucleotide - genetics; Software
Online access:	Full text

container_end_page	193
container_issue	2
container_start_page	187
container_title	Bioinformatics (Oxford, England)
container_volume	31
creator	Trubetskoy, Vassily; Rodriguez, Alex; Dave, Uptal; Campbell, Nicholas; Crawford, Emily L; Cook, Edwin H; Sutcliffe, James S; Foster, Ian; Madduri, Ravi; Cox, Nancy J; Davis, Lea K
description	The development of cost-effective next-generation sequencing methods has spurred the development of high-throughput bioinformatics tools for detection of sequence variation. With many disparate variant-calling algorithms available, investigators must ask, 'Which method is best for my data?' Machine learning research has shown that so-called ensemble methods that combine the output of multiple models can dramatically improve classifier performance. Here we describe a novel variant-calling approach based on an ensemble of variant-calling algorithms, which we term the Consensus Genotyper for Exome Sequencing (CGES). CGES uses a two-stage voting scheme among four algorithm implementations. While our ensemble method can accept variants generated by any variant-calling algorithm, we used GATK2.8, SAMtools, FreeBayes and Atlas-SNP2 in building CGES because of their performance, widespread adoption and diverse but complementary algorithms. We apply CGES to 132 samples sequenced at the Hudson Alpha Institute for Biotechnology (HAIB, Huntsville, AL) using the Nimblegen Exome Capture and Illumina sequencing technology. Our sample set consisted of 40 complete trios, two families of four, one parent-child duo and two unrelated individuals. CGES yielded the fewest total variant calls (N(CGES) = 139 897), the highest Ts/Tv ratio (3.02), the lowest Mendelian error rate across all genotypes (0.028%), the highest rediscovery rate from the Exome Variant Server (EVS; 89.3%) and 1000 Genomes (1KG; 84.1%) and the highest positive predictive value (PPV; 96.1%) for a random sample of previously validated de novo variants. We describe these and other quality control (QC) metrics from consensus data and explain how the CGES pipeline can be used to generate call sets of varying quality stringency, including consensus calls present across all four algorithms, calls that are consistent across any three out of four algorithms, calls that are consistent across any two out of four algorithms or a more liberal set of all calls made by any algorithm. To enable accessible, efficient and reproducible analysis, we implement CGES both as a stand-alone command line tool available for download in GitHub and as a set of Galaxy tools and workflows configured to execute on parallel computers. Supplementary data are available at Bioinformatics online.
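The stringency tiers named in the description (four of four, any three, any two, or any single caller) can be illustrated with a minimal sketch. This is not the CGES implementation: the record does not specify the two-stage voting scheme, so the code below only counts, per variant site, how many callers reported it. The function name `consensus_tiers` and the toy call sets are hypothetical.

```python
from collections import Counter

def consensus_tiers(calls_by_caller):
    """Group variant sites by how many callers reported them.

    calls_by_caller maps a caller name to a set of (chrom, pos, ref, alt)
    tuples. Returns a dict mapping a minimum-support level k to the set of
    sites reported by at least k callers.
    """
    support = Counter()
    for sites in calls_by_caller.values():
        support.update(set(sites))  # each caller votes at most once per site
    n_callers = len(calls_by_caller)
    return {
        k: {site for site, count in support.items() if count >= k}
        for k in range(1, n_callers + 1)
    }

# Toy call sets (hypothetical), not results from the paper.
calls = {
    "GATK": {("1", 100, "A", "G"), ("1", 200, "C", "T"), ("2", 50, "G", "A")},
    "SAMtools": {("1", 100, "A", "G"), ("1", 200, "C", "T")},
    "FreeBayes": {("1", 100, "A", "G"), ("2", 50, "G", "A"), ("3", 7, "T", "C")},
    "Atlas-SNP2": {("1", 100, "A", "G"), ("1", 200, "C", "T")},
}
tiers = consensus_tiers(calls)
```

Here `tiers[4]` corresponds to the strict consensus set and `tiers[1]` to the liberal union of all calls; each stricter tier is a subset of the looser ones.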
doi_str_mv	10.1093/bioinformatics/btu591
format	Article
fullrecord	<record><control><sourceid>proquest_pubme</sourceid><recordid>TN_cdi_pubmedcentral_primary_oai_pubmedcentral_nih_gov_4287941</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><sourcerecordid>1645230657</sourcerecordid><originalsourceid>FETCH-LOGICAL-c411t-ff294fe46ba13c173c57bfe0b916926d15b2913469a4c1803ab7f119c4f1b6e43</originalsourceid><addsrcrecordid>eNpVkU1PGzEQhq2qFV_lJ4B8hEPAs_7YmEOlKgqhElIPtGfLNnYw2rWD7Y2af89C0ghOM5qZ95kZvQidAbkCIum1CSlEn3Kva7Dl2tSBS_iCjoCKdsKmAF_3OaGH6LiUZ0IIJ1wcoMOGNy0RdHqE9CzF4mIZCl64mOpm5TIesXj-L_UOP7iXwUUb4hJfzBbzh8sbHPpVTuu3Sn1y-GXQXagbnDx274q1zkHHipc7WvmOvnndFXe6iyfo7-38z-xucv978Wv2835iGUCdeN9I5h0TRgO10FLLW-MdMRKEbMQjcNNIoExIzSyMP2nTegBpmQcjHKMn6MeWuxpM7x6tizXrTq1y6HXeqKSD-tyJ4Ukt01qxZtpKBiPgYgfIafy6VNWHYl3X6ejSUBQIxhtKBG_HUb4dtTmVkp3frwGi3uxRn-1RW3tG3fnHG_eq_37QV3XJk6s</addsrcrecordid><sourcetype>Open Access Repository</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>1645230657</pqid></control><display><type>article</type><title>Consensus Genotyper for Exome Sequencing (CGES): improving the quality of exome variant genotypes</title><source>Oxford Journals Open Access Collection</source><source>MEDLINE</source><source>EZB-FREE-00999 freely available EZB journals</source><source>PubMed Central</source><source>Alma/SFX Local Collection</source><creator>Trubetskoy, Vassily ; Rodriguez, Alex ; Dave, Uptal ; Campbell, Nicholas ; Crawford, Emily L ; Cook, Edwin H ; Sutcliffe, James S ; Foster, Ian ; Madduri, Ravi ; Cox, Nancy J ; Davis, Lea K</creator><creatorcontrib>Trubetskoy, Vassily ; Rodriguez, Alex ; Dave, Uptal ; Campbell, Nicholas ; Crawford, Emily L ; Cook, Edwin H ; Sutcliffe, James S ; Foster, Ian ; Madduri, Ravi ; Cox, Nancy J ; Davis, Lea K</creatorcontrib><description>The development of cost-effective next-generation sequencing methods has spurred the development of high-throughput bioinformatics tools for detection of sequence variation. 
With many disparate variant-calling algorithms available, investigators must ask, 'Which method is best for my data?' Machine learning research has shown that so-called ensemble methods that combine the output of multiple models can dramatically improve classifier performance. Here we describe a novel variant-calling approach based on an ensemble of variant-calling algorithms, which we term the Consensus Genotyper for Exome Sequencing (CGES). CGES uses a two-stage voting scheme among four algorithm implementations. While our ensemble method can accept variants generated by any variant-calling algorithm, we used GATK2.8, SAMtools, FreeBayes and Atlas-SNP2 in building CGES because of their performance, widespread adoption and diverse but complementary algorithms. We apply CGES to 132 samples sequenced at the Hudson Alpha Institute for Biotechnology (HAIB, Huntsville, AL) using the Nimblegen Exome Capture and Illumina sequencing technology. Our sample set consisted of 40 complete trios, two families of four, one parent-child duo and two unrelated individuals. CGES yielded the fewest total variant calls (N(CGES) = 139 897), the highest Ts/Tv ratio (3.02), the lowest Mendelian error rate across all genotypes (0.028%), the highest rediscovery rate from the Exome Variant Server (EVS; 89.3%) and 1000 Genomes (1KG; 84.1%) and the highest positive predictive value (PPV; 96.1%) for a random sample of previously validated de novo variants. We describe these and other quality control (QC) metrics from consensus data and explain how the CGES pipeline can be used to generate call sets of varying quality stringency, including consensus calls present across all four algorithms, calls that are consistent across any three out of four algorithms, calls that are consistent across any two out of four algorithms or a more liberal set of all calls made by any algorithm. 
To enable accessible, efficient and reproducible analysis, we implement CGES both as a stand-alone command line tool available for download in GitHub and as a set of Galaxy tools and workflows configured to execute on parallel computers. Supplementary data are available at Bioinformatics online.</description><identifier>ISSN: 1367-4803</identifier><identifier>EISSN: 1367-4811</identifier><identifier>DOI: 10.1093/bioinformatics/btu591</identifier><identifier>PMID: 25270638</identifier><language>eng</language><publisher>England: Oxford University Press</publisher><subject>Algorithms ; Autistic Disorder - genetics ; Consensus Sequence ; Data Interpretation, Statistical ; Exome - genetics ; Genetic Testing ; Genotype ; High-Throughput Nucleotide Sequencing - methods ; Humans ; Original Papers ; Polymorphism, Single Nucleotide - genetics ; Software</subject><ispartof>Bioinformatics (Oxford, England), 2015-01, Vol.31 (2), p.187-193</ispartof><rights>The Author 2014. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com.</rights><rights>The Author 2014. Published by Oxford University Press. All rights reserved. 
For Permissions, please e-mail: journals.permissions@oup.com 2014</rights><lds50>peer_reviewed</lds50><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed><citedby>FETCH-LOGICAL-c411t-ff294fe46ba13c173c57bfe0b916926d15b2913469a4c1803ab7f119c4f1b6e43</citedby><cites>FETCH-LOGICAL-c411t-ff294fe46ba13c173c57bfe0b916926d15b2913469a4c1803ab7f119c4f1b6e43</cites></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><linktopdf>$$Uhttps://www.ncbi.nlm.nih.gov/pmc/articles/PMC4287941/pdf/$$EPDF$$P50$$Gpubmedcentral$$H</linktopdf><linktohtml>$$Uhttps://www.ncbi.nlm.nih.gov/pmc/articles/PMC4287941/$$EHTML$$P50$$Gpubmedcentral$$H</linktohtml><link.rule.ids>230,314,723,776,780,881,27901,27902,53766,53768</link.rule.ids><backlink>$$Uhttps://www.ncbi.nlm.nih.gov/pubmed/25270638$$D View this record in MEDLINE/PubMed$$Hfree_for_read</backlink></links><search><creatorcontrib>Trubetskoy, Vassily</creatorcontrib><creatorcontrib>Rodriguez, Alex</creatorcontrib><creatorcontrib>Dave, Uptal</creatorcontrib><creatorcontrib>Campbell, Nicholas</creatorcontrib><creatorcontrib>Crawford, Emily L</creatorcontrib><creatorcontrib>Cook, Edwin H</creatorcontrib><creatorcontrib>Sutcliffe, James S</creatorcontrib><creatorcontrib>Foster, Ian</creatorcontrib><creatorcontrib>Madduri, Ravi</creatorcontrib><creatorcontrib>Cox, Nancy J</creatorcontrib><creatorcontrib>Davis, Lea K</creatorcontrib><title>Consensus Genotyper for Exome Sequencing (CGES): improving the quality of exome variant genotypes</title><title>Bioinformatics (Oxford, England)</title><addtitle>Bioinformatics</addtitle><description>The development of cost-effective next-generation sequencing methods has spurred the development of high-throughput bioinformatics tools for detection of sequence variation. 
With many disparate variant-calling algorithms available, investigators must ask, 'Which method is best for my data?' Machine learning research has shown that so-called ensemble methods that combine the output of multiple models can dramatically improve classifier performance. Here we describe a novel variant-calling approach based on an ensemble of variant-calling algorithms, which we term the Consensus Genotyper for Exome Sequencing (CGES). CGES uses a two-stage voting scheme among four algorithm implementations. While our ensemble method can accept variants generated by any variant-calling algorithm, we used GATK2.8, SAMtools, FreeBayes and Atlas-SNP2 in building CGES because of their performance, widespread adoption and diverse but complementary algorithms. We apply CGES to 132 samples sequenced at the Hudson Alpha Institute for Biotechnology (HAIB, Huntsville, AL) using the Nimblegen Exome Capture and Illumina sequencing technology. Our sample set consisted of 40 complete trios, two families of four, one parent-child duo and two unrelated individuals. CGES yielded the fewest total variant calls (N(CGES) = 139 897), the highest Ts/Tv ratio (3.02), the lowest Mendelian error rate across all genotypes (0.028%), the highest rediscovery rate from the Exome Variant Server (EVS; 89.3%) and 1000 Genomes (1KG; 84.1%) and the highest positive predictive value (PPV; 96.1%) for a random sample of previously validated de novo variants. We describe these and other quality control (QC) metrics from consensus data and explain how the CGES pipeline can be used to generate call sets of varying quality stringency, including consensus calls present across all four algorithms, calls that are consistent across any three out of four algorithms, calls that are consistent across any two out of four algorithms or a more liberal set of all calls made by any algorithm. 
To enable accessible, efficient and reproducible analysis, we implement CGES both as a stand-alone command line tool available for download in GitHub and as a set of Galaxy tools and workflows configured to execute on parallel computers. Supplementary data are available at Bioinformatics online.</description><subject>Algorithms</subject><subject>Autistic Disorder - genetics</subject><subject>Consensus Sequence</subject><subject>Data Interpretation, Statistical</subject><subject>Exome - genetics</subject><subject>Genetic Testing</subject><subject>Genotype</subject><subject>High-Throughput Nucleotide Sequencing - methods</subject><subject>Humans</subject><subject>Original Papers</subject><subject>Polymorphism, Single Nucleotide - genetics</subject><subject>Software</subject><issn>1367-4803</issn><issn>1367-4811</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2015</creationdate><recordtype>article</recordtype><sourceid>EIF</sourceid><recordid>eNpVkU1PGzEQhq2qFV_lJ4B8hEPAs_7YmEOlKgqhElIPtGfLNnYw2rWD7Y2af89C0ghOM5qZ95kZvQidAbkCIum1CSlEn3Kva7Dl2tSBS_iCjoCKdsKmAF_3OaGH6LiUZ0IIJ1wcoMOGNy0RdHqE9CzF4mIZCl64mOpm5TIesXj-L_UOP7iXwUUb4hJfzBbzh8sbHPpVTuu3Sn1y-GXQXagbnDx274q1zkHHipc7WvmOvnndFXe6iyfo7-38z-xucv978Wv2835iGUCdeN9I5h0TRgO10FLLW-MdMRKEbMQjcNNIoExIzSyMP2nTegBpmQcjHKMn6MeWuxpM7x6tizXrTq1y6HXeqKSD-tyJ4Ukt01qxZtpKBiPgYgfIafy6VNWHYl3X6ejSUBQIxhtKBG_HUb4dtTmVkp3frwGi3uxRn-1RW3tG3fnHG_eq_37QV3XJk6s</recordid><startdate>20150115</startdate><enddate>20150115</enddate><creator>Trubetskoy, Vassily</creator><creator>Rodriguez, Alex</creator><creator>Dave, Uptal</creator><creator>Campbell, Nicholas</creator><creator>Crawford, Emily L</creator><creator>Cook, Edwin H</creator><creator>Sutcliffe, James S</creator><creator>Foster, Ian</creator><creator>Madduri, Ravi</creator><creator>Cox, Nancy J</creator><creator>Davis, Lea K</creator><general>Oxford University 
Press</general><scope>CGR</scope><scope>CUY</scope><scope>CVF</scope><scope>ECM</scope><scope>EIF</scope><scope>NPM</scope><scope>AAYXX</scope><scope>CITATION</scope><scope>7X8</scope><scope>5PM</scope></search><sort><creationdate>20150115</creationdate><title>Consensus Genotyper for Exome Sequencing (CGES): improving the quality of exome variant genotypes</title><author>Trubetskoy, Vassily ; Rodriguez, Alex ; Dave, Uptal ; Campbell, Nicholas ; Crawford, Emily L ; Cook, Edwin H ; Sutcliffe, James S ; Foster, Ian ; Madduri, Ravi ; Cox, Nancy J ; Davis, Lea K</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-c411t-ff294fe46ba13c173c57bfe0b916926d15b2913469a4c1803ab7f119c4f1b6e43</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2015</creationdate><topic>Algorithms</topic><topic>Autistic Disorder - genetics</topic><topic>Consensus Sequence</topic><topic>Data Interpretation, Statistical</topic><topic>Exome - genetics</topic><topic>Genetic Testing</topic><topic>Genotype</topic><topic>High-Throughput Nucleotide Sequencing - methods</topic><topic>Humans</topic><topic>Original Papers</topic><topic>Polymorphism, Single Nucleotide - genetics</topic><topic>Software</topic><toplevel>peer_reviewed</toplevel><toplevel>online_resources</toplevel><creatorcontrib>Trubetskoy, Vassily</creatorcontrib><creatorcontrib>Rodriguez, Alex</creatorcontrib><creatorcontrib>Dave, Uptal</creatorcontrib><creatorcontrib>Campbell, Nicholas</creatorcontrib><creatorcontrib>Crawford, Emily L</creatorcontrib><creatorcontrib>Cook, Edwin H</creatorcontrib><creatorcontrib>Sutcliffe, James S</creatorcontrib><creatorcontrib>Foster, Ian</creatorcontrib><creatorcontrib>Madduri, Ravi</creatorcontrib><creatorcontrib>Cox, Nancy J</creatorcontrib><creatorcontrib>Davis, Lea K</creatorcontrib><collection>Medline</collection><collection>MEDLINE</collection><collection>MEDLINE 
(Ovid)</collection><collection>MEDLINE</collection><collection>MEDLINE</collection><collection>PubMed</collection><collection>CrossRef</collection><collection>MEDLINE - Academic</collection><collection>PubMed Central (Full Participant titles)</collection><jtitle>Bioinformatics (Oxford, England)</jtitle></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>Trubetskoy, Vassily</au><au>Rodriguez, Alex</au><au>Dave, Uptal</au><au>Campbell, Nicholas</au><au>Crawford, Emily L</au><au>Cook, Edwin H</au><au>Sutcliffe, James S</au><au>Foster, Ian</au><au>Madduri, Ravi</au><au>Cox, Nancy J</au><au>Davis, Lea K</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>Consensus Genotyper for Exome Sequencing (CGES): improving the quality of exome variant genotypes</atitle><jtitle>Bioinformatics (Oxford, England)</jtitle><addtitle>Bioinformatics</addtitle><date>2015-01-15</date><risdate>2015</risdate><volume>31</volume><issue>2</issue><spage>187</spage><epage>193</epage><pages>187-193</pages><issn>1367-4803</issn><eissn>1367-4811</eissn><abstract>The development of cost-effective next-generation sequencing methods has spurred the development of high-throughput bioinformatics tools for detection of sequence variation. With many disparate variant-calling algorithms available, investigators must ask, 'Which method is best for my data?' Machine learning research has shown that so-called ensemble methods that combine the output of multiple models can dramatically improve classifier performance. Here we describe a novel variant-calling approach based on an ensemble of variant-calling algorithms, which we term the Consensus Genotyper for Exome Sequencing (CGES). CGES uses a two-stage voting scheme among four algorithm implementations. 
While our ensemble method can accept variants generated by any variant-calling algorithm, we used GATK2.8, SAMtools, FreeBayes and Atlas-SNP2 in building CGES because of their performance, widespread adoption and diverse but complementary algorithms. We apply CGES to 132 samples sequenced at the Hudson Alpha Institute for Biotechnology (HAIB, Huntsville, AL) using the Nimblegen Exome Capture and Illumina sequencing technology. Our sample set consisted of 40 complete trios, two families of four, one parent-child duo and two unrelated individuals. CGES yielded the fewest total variant calls (N(CGES) = 139 897), the highest Ts/Tv ratio (3.02), the lowest Mendelian error rate across all genotypes (0.028%), the highest rediscovery rate from the Exome Variant Server (EVS; 89.3%) and 1000 Genomes (1KG; 84.1%) and the highest positive predictive value (PPV; 96.1%) for a random sample of previously validated de novo variants. We describe these and other quality control (QC) metrics from consensus data and explain how the CGES pipeline can be used to generate call sets of varying quality stringency, including consensus calls present across all four algorithms, calls that are consistent across any three out of four algorithms, calls that are consistent across any two out of four algorithms or a more liberal set of all calls made by any algorithm. To enable accessible, efficient and reproducible analysis, we implement CGES both as a stand-alone command line tool available for download in GitHub and as a set of Galaxy tools and workflows configured to execute on parallel computers. Supplementary data are available at Bioinformatics online.</abstract><cop>England</cop><pub>Oxford University Press</pub><pmid>25270638</pmid><doi>10.1093/bioinformatics/btu591</doi><tpages>7</tpages><oa>free_for_read</oa></addata></record>
fulltext	fulltext
identifier	ISSN: 1367-4803
ispartof	Bioinformatics (Oxford, England), 2015-01, Vol.31 (2), p.187-193
issn	1367-4803 1367-4811
language	eng
recordid	cdi_pubmedcentral_primary_oai_pubmedcentral_nih_gov_4287941
source	Oxford Journals Open Access Collection; MEDLINE; EZB-FREE-00999 freely available EZB journals; PubMed Central; Alma/SFX Local Collection
subjects	Algorithms; Autistic Disorder - genetics; Consensus Sequence; Data Interpretation, Statistical; Exome - genetics; Genetic Testing; Genotype; High-Throughput Nucleotide Sequencing - methods; Humans; Original Papers; Polymorphism, Single Nucleotide - genetics; Software
title	Consensus Genotyper for Exome Sequencing (CGES): improving the quality of exome variant genotypes
url	https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-02-04T07%3A53%3A43IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_pubme&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Consensus%20Genotyper%20for%20Exome%20Sequencing%20(CGES):%20improving%20the%20quality%20of%20exome%20variant%20genotypes&rft.jtitle=Bioinformatics%20(Oxford,%20England)&rft.au=Trubetskoy,%20Vassily&rft.date=2015-01-15&rft.volume=31&rft.issue=2&rft.spage=187&rft.epage=193&rft.pages=187-193&rft.issn=1367-4803&rft.eissn=1367-4811&rft_id=info:doi/10.1093/bioinformatics/btu591&rft_dat=%3Cproquest_pubme%3E1645230657%3C/proquest_pubme%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=1645230657&rft_id=info:pmid/25270638&rfr_iscdi=true