A hybrid computational strategy to address WGS variant analysis in >5000 samples

The decreasing cost of sequencing is driving the need for cost-effective, real-time variant calling of whole genome sequencing data. The scale of these projects is far beyond the capacity of the computing resources typically available to most research labs. Other infrastructures like the cloud AWS...

Detailed description

Bibliographic details
Published in: BMC bioinformatics 2016-09, Vol.17 (1), p.361-361, Article 361
Main authors: Huang, Zhuoyi, Rustagi, Navin, Veeraraghavan, Narayanan, Carroll, Andrew, Gibbs, Richard, Boerwinkle, Eric, Venkata, Manjunath Gorentla, Yu, Fuli
Format: Article
Language: eng
Subjects:
Online access: Full text
container_end_page 361
container_issue 1
container_start_page 361
container_title BMC bioinformatics
container_volume 17
creator Huang, Zhuoyi
Rustagi, Navin
Veeraraghavan, Narayanan
Carroll, Andrew
Gibbs, Richard
Boerwinkle, Eric
Venkata, Manjunath Gorentla
Yu, Fuli
description The decreasing cost of sequencing is driving the need for cost-effective, real-time variant calling of whole genome sequencing (WGS) data. The scale of these projects is far beyond the capacity of the computing resources typically available to most research labs. Other infrastructures, such as the AWS cloud environment and supercomputers, also have limitations that make large-scale joint variant calling infeasible, and infrastructure-specific variant calling strategies either fail to scale up to large datasets or abandon joint calling. We present a high-throughput framework that includes multiple variant callers for single nucleotide variant (SNV) calling and leverages a hybrid computing infrastructure consisting of the AWS cloud, supercomputers and local high-performance computing infrastructure. We present a novel binning approach for large-scale joint variant calling and imputation that can scale up to over 10,000 samples while producing SNV callsets with high sensitivity and specificity. As a proof of principle, we present results from the analysis of the Cohorts for Heart and Aging Research in Genomic Epidemiology (CHARGE) WGS freeze 3 dataset, in which joint calling, imputation and phasing of over 5300 whole genome samples were completed in under 6 weeks using four state-of-the-art callers: SNPTools, GATK-HaplotypeCaller, GATK-UnifiedGenotyper and GotCloud. For the computation we used Amazon AWS, a 4000-core in-house cluster at Baylor College of Medicine, IBM power PC Blue BioU at Rice and Rhea at Oak Ridge National Laboratory (ORNL). AWS was used for joint calling of 180 TB of BAM files, and the ORNL and Rice supercomputers were used for the imputation and phasing step. All other steps were carried out on the local compute cluster. The entire operation used 5.2 million core hours and transferred a total of only 6 TB of data across the platforms.
Even with increasing sizes of whole genome datasets, ensemble joint calling of SNVs for low-coverage data can be accomplished in a scalable, cost-effective and fast manner by using heterogeneous computing platforms without compromising the quality of variants.
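The abstract mentions a binning approach for large-scale joint calling but this record does not describe how the bins are defined. As a rough illustration only, the sketch below shows one common way such a scheme could work: cutting each chromosome into fixed-size, non-overlapping windows so that each window can be joint-called across all samples as an independent job. The names `Bin` and `make_bins`, the bin size and the chromosome lengths are hypothetical and are not taken from the paper.

```python
from typing import Dict, Iterator, NamedTuple


class Bin(NamedTuple):
    """A genomic window; coordinates are 1-based and inclusive (an assumption)."""
    chrom: str
    start: int
    end: int


def make_bins(chrom_lengths: Dict[str, int], bin_size: int) -> Iterator[Bin]:
    """Yield consecutive, non-overlapping bins covering every chromosome.

    This is a hypothetical fixed-width scheme, not the authors' method.
    The last bin of each chromosome is truncated at the chromosome end.
    """
    if bin_size <= 0:
        raise ValueError("bin_size must be positive")
    for chrom, length in chrom_lengths.items():
        for start in range(1, length + 1, bin_size):
            yield Bin(chrom, start, min(start + bin_size - 1, length))


if __name__ == "__main__":
    # Toy chromosome lengths, chosen only to keep the output short.
    toy_lengths = {"chrA": 2500, "chrB": 1000}
    for b in make_bins(toy_lengths, bin_size=1000):
        # Each bin could be handed to a caller as one region-restricted job.
        print(f"{b.chrom}:{b.start}-{b.end}")
```

Under this assumption, the number of independent jobs grows with genome length divided by bin size rather than with the number of samples, which is one reason region-based splitting is often used to spread joint calling over many cores.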
doi_str_mv 10.1186/s12859-016-1211-6
format Article
fullrecord <record><control><sourceid>gale_pubme</sourceid><recordid>TN_cdi_pubmedcentral_primary_oai_pubmedcentral_nih_gov_5018196</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><galeid>A465629003</galeid><sourcerecordid>A465629003</sourcerecordid><originalsourceid>FETCH-LOGICAL-c621t-f073cc5169f90307e3ae4ccbdbae804afb1978246108d538b4413578637fc3d23</originalsourceid><addsrcrecordid>eNptklFrFDEUhQdRbK3-AF8k6Is-TM3NJJnJS2EpWgsFxSo-hkwms5syM1lzM8X992bYWrsieUhIvntyOJyieAn0FKCR7xFYI1RJQZbAAEr5qDgGXkPJgIrHD85HxTPEG0qhbqh4WhyxWgLjXB0XX1Zks2uj74gN43ZOJvkwmYFgiia59Y6kQEzXRYdIflxck1sTvZkSMRnaoUfiJ3ImKKUEzbgdHD4vnvRmQPfibj8pvn_88O38U3n1-eLyfHVVWskglT2tK2sFSNUrWtHaVcZxa9uuNa6h3PQtqLphXAJtOlE1LedQibqRVd3bqmPVSXG2193O7eg666bseNDb6EcTdzoYrw9fJr_R63CrBYUGlMwCr_cCAZPXaH1ydmPDNDmbNAgpBOMZenv3Sww_Z4dJjx6tGwYzuTCjXqSAMaVURt_8g96EOeaYFopxVTMB_C-1NoPTfupDNmcXUb3iUkimKK0ydfofKq_OjT57dL3P9wcD7w4GMpPcr7Q2M6K-vP56yMKetTEgRtffhwZUL73S-17p3Cu99EovYb16mPb9xJ8iVb8B1gbEaA</addsrcrecordid><sourcetype>Open Access Repository</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>1824972514</pqid></control><display><type>article</type><title>A hybrid computational strategy to address WGS variant analysis in &gt;5000 samples</title><source>MEDLINE</source><source>DOAJ Directory of Open Access Journals</source><source>Elektronische Zeitschriftenbibliothek - Frei zugängliche E-Journals</source><source>SpringerNature Journals</source><source>PubMed Central Open Access</source><source>PubMed Central</source><source>Springer Nature OA/Free Journals</source><creator>Huang, Zhuoyi ; Rustagi, Navin ; Veeraraghavan, Narayanan ; Carroll, Andrew ; Gibbs, Richard ; Boerwinkle, Eric ; Venkata, Manjunath Gorentla ; Yu, Fuli</creator><creatorcontrib>Huang, Zhuoyi ; Rustagi, Navin ; Veeraraghavan, Narayanan ; Carroll, Andrew ; Gibbs, Richard ; Boerwinkle, Eric ; Venkata, Manjunath Gorentla ; Yu, Fuli ; Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States). 
Oak Ridge Leadership Computing Facility (OLCF)</creatorcontrib><description>The decreasing costs of sequencing are driving the need for cost effective and real time variant calling of whole genome sequencing data. The scale of these projects are far beyond the capacity of typical computing resources available with most research labs. Other infrastructures like the cloud AWS environment and supercomputers also have limitations due to which large scale joint variant calling becomes infeasible, and infrastructure specific variant calling strategies either fail to scale up to large datasets or abandon joint calling strategies. We present a high throughput framework including multiple variant callers for single nucleotide variant (SNV) calling, which leverages hybrid computing infrastructure consisting of cloud AWS, supercomputers and local high performance computing infrastructures. We present a novel binning approach for large scale joint variant calling and imputation which can scale up to over 10,000 samples while producing SNV callsets with high sensitivity and specificity. As a proof of principle, we present results of analysis on Cohorts for Heart And Aging Research in Genomic Epidemiology (CHARGE) WGS freeze 3 dataset in which joint calling, imputation and phasing of over 5300 whole genome samples was produced in under 6 weeks using four state-of-the-art callers. The callers used were SNPTools, GATK-HaplotypeCaller, GATK-UnifiedGenotyper and GotCloud. We used Amazon AWS, a 4000-core in-house cluster at Baylor College of Medicine, IBM power PC Blue BioU at Rice and Rhea at Oak Ridge National Laboratory (ORNL) for the computation. AWS was used for joint calling of 180 TB of BAM files, and ORNL and Rice supercomputers were used for the imputation and phasing step. All other steps were carried out on the local compute cluster. The entire operation used 5.2 million core hours and only transferred a total of 6 TB of data across the platforms. 
Even with increasing sizes of whole genome datasets, ensemble joint calling of SNVs for low coverage data can be accomplished in a scalable, cost effective and fast manner by using heterogeneous computing platforms without compromising on the quality of variants.</description><identifier>ISSN: 1471-2105</identifier><identifier>EISSN: 1471-2105</identifier><identifier>DOI: 10.1186/s12859-016-1211-6</identifier><identifier>PMID: 27612449</identifier><language>eng</language><publisher>England: BioMed Central Ltd</publisher><subject>BASIC BIOLOGICAL SCIENCES ; Big data ; biochemistry &amp; molecular biology ; biotechnology &amp; applied microbiology ; cloud AWS ; Databases, Genetic ; DNA sequencing ; ensemble calling ; Genetic variation ; Genome, Human ; Genomics - methods ; High-Throughput Nucleotide Sequencing - methods ; Humans ; Information management ; joint calling ; mathematical &amp; computational biology ; Methodology ; Nucleotide sequencing ; scalable ; SNV ; supercomputer ; variant calling ; WGS</subject><ispartof>BMC bioinformatics, 2016-09, Vol.17 (1), p.361-361, Article 361</ispartof><rights>COPYRIGHT 2016 BioMed Central Ltd.</rights><rights>Copyright BioMed Central 2016</rights><rights>The Author(s). 
2016</rights><lds50>peer_reviewed</lds50><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed><citedby>FETCH-LOGICAL-c621t-f073cc5169f90307e3ae4ccbdbae804afb1978246108d538b4413578637fc3d23</citedby><cites>FETCH-LOGICAL-c621t-f073cc5169f90307e3ae4ccbdbae804afb1978246108d538b4413578637fc3d23</cites><orcidid>0000-0001-9149-295X ; 000000019149295X</orcidid></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><linktopdf>$$Uhttps://www.ncbi.nlm.nih.gov/pmc/articles/PMC5018196/pdf/$$EPDF$$P50$$Gpubmedcentral$$Hfree_for_read</linktopdf><linktohtml>$$Uhttps://www.ncbi.nlm.nih.gov/pmc/articles/PMC5018196/$$EHTML$$P50$$Gpubmedcentral$$Hfree_for_read</linktohtml><link.rule.ids>230,315,728,781,785,865,886,27929,27930,53796,53798</link.rule.ids><backlink>$$Uhttps://www.ncbi.nlm.nih.gov/pubmed/27612449$$D View this record in MEDLINE/PubMed$$Hfree_for_read</backlink><backlink>$$Uhttps://www.osti.gov/servlets/purl/1565524$$D View this record in Osti.gov$$Hfree_for_read</backlink></links><search><creatorcontrib>Huang, Zhuoyi</creatorcontrib><creatorcontrib>Rustagi, Navin</creatorcontrib><creatorcontrib>Veeraraghavan, Narayanan</creatorcontrib><creatorcontrib>Carroll, Andrew</creatorcontrib><creatorcontrib>Gibbs, Richard</creatorcontrib><creatorcontrib>Boerwinkle, Eric</creatorcontrib><creatorcontrib>Venkata, Manjunath Gorentla</creatorcontrib><creatorcontrib>Yu, Fuli</creatorcontrib><creatorcontrib>Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States). Oak Ridge Leadership Computing Facility (OLCF)</creatorcontrib><title>A hybrid computational strategy to address WGS variant analysis in &gt;5000 samples</title><title>BMC bioinformatics</title><addtitle>BMC Bioinformatics</addtitle><description>The decreasing costs of sequencing are driving the need for cost effective and real time variant calling of whole genome sequencing data. 
The scale of these projects are far beyond the capacity of typical computing resources available with most research labs. Other infrastructures like the cloud AWS environment and supercomputers also have limitations due to which large scale joint variant calling becomes infeasible, and infrastructure specific variant calling strategies either fail to scale up to large datasets or abandon joint calling strategies. We present a high throughput framework including multiple variant callers for single nucleotide variant (SNV) calling, which leverages hybrid computing infrastructure consisting of cloud AWS, supercomputers and local high performance computing infrastructures. We present a novel binning approach for large scale joint variant calling and imputation which can scale up to over 10,000 samples while producing SNV callsets with high sensitivity and specificity. As a proof of principle, we present results of analysis on Cohorts for Heart And Aging Research in Genomic Epidemiology (CHARGE) WGS freeze 3 dataset in which joint calling, imputation and phasing of over 5300 whole genome samples was produced in under 6 weeks using four state-of-the-art callers. The callers used were SNPTools, GATK-HaplotypeCaller, GATK-UnifiedGenotyper and GotCloud. We used Amazon AWS, a 4000-core in-house cluster at Baylor College of Medicine, IBM power PC Blue BioU at Rice and Rhea at Oak Ridge National Laboratory (ORNL) for the computation. AWS was used for joint calling of 180 TB of BAM files, and ORNL and Rice supercomputers were used for the imputation and phasing step. All other steps were carried out on the local compute cluster. The entire operation used 5.2 million core hours and only transferred a total of 6 TB of data across the platforms. 
Even with increasing sizes of whole genome datasets, ensemble joint calling of SNVs for low coverage data can be accomplished in a scalable, cost effective and fast manner by using heterogeneous computing platforms without compromising on the quality of variants.</description><subject>BASIC BIOLOGICAL SCIENCES</subject><subject>Big data</subject><subject>biochemistry &amp; molecular biology</subject><subject>biotechnology &amp; applied microbiology</subject><subject>cloud AWS</subject><subject>Databases, Genetic</subject><subject>DNA sequencing</subject><subject>ensemble calling</subject><subject>Genetic variation</subject><subject>Genome, Human</subject><subject>Genomics - methods</subject><subject>High-Throughput Nucleotide Sequencing - methods</subject><subject>Humans</subject><subject>Information management</subject><subject>joint calling</subject><subject>mathematical &amp; computational biology</subject><subject>Methodology</subject><subject>Nucleotide sequencing</subject><subject>scalable</subject><subject>SNV</subject><subject>supercomputer</subject><subject>variant 
calling</subject><subject>WGS</subject><issn>1471-2105</issn><issn>1471-2105</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2016</creationdate><recordtype>article</recordtype><sourceid>EIF</sourceid><sourceid>ABUWG</sourceid><sourceid>AFKRA</sourceid><sourceid>AZQEC</sourceid><sourceid>BENPR</sourceid><sourceid>CCPQU</sourceid><sourceid>DWQXO</sourceid><sourceid>GNUQQ</sourceid><recordid>eNptklFrFDEUhQdRbK3-AF8k6Is-TM3NJJnJS2EpWgsFxSo-hkwms5syM1lzM8X992bYWrsieUhIvntyOJyieAn0FKCR7xFYI1RJQZbAAEr5qDgGXkPJgIrHD85HxTPEG0qhbqh4WhyxWgLjXB0XX1Zks2uj74gN43ZOJvkwmYFgiia59Y6kQEzXRYdIflxck1sTvZkSMRnaoUfiJ3ImKKUEzbgdHD4vnvRmQPfibj8pvn_88O38U3n1-eLyfHVVWskglT2tK2sFSNUrWtHaVcZxa9uuNa6h3PQtqLphXAJtOlE1LedQibqRVd3bqmPVSXG2193O7eg666bseNDb6EcTdzoYrw9fJr_R63CrBYUGlMwCr_cCAZPXaH1ydmPDNDmbNAgpBOMZenv3Sww_Z4dJjx6tGwYzuTCjXqSAMaVURt_8g96EOeaYFopxVTMB_C-1NoPTfupDNmcXUb3iUkimKK0ydfofKq_OjT57dL3P9wcD7w4GMpPcr7Q2M6K-vP56yMKetTEgRtffhwZUL73S-17p3Cu99EovYb16mPb9xJ8iVb8B1gbEaA</recordid><startdate>20160910</startdate><enddate>20160910</enddate><creator>Huang, Zhuoyi</creator><creator>Rustagi, Navin</creator><creator>Veeraraghavan, Narayanan</creator><creator>Carroll, Andrew</creator><creator>Gibbs, Richard</creator><creator>Boerwinkle, Eric</creator><creator>Venkata, Manjunath Gorentla</creator><creator>Yu, Fuli</creator><general>BioMed Central Ltd</general><general>BioMed 
Central</general><scope>CGR</scope><scope>CUY</scope><scope>CVF</scope><scope>ECM</scope><scope>EIF</scope><scope>NPM</scope><scope>AAYXX</scope><scope>CITATION</scope><scope>ISR</scope><scope>3V.</scope><scope>7QO</scope><scope>7SC</scope><scope>7X7</scope><scope>7XB</scope><scope>88E</scope><scope>8AL</scope><scope>8AO</scope><scope>8FD</scope><scope>8FE</scope><scope>8FG</scope><scope>8FH</scope><scope>8FI</scope><scope>8FJ</scope><scope>8FK</scope><scope>ABUWG</scope><scope>AFKRA</scope><scope>ARAPS</scope><scope>AZQEC</scope><scope>BBNVY</scope><scope>BENPR</scope><scope>BGLVJ</scope><scope>BHPHI</scope><scope>CCPQU</scope><scope>DWQXO</scope><scope>FR3</scope><scope>FYUFA</scope><scope>GHDGH</scope><scope>GNUQQ</scope><scope>HCIFZ</scope><scope>JQ2</scope><scope>K7-</scope><scope>K9.</scope><scope>L7M</scope><scope>LK8</scope><scope>L~C</scope><scope>L~D</scope><scope>M0N</scope><scope>M0S</scope><scope>M1P</scope><scope>M7P</scope><scope>P5Z</scope><scope>P62</scope><scope>P64</scope><scope>PIMPY</scope><scope>PQEST</scope><scope>PQQKQ</scope><scope>PQUKI</scope><scope>PRINS</scope><scope>Q9U</scope><scope>7X8</scope><scope>OIOZB</scope><scope>OTOTI</scope><scope>5PM</scope><orcidid>https://orcid.org/0000-0001-9149-295X</orcidid><orcidid>https://orcid.org/000000019149295X</orcidid></search><sort><creationdate>20160910</creationdate><title>A hybrid computational strategy to address WGS variant analysis in &gt;5000 samples</title><author>Huang, Zhuoyi ; Rustagi, Navin ; Veeraraghavan, Narayanan ; Carroll, Andrew ; Gibbs, Richard ; Boerwinkle, Eric ; Venkata, Manjunath Gorentla ; Yu, Fuli</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-c621t-f073cc5169f90307e3ae4ccbdbae804afb1978246108d538b4413578637fc3d23</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2016</creationdate><topic>BASIC BIOLOGICAL SCIENCES</topic><topic>Big data</topic><topic>biochemistry &amp; molecular 
biology</topic><topic>biotechnology &amp; applied microbiology</topic><topic>cloud AWS</topic><topic>Databases, Genetic</topic><topic>DNA sequencing</topic><topic>ensemble calling</topic><topic>Genetic variation</topic><topic>Genome, Human</topic><topic>Genomics - methods</topic><topic>High-Throughput Nucleotide Sequencing - methods</topic><topic>Humans</topic><topic>Information management</topic><topic>joint calling</topic><topic>mathematical &amp; computational biology</topic><topic>Methodology</topic><topic>Nucleotide sequencing</topic><topic>scalable</topic><topic>SNV</topic><topic>supercomputer</topic><topic>variant calling</topic><topic>WGS</topic><toplevel>peer_reviewed</toplevel><toplevel>online_resources</toplevel><creatorcontrib>Huang, Zhuoyi</creatorcontrib><creatorcontrib>Rustagi, Navin</creatorcontrib><creatorcontrib>Veeraraghavan, Narayanan</creatorcontrib><creatorcontrib>Carroll, Andrew</creatorcontrib><creatorcontrib>Gibbs, Richard</creatorcontrib><creatorcontrib>Boerwinkle, Eric</creatorcontrib><creatorcontrib>Venkata, Manjunath Gorentla</creatorcontrib><creatorcontrib>Yu, Fuli</creatorcontrib><creatorcontrib>Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States). 
Oak Ridge Leadership Computing Facility (OLCF)</creatorcontrib><collection>Medline</collection><collection>MEDLINE</collection><collection>MEDLINE (Ovid)</collection><collection>MEDLINE</collection><collection>MEDLINE</collection><collection>PubMed</collection><collection>CrossRef</collection><collection>Gale In Context: Science</collection><collection>ProQuest Central (Corporate)</collection><collection>Biotechnology Research Abstracts</collection><collection>Computer and Information Systems Abstracts</collection><collection>Health &amp; Medical Collection</collection><collection>ProQuest Central (purchase pre-March 2016)</collection><collection>Medical Database (Alumni Edition)</collection><collection>Computing Database (Alumni Edition)</collection><collection>ProQuest Pharma Collection</collection><collection>Technology Research Database</collection><collection>ProQuest SciTech Collection</collection><collection>ProQuest Technology Collection</collection><collection>ProQuest Natural Science Collection</collection><collection>Hospital Premium Collection</collection><collection>Hospital Premium Collection (Alumni Edition)</collection><collection>ProQuest Central (Alumni) (purchase pre-March 2016)</collection><collection>ProQuest Central (Alumni Edition)</collection><collection>ProQuest Central UK/Ireland</collection><collection>Advanced Technologies &amp; Aerospace Collection</collection><collection>ProQuest Central Essentials</collection><collection>Biological Science Collection</collection><collection>ProQuest Central</collection><collection>Technology Collection</collection><collection>Natural Science Collection</collection><collection>ProQuest One Community College</collection><collection>ProQuest Central Korea</collection><collection>Engineering Research Database</collection><collection>Health Research Premium Collection</collection><collection>Health Research Premium Collection (Alumni)</collection><collection>ProQuest Central 
Student</collection><collection>SciTech Premium Collection</collection><collection>ProQuest Computer Science Collection</collection><collection>Computer Science Database</collection><collection>ProQuest Health &amp; Medical Complete (Alumni)</collection><collection>Advanced Technologies Database with Aerospace</collection><collection>ProQuest Biological Science Collection</collection><collection>Computer and Information Systems Abstracts – Academic</collection><collection>Computer and Information Systems Abstracts Professional</collection><collection>Computing Database</collection><collection>Health &amp; Medical Collection (Alumni Edition)</collection><collection>Medical Database</collection><collection>Biological Science Database</collection><collection>Advanced Technologies &amp; Aerospace Database</collection><collection>ProQuest Advanced Technologies &amp; Aerospace Collection</collection><collection>Biotechnology and BioEngineering Abstracts</collection><collection>Publicly Available Content Database</collection><collection>ProQuest One Academic Eastern Edition (DO NOT USE)</collection><collection>ProQuest One Academic</collection><collection>ProQuest One Academic UKI Edition</collection><collection>ProQuest Central China</collection><collection>ProQuest Central Basic</collection><collection>MEDLINE - Academic</collection><collection>OSTI.GOV - Hybrid</collection><collection>OSTI.GOV</collection><collection>PubMed Central (Full Participant titles)</collection><jtitle>BMC bioinformatics</jtitle></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>Huang, Zhuoyi</au><au>Rustagi, Navin</au><au>Veeraraghavan, Narayanan</au><au>Carroll, Andrew</au><au>Gibbs, Richard</au><au>Boerwinkle, Eric</au><au>Venkata, Manjunath Gorentla</au><au>Yu, Fuli</au><aucorp>Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States). 
Oak Ridge Leadership Computing Facility (OLCF)</aucorp><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>A hybrid computational strategy to address WGS variant analysis in &gt;5000 samples</atitle><jtitle>BMC bioinformatics</jtitle><addtitle>BMC Bioinformatics</addtitle><date>2016-09-10</date><risdate>2016</risdate><volume>17</volume><issue>1</issue><spage>361</spage><epage>361</epage><pages>361-361</pages><artnum>361</artnum><issn>1471-2105</issn><eissn>1471-2105</eissn><abstract>The decreasing costs of sequencing are driving the need for cost effective and real time variant calling of whole genome sequencing data. The scale of these projects are far beyond the capacity of typical computing resources available with most research labs. Other infrastructures like the cloud AWS environment and supercomputers also have limitations due to which large scale joint variant calling becomes infeasible, and infrastructure specific variant calling strategies either fail to scale up to large datasets or abandon joint calling strategies. We present a high throughput framework including multiple variant callers for single nucleotide variant (SNV) calling, which leverages hybrid computing infrastructure consisting of cloud AWS, supercomputers and local high performance computing infrastructures. We present a novel binning approach for large scale joint variant calling and imputation which can scale up to over 10,000 samples while producing SNV callsets with high sensitivity and specificity. As a proof of principle, we present results of analysis on Cohorts for Heart And Aging Research in Genomic Epidemiology (CHARGE) WGS freeze 3 dataset in which joint calling, imputation and phasing of over 5300 whole genome samples was produced in under 6 weeks using four state-of-the-art callers. The callers used were SNPTools, GATK-HaplotypeCaller, GATK-UnifiedGenotyper and GotCloud. 
We used Amazon AWS, a 4000-core in-house cluster at Baylor College of Medicine, IBM power PC Blue BioU at Rice and Rhea at Oak Ridge National Laboratory (ORNL) for the computation. AWS was used for joint calling of 180 TB of BAM files, and ORNL and Rice supercomputers were used for the imputation and phasing step. All other steps were carried out on the local compute cluster. The entire operation used 5.2 million core hours and only transferred a total of 6 TB of data across the platforms. Even with increasing sizes of whole genome datasets, ensemble joint calling of SNVs for low coverage data can be accomplished in a scalable, cost effective and fast manner by using heterogeneous computing platforms without compromising on the quality of variants.</abstract><cop>England</cop><pub>BioMed Central Ltd</pub><pmid>27612449</pmid><doi>10.1186/s12859-016-1211-6</doi><tpages>1</tpages><orcidid>https://orcid.org/0000-0001-9149-295X</orcidid><orcidid>https://orcid.org/000000019149295X</orcidid><oa>free_for_read</oa></addata></record>
fulltext fulltext
identifier ISSN: 1471-2105
ispartof BMC bioinformatics, 2016-09, Vol.17 (1), p.361-361, Article 361
issn 1471-2105
1471-2105
language eng
recordid cdi_pubmedcentral_primary_oai_pubmedcentral_nih_gov_5018196
source MEDLINE; DOAJ Directory of Open Access Journals; Elektronische Zeitschriftenbibliothek - Frei zugängliche E-Journals; SpringerNature Journals; PubMed Central Open Access; PubMed Central; Springer Nature OA/Free Journals
subjects BASIC BIOLOGICAL SCIENCES
Big data
biochemistry & molecular biology
biotechnology & applied microbiology
cloud AWS
Databases, Genetic
DNA sequencing
ensemble calling
Genetic variation
Genome, Human
Genomics - methods
High-Throughput Nucleotide Sequencing - methods
Humans
Information management
joint calling
mathematical & computational biology
Methodology
Nucleotide sequencing
scalable
SNV
supercomputer
variant calling
WGS
title A hybrid computational strategy to address WGS variant analysis in >5000 samples
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2024-12-13T05%3A16%3A06IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-gale_pubme&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=A%20hybrid%20computational%20strategy%20to%20address%20WGS%20variant%20analysis%20in%20%3E5000%20samples&rft.jtitle=BMC%20bioinformatics&rft.au=Huang,%20Zhuoyi&rft.aucorp=Oak%20Ridge%20National%20Lab.%20(ORNL),%20Oak%20Ridge,%20TN%20(United%20States).%20Oak%20Ridge%20Leadership%20Computing%20Facility%20(OLCF)&rft.date=2016-09-10&rft.volume=17&rft.issue=1&rft.spage=361&rft.epage=361&rft.pages=361-361&rft.artnum=361&rft.issn=1471-2105&rft.eissn=1471-2105&rft_id=info:doi/10.1186/s12859-016-1211-6&rft_dat=%3Cgale_pubme%3EA465629003%3C/gale_pubme%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=1824972514&rft_id=info:pmid/27612449&rft_galeid=A465629003&rfr_iscdi=true