Halvade: scalable sequence analysis with MapReduce

Post-sequencing DNA analysis typically consists of read mapping followed by variant calling. Especially for whole genome sequencing, this computational step is very time-consuming, even when using multithreading on a multi-core machine. We present Halvade, a framework that enables sequencing pipelin...

Bibliographic details
Published in: Bioinformatics 2015-08, Vol.31 (15), p.2482-2488
Main authors: Decap, Dries; Reumers, Joke; Herzeel, Charlotte; Costanza, Pascal; Fostier, Jan
Format: Article
Language: eng
Subjects:
Online access: Full text
container_end_page 2488
container_issue 15
container_start_page 2482
container_title Bioinformatics
container_volume 31
creator Decap, Dries
Reumers, Joke
Herzeel, Charlotte
Costanza, Pascal
Fostier, Jan
description Post-sequencing DNA analysis typically consists of read mapping followed by variant calling. Especially for whole genome sequencing, this computational step is very time-consuming, even when using multithreading on a multi-core machine. We present Halvade, a framework that enables sequencing pipelines to be executed in parallel on a multi-node and/or multi-core compute infrastructure in a highly efficient manner. As an example, a DNA sequencing analysis pipeline for variant calling has been implemented according to the GATK Best Practices recommendations, supporting both whole genome and whole exome sequencing. Using a 15-node computer cluster with 360 CPU cores in total, Halvade processes the NA12878 dataset (human, 100 bp paired-end reads, 50× coverage) in
doi_str_mv 10.1093/bioinformatics/btv179
format Article
fullrecord <record><control><sourceid>proquest_pubme</sourceid><recordid>TN_cdi_pubmedcentral_primary_oai_pubmedcentral_nih_gov_4514927</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><sourcerecordid>1709189500</sourcerecordid><originalsourceid>FETCH-LOGICAL-c477t-e4ddd0670225c09475f87a615aa5eef91885be0365fbba06a2d6ee53798819b33</originalsourceid><addsrcrecordid>eNqNkc1Lw0AQxRdR_Kj-CUqOXmpnk-yXB0GKWqEiiJ6XyWaiK2lSs2ml_72prcWe9DQD83uPeTzGTjlccDDJIPO1r4q6mWDrXRhk7Zwrs8MOeSJVP9Wc7252SA7YUQjvACBAyH12EAvNDSh9yOIRlnPM6TIKDkvMSooCfcyochRhheUi-BB9-vYtesDpE-UzR8dsr8Ay0Ml69tjL7c3zcNQfP97dD6_HfZcq1fYpzfMcpII4Fg5MqkShFUouEAVRYbjWIiNIpCiyDEFinEsikSiju-eyJOmxq5XvdJZNKHdUtQ2Wdtr4CTYLW6O325fKv9nXem5TwVMTq87gfG3Q1F2k0NqJD47KEiuqZ8FyxbXpWC7_gYJZwgB_o9KYGGL97SpWqGvqEBoqNs9zsMsW7XaLdtVipzv7nXyj-qkt-QIaK53w</addsrcrecordid><sourcetype>Open Access Repository</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>1699202816</pqid></control><display><type>article</type><title>Halvade: scalable sequence analysis with MapReduce</title><source>Oxford Journals Open Access Collection</source><source>MEDLINE</source><source>PMC (PubMed Central)</source><source>EZB-FREE-00999 freely available EZB journals</source><source>Alma/SFX Local Collection</source><creator>Decap, Dries ; Reumers, Joke ; Herzeel, Charlotte ; Costanza, Pascal ; Fostier, Jan</creator><creatorcontrib>Decap, Dries ; Reumers, Joke ; Herzeel, Charlotte ; Costanza, Pascal ; Fostier, Jan</creatorcontrib><description>Post-sequencing DNA analysis typically consists of read mapping followed by variant calling. Especially for whole genome sequencing, this computational step is very time-consuming, even when using multithreading on a multi-core machine. We present Halvade, a framework that enables sequencing pipelines to be executed in parallel on a multi-node and/or multi-core compute infrastructure in a highly efficient manner. 
As an example, a DNA sequencing analysis pipeline for variant calling has been implemented according to the GATK Best Practices recommendations, supporting both whole genome and whole exome sequencing. Using a 15-node computer cluster with 360 CPU cores in total, Halvade processes the NA12878 dataset (human, 100 bp paired-end reads, 50× coverage) in &lt;3 h with very high parallel efficiency. Even on a single, multi-core machine, Halvade attains a significant speedup compared with running the individual tools with multithreading.</description><identifier>ISSN: 1367-4803</identifier><identifier>EISSN: 1367-4811</identifier><identifier>EISSN: 1460-2059</identifier><identifier>DOI: 10.1093/bioinformatics/btv179</identifier><identifier>PMID: 25819078</identifier><language>eng</language><publisher>England: Oxford University Press</publisher><subject>API ; Bioinformatics ; Clusters ; Gene sequencing ; Genome, Human ; Genomes ; Human ; Humans ; Original Papers ; Pipelining (computers) ; Running ; Sequence Analysis, DNA - methods ; Software</subject><ispartof>Bioinformatics, 2015-08, Vol.31 (15), p.2482-2488</ispartof><rights>The Author 2015. Published by Oxford University Press.</rights><rights>The Author 2015. Published by Oxford University Press. 
2015</rights><lds50>peer_reviewed</lds50><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed><citedby>FETCH-LOGICAL-c477t-e4ddd0670225c09475f87a615aa5eef91885be0365fbba06a2d6ee53798819b33</citedby><cites>FETCH-LOGICAL-c477t-e4ddd0670225c09475f87a615aa5eef91885be0365fbba06a2d6ee53798819b33</cites></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><linktopdf>$$Uhttps://www.ncbi.nlm.nih.gov/pmc/articles/PMC4514927/pdf/$$EPDF$$P50$$Gpubmedcentral$$Hfree_for_read</linktopdf><linktohtml>$$Uhttps://www.ncbi.nlm.nih.gov/pmc/articles/PMC4514927/$$EHTML$$P50$$Gpubmedcentral$$Hfree_for_read</linktohtml><link.rule.ids>230,314,723,776,780,881,27901,27902,53766,53768</link.rule.ids><backlink>$$Uhttps://www.ncbi.nlm.nih.gov/pubmed/25819078$$D View this record in MEDLINE/PubMed$$Hfree_for_read</backlink></links><search><creatorcontrib>Decap, Dries</creatorcontrib><creatorcontrib>Reumers, Joke</creatorcontrib><creatorcontrib>Herzeel, Charlotte</creatorcontrib><creatorcontrib>Costanza, Pascal</creatorcontrib><creatorcontrib>Fostier, Jan</creatorcontrib><title>Halvade: scalable sequence analysis with MapReduce</title><title>Bioinformatics</title><addtitle>Bioinformatics</addtitle><description>Post-sequencing DNA analysis typically consists of read mapping followed by variant calling. Especially for whole genome sequencing, this computational step is very time-consuming, even when using multithreading on a multi-core machine. We present Halvade, a framework that enables sequencing pipelines to be executed in parallel on a multi-node and/or multi-core compute infrastructure in a highly efficient manner. As an example, a DNA sequencing analysis pipeline for variant calling has been implemented according to the GATK Best Practices recommendations, supporting both whole genome and whole exome sequencing. 
Using a 15-node computer cluster with 360 CPU cores in total, Halvade processes the NA12878 dataset (human, 100 bp paired-end reads, 50× coverage) in &lt;3 h with very high parallel efficiency. Even on a single, multi-core machine, Halvade attains a significant speedup compared with running the individual tools with multithreading.</description><subject>API</subject><subject>Bioinformatics</subject><subject>Clusters</subject><subject>Gene sequencing</subject><subject>Genome, Human</subject><subject>Genomes</subject><subject>Human</subject><subject>Humans</subject><subject>Original Papers</subject><subject>Pipelining (computers)</subject><subject>Running</subject><subject>Sequence Analysis, DNA - methods</subject><subject>Software</subject><issn>1367-4803</issn><issn>1367-4811</issn><issn>1460-2059</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2015</creationdate><recordtype>article</recordtype><sourceid>EIF</sourceid><recordid>eNqNkc1Lw0AQxRdR_Kj-CUqOXmpnk-yXB0GKWqEiiJ6XyWaiK2lSs2ml_72prcWe9DQD83uPeTzGTjlccDDJIPO1r4q6mWDrXRhk7Zwrs8MOeSJVP9Wc7252SA7YUQjvACBAyH12EAvNDSh9yOIRlnPM6TIKDkvMSooCfcyochRhheUi-BB9-vYtesDpE-UzR8dsr8Ay0Ml69tjL7c3zcNQfP97dD6_HfZcq1fYpzfMcpII4Fg5MqkShFUouEAVRYbjWIiNIpCiyDEFinEsikSiju-eyJOmxq5XvdJZNKHdUtQ2Wdtr4CTYLW6O325fKv9nXem5TwVMTq87gfG3Q1F2k0NqJD47KEiuqZ8FyxbXpWC7_gYJZwgB_o9KYGGL97SpWqGvqEBoqNs9zsMsW7XaLdtVipzv7nXyj-qkt-QIaK53w</recordid><startdate>20150801</startdate><enddate>20150801</enddate><creator>Decap, Dries</creator><creator>Reumers, Joke</creator><creator>Herzeel, Charlotte</creator><creator>Costanza, Pascal</creator><creator>Fostier, Jan</creator><general>Oxford University 
Press</general><scope>CGR</scope><scope>CUY</scope><scope>CVF</scope><scope>ECM</scope><scope>EIF</scope><scope>NPM</scope><scope>AAYXX</scope><scope>CITATION</scope><scope>7X8</scope><scope>7QO</scope><scope>7TM</scope><scope>8FD</scope><scope>FR3</scope><scope>P64</scope><scope>7SC</scope><scope>JQ2</scope><scope>L7M</scope><scope>L~C</scope><scope>L~D</scope><scope>5PM</scope></search><sort><creationdate>20150801</creationdate><title>Halvade: scalable sequence analysis with MapReduce</title><author>Decap, Dries ; Reumers, Joke ; Herzeel, Charlotte ; Costanza, Pascal ; Fostier, Jan</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-c477t-e4ddd0670225c09475f87a615aa5eef91885be0365fbba06a2d6ee53798819b33</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2015</creationdate><topic>API</topic><topic>Bioinformatics</topic><topic>Clusters</topic><topic>Gene sequencing</topic><topic>Genome, Human</topic><topic>Genomes</topic><topic>Human</topic><topic>Humans</topic><topic>Original Papers</topic><topic>Pipelining (computers)</topic><topic>Running</topic><topic>Sequence Analysis, DNA - methods</topic><topic>Software</topic><toplevel>peer_reviewed</toplevel><toplevel>online_resources</toplevel><creatorcontrib>Decap, Dries</creatorcontrib><creatorcontrib>Reumers, Joke</creatorcontrib><creatorcontrib>Herzeel, Charlotte</creatorcontrib><creatorcontrib>Costanza, Pascal</creatorcontrib><creatorcontrib>Fostier, Jan</creatorcontrib><collection>Medline</collection><collection>MEDLINE</collection><collection>MEDLINE (Ovid)</collection><collection>MEDLINE</collection><collection>MEDLINE</collection><collection>PubMed</collection><collection>CrossRef</collection><collection>MEDLINE - Academic</collection><collection>Biotechnology Research Abstracts</collection><collection>Nucleic Acids Abstracts</collection><collection>Technology Research Database</collection><collection>Engineering Research 
Database</collection><collection>Biotechnology and BioEngineering Abstracts</collection><collection>Computer and Information Systems Abstracts</collection><collection>ProQuest Computer Science Collection</collection><collection>Advanced Technologies Database with Aerospace</collection><collection>Computer and Information Systems Abstracts – Academic</collection><collection>Computer and Information Systems Abstracts Professional</collection><collection>PubMed Central (Full Participant titles)</collection><jtitle>Bioinformatics</jtitle></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>Decap, Dries</au><au>Reumers, Joke</au><au>Herzeel, Charlotte</au><au>Costanza, Pascal</au><au>Fostier, Jan</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>Halvade: scalable sequence analysis with MapReduce</atitle><jtitle>Bioinformatics</jtitle><addtitle>Bioinformatics</addtitle><date>2015-08-01</date><risdate>2015</risdate><volume>31</volume><issue>15</issue><spage>2482</spage><epage>2488</epage><pages>2482-2488</pages><issn>1367-4803</issn><eissn>1367-4811</eissn><eissn>1460-2059</eissn><abstract>Post-sequencing DNA analysis typically consists of read mapping followed by variant calling. Especially for whole genome sequencing, this computational step is very time-consuming, even when using multithreading on a multi-core machine. We present Halvade, a framework that enables sequencing pipelines to be executed in parallel on a multi-node and/or multi-core compute infrastructure in a highly efficient manner. As an example, a DNA sequencing analysis pipeline for variant calling has been implemented according to the GATK Best Practices recommendations, supporting both whole genome and whole exome sequencing. 
Using a 15-node computer cluster with 360 CPU cores in total, Halvade processes the NA12878 dataset (human, 100 bp paired-end reads, 50× coverage) in &lt;3 h with very high parallel efficiency. Even on a single, multi-core machine, Halvade attains a significant speedup compared with running the individual tools with multithreading.</abstract><cop>England</cop><pub>Oxford University Press</pub><pmid>25819078</pmid><doi>10.1093/bioinformatics/btv179</doi><tpages>7</tpages><oa>free_for_read</oa></addata></record>
fulltext fulltext
identifier ISSN: 1367-4803
ispartof Bioinformatics, 2015-08, Vol.31 (15), p.2482-2488
issn 1367-4803
1367-4811
1460-2059
language eng
recordid cdi_pubmedcentral_primary_oai_pubmedcentral_nih_gov_4514927
source Oxford Journals Open Access Collection; MEDLINE; PMC (PubMed Central); EZB-FREE-00999 freely available EZB journals; Alma/SFX Local Collection
subjects API
Bioinformatics
Clusters
Gene sequencing
Genome, Human
Genomes
Human
Humans
Original Papers
Pipelining (computers)
Running
Sequence Analysis, DNA - methods
Software
title Halvade: scalable sequence analysis with MapReduce
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-02-01T15%3A16%3A19IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_pubme&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Halvade:%20scalable%20sequence%20analysis%20with%20MapReduce&rft.jtitle=Bioinformatics&rft.au=Decap,%20Dries&rft.date=2015-08-01&rft.volume=31&rft.issue=15&rft.spage=2482&rft.epage=2488&rft.pages=2482-2488&rft.issn=1367-4803&rft.eissn=1367-4811&rft_id=info:doi/10.1093/bioinformatics/btv179&rft_dat=%3Cproquest_pubme%3E1709189500%3C/proquest_pubme%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=1699202816&rft_id=info:pmid/25819078&rfr_iscdi=true