Label-guided seed-chain-extend alignment on annotated De Bruijn graphs

Abstract Motivation Exponential growth in sequencing databases has motivated scalable De Bruijn graph-based (DBG) indexing for searching these data, using annotations to label nodes with sample IDs. Low-depth sequencing samples correspond to fragmented subgraphs, complicating finding the long contig...

Bibliographic details
Published in: Bioinformatics (Oxford, England), 2024-06, Vol.40 (Supplement_1), p.i337-i346
Main authors: Mustafa, Harun; Karasikov, Mikhail; Mansouri Ghiasi, Nika; Rätsch, Gunnar; Kahles, André
Format: Article
Language: eng
Online access: Full text
container_end_page i346
container_issue Supplement_1
container_start_page i337
container_title Bioinformatics (Oxford, England)
container_volume 40
creator Mustafa, Harun
Karasikov, Mikhail
Mansouri Ghiasi, Nika
Rätsch, Gunnar
Kahles, André
description Abstract. Motivation: Exponential growth in sequencing databases has motivated scalable De Bruijn graph-based (DBG) indexing for searching these data, using annotations to label nodes with sample IDs. Low-depth sequencing samples correspond to fragmented subgraphs, complicating finding the long contiguous walks required for alignment queries. Aligners that target single-labelled subgraphs reduce alignment lengths due to fragmentation, leading to low recall for long reads. While some (e.g. label-free) aligners partially overcome fragmentation by combining information from multiple samples, biologically irrelevant combinations in such approaches can inflate the search space or reduce accuracy. Results: We introduce a new scoring model, ‘multi-label alignment’ (MLA), for annotated DBGs. MLA leverages two new operations. To promote biologically relevant sample combinations, ‘Label Change’ incorporates more informative global sample similarity into local scores. To improve connectivity, ‘Node Length Change’ dynamically adjusts the DBG node length during traversal. Our fast, approximate, yet accurate MLA implementation has two key steps: a single-label seed-chain-extend aligner (SCA) and a multi-label chainer (MLC). SCA uses a traditional scoring model adapting recent chaining improvements to assembly graphs and provides a curated pool of alignments. MLC extracts seed anchors from SCA's alignments, produces multi-label chains using MLA scoring, then finally forms multi-label alignments. We show via substantial improvements in taxonomic classification accuracy that MLA produces biologically relevant alignments, decreasing average weighted UniFrac errors by 63.1%–66.8% and covering 45.5%–47.4% (median) more long-read query characters than state-of-the-art aligners. MLA's runtimes are competitive with label-combining alignment and substantially faster than single-label alignment.
Availability and implementation: The data, scripts, and instructions for generating our results are available at https://github.com/ratschlab/mla.
doi_str_mv 10.1093/bioinformatics/btae226
format Article
fullrecord <record><control><sourceid>proquest_cross</sourceid><recordid>TN_cdi_proquest_miscellaneous_3073234474</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><oup_id>10.1093/bioinformatics/btae226</oup_id><sourcerecordid>3124438802</sourcerecordid><originalsourceid>FETCH-LOGICAL-c306t-fadf757e6f05b2090b68988af73158cd39fba3fa00bbe2937935e93569c22e623</originalsourceid><addsrcrecordid>eNqNkLFOwzAQhi0EoqXwClUkFpbQs504yQiFAlIlFpgjOzm3rhK72IkEb0-qlkowMZzuhu__dfoImVK4pVDwmTLOWO18KztThZnqJDImTsiYcpHFSU7p6fEGPiIXIWwAIIVUnJMRz4sEqEjGZLGUCpt41Zsa6ygg1nG1lsbG-NmhrSPZmJVt0XaRs5G01nWyG8AHjO59bzY2Wnm5XYdLcqZlE_DqsCfkffH4Nn-Ol69PL_O7ZVxxEF2sZa2zNEOhIVUMClAiL_Jc6ozTNK9qXmgluZYASiEreFbwFIcRRcUYCsYn5Gbfu_Xuo8fQla0JFTaNtOj6UHLIOONJkiUDev0H3bje2-G7klOWJDzPYVco9lTlXQgedbn1ppX-q6RQ7kyXv02XB9NDcHqo71WL9TH2o3YA6B5w_fa_pd9L3ZAC</addsrcrecordid><sourcetype>Aggregation Database</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>3124438802</pqid></control><display><type>article</type><title>Label-guided seed-chain-extend alignment on annotated De Bruijn graphs</title><source>MEDLINE</source><source>DOAJ Directory of Open Access Journals</source><source>Oxford Journals Open Access Collection</source><source>EZB-FREE-00999 freely available EZB journals</source><source>PubMed Central</source><source>Alma/SFX Local Collection</source><creator>Mustafa, Harun ; Karasikov, Mikhail ; Mansouri Ghiasi, Nika ; Rätsch, Gunnar ; Kahles, André</creator><creatorcontrib>Mustafa, Harun ; Karasikov, Mikhail ; Mansouri Ghiasi, Nika ; Rätsch, Gunnar ; Kahles, André</creatorcontrib><description>Abstract Motivation Exponential growth in sequencing databases has motivated scalable De Bruijn graph-based (DBG) indexing for searching these data, using annotations to label nodes with sample IDs. Low-depth sequencing samples correspond to fragmented subgraphs, complicating finding the long contiguous walks required for alignment queries. 
Aligners that target single-labelled subgraphs reduce alignment lengths due to fragmentation, leading to low recall for long reads. While some (e.g. label-free) aligners partially overcome fragmentation by combining information from multiple samples, biologically irrelevant combinations in such approaches can inflate the search space or reduce accuracy. Results We introduce a new scoring model, ‘multi-label alignment’ (MLA), for annotated DBGs. MLA leverages two new operations: To promote biologically relevant sample combinations, ‘Label Change’ incorporates more informative global sample similarity into local scores. To improve connectivity, ‘Node Length Change’ dynamically adjusts the DBG node length during traversal. Our fast, approximate, yet accurate MLA implementation has two key steps: a single-label seed-chain-extend aligner (SCA) and a multi-label chainer (MLC). SCA uses a traditional scoring model adapting recent chaining improvements to assembly graphs and provides a curated pool of alignments. MLC extracts seed anchors from SCAs alignments, produces multi-label chains using MLA scoring, then finally forms multi-label alignments. We show via substantial improvements in taxonomic classification accuracy that MLA produces biologically relevant alignments, decreasing average weighted UniFrac errors by 63.1%–66.8% and covering 45.5%–47.4% (median) more long-read query characters than state-of-the-art aligners. MLAs runtimes are competitive with label-combining alignment and substantially faster than single-label alignment. 
Availability and implementation The data, scripts, and instructions for generating our results are available at https://github.com/ratschlab/mla.</description><identifier>ISSN: 1367-4803</identifier><identifier>ISSN: 1367-4811</identifier><identifier>EISSN: 1367-4811</identifier><identifier>DOI: 10.1093/bioinformatics/btae226</identifier><identifier>PMID: 38940164</identifier><language>eng</language><publisher>England: Oxford University Press</publisher><subject>Algorithms ; Alignment ; Annotations ; Availability ; Biological effects ; Chains ; Computational Biology - methods ; Databases, Genetic ; Fragmentation ; Graph theory ; Graphs ; Labels ; Scoring models ; Sequence Alignment - methods ; Sequence Analysis, DNA - methods ; Software</subject><ispartof>Bioinformatics (Oxford, England), 2024-06, Vol.40 (Supplement_1), p.i337-i346</ispartof><rights>The Author(s) 2024. Published by Oxford University Press. 2024</rights><rights>The Author(s) 2024. Published by Oxford University Press.</rights><lds50>peer_reviewed</lds50><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed><cites>FETCH-LOGICAL-c306t-fadf757e6f05b2090b68988af73158cd39fba3fa00bbe2937935e93569c22e623</cites><orcidid>0000-0002-3411-0692 ; 0000-0002-2125-6086 ; 0000-0001-6200-5972 ; 0000-0002-0833-0042</orcidid></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><link.rule.ids>314,780,784,864,1603,27922,27923</link.rule.ids><backlink>$$Uhttps://www.ncbi.nlm.nih.gov/pubmed/38940164$$D View this record in MEDLINE/PubMed$$Hfree_for_read</backlink></links><search><creatorcontrib>Mustafa, Harun</creatorcontrib><creatorcontrib>Karasikov, Mikhail</creatorcontrib><creatorcontrib>Mansouri Ghiasi, Nika</creatorcontrib><creatorcontrib>Rätsch, Gunnar</creatorcontrib><creatorcontrib>Kahles, André</creatorcontrib><title>Label-guided seed-chain-extend alignment on annotated De Bruijn 
graphs</title><title>Bioinformatics (Oxford, England)</title><addtitle>Bioinformatics</addtitle><description>Abstract Motivation Exponential growth in sequencing databases has motivated scalable De Bruijn graph-based (DBG) indexing for searching these data, using annotations to label nodes with sample IDs. Low-depth sequencing samples correspond to fragmented subgraphs, complicating finding the long contiguous walks required for alignment queries. Aligners that target single-labelled subgraphs reduce alignment lengths due to fragmentation, leading to low recall for long reads. While some (e.g. label-free) aligners partially overcome fragmentation by combining information from multiple samples, biologically irrelevant combinations in such approaches can inflate the search space or reduce accuracy. Results We introduce a new scoring model, ‘multi-label alignment’ (MLA), for annotated DBGs. MLA leverages two new operations: To promote biologically relevant sample combinations, ‘Label Change’ incorporates more informative global sample similarity into local scores. To improve connectivity, ‘Node Length Change’ dynamically adjusts the DBG node length during traversal. Our fast, approximate, yet accurate MLA implementation has two key steps: a single-label seed-chain-extend aligner (SCA) and a multi-label chainer (MLC). SCA uses a traditional scoring model adapting recent chaining improvements to assembly graphs and provides a curated pool of alignments. MLC extracts seed anchors from SCAs alignments, produces multi-label chains using MLA scoring, then finally forms multi-label alignments. We show via substantial improvements in taxonomic classification accuracy that MLA produces biologically relevant alignments, decreasing average weighted UniFrac errors by 63.1%–66.8% and covering 45.5%–47.4% (median) more long-read query characters than state-of-the-art aligners. 
MLAs runtimes are competitive with label-combining alignment and substantially faster than single-label alignment. Availability and implementation The data, scripts, and instructions for generating our results are available at https://github.com/ratschlab/mla.</description><subject>Algorithms</subject><subject>Alignment</subject><subject>Annotations</subject><subject>Availability</subject><subject>Biological effects</subject><subject>Chains</subject><subject>Computational Biology - methods</subject><subject>Databases, Genetic</subject><subject>Fragmentation</subject><subject>Graph theory</subject><subject>Graphs</subject><subject>Labels</subject><subject>Scoring models</subject><subject>Sequence Alignment - methods</subject><subject>Sequence Analysis, DNA - methods</subject><subject>Software</subject><issn>1367-4803</issn><issn>1367-4811</issn><issn>1367-4811</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2024</creationdate><recordtype>article</recordtype><sourceid>TOX</sourceid><sourceid>EIF</sourceid><recordid>eNqNkLFOwzAQhi0EoqXwClUkFpbQs504yQiFAlIlFpgjOzm3rhK72IkEb0-qlkowMZzuhu__dfoImVK4pVDwmTLOWO18KztThZnqJDImTsiYcpHFSU7p6fEGPiIXIWwAIIVUnJMRz4sEqEjGZLGUCpt41Zsa6ygg1nG1lsbG-NmhrSPZmJVt0XaRs5G01nWyG8AHjO59bzY2Wnm5XYdLcqZlE_DqsCfkffH4Nn-Ol69PL_O7ZVxxEF2sZa2zNEOhIVUMClAiL_Jc6ozTNK9qXmgluZYASiEreFbwFIcRRcUYCsYn5Gbfu_Xuo8fQla0JFTaNtOj6UHLIOONJkiUDev0H3bje2-G7klOWJDzPYVco9lTlXQgedbn1ppX-q6RQ7kyXv02XB9NDcHqo71WL9TH2o3YA6B5w_fa_pd9L3ZAC</recordid><startdate>20240628</startdate><enddate>20240628</enddate><creator>Mustafa, Harun</creator><creator>Karasikov, Mikhail</creator><creator>Mansouri Ghiasi, Nika</creator><creator>Rätsch, Gunnar</creator><creator>Kahles, André</creator><general>Oxford University Press</general><general>Oxford Publishing Limited 
(England)</general><scope>TOX</scope><scope>CGR</scope><scope>CUY</scope><scope>CVF</scope><scope>ECM</scope><scope>EIF</scope><scope>NPM</scope><scope>AAYXX</scope><scope>CITATION</scope><scope>7QF</scope><scope>7QO</scope><scope>7QQ</scope><scope>7SC</scope><scope>7SE</scope><scope>7SP</scope><scope>7SR</scope><scope>7TA</scope><scope>7TB</scope><scope>7TM</scope><scope>7TO</scope><scope>7U5</scope><scope>8BQ</scope><scope>8FD</scope><scope>F28</scope><scope>FR3</scope><scope>H8D</scope><scope>H8G</scope><scope>H94</scope><scope>JG9</scope><scope>JQ2</scope><scope>K9.</scope><scope>KR7</scope><scope>L7M</scope><scope>L~C</scope><scope>L~D</scope><scope>P64</scope><scope>7X8</scope><orcidid>https://orcid.org/0000-0002-3411-0692</orcidid><orcidid>https://orcid.org/0000-0002-2125-6086</orcidid><orcidid>https://orcid.org/0000-0001-6200-5972</orcidid><orcidid>https://orcid.org/0000-0002-0833-0042</orcidid></search><sort><creationdate>20240628</creationdate><title>Label-guided seed-chain-extend alignment on annotated De Bruijn graphs</title><author>Mustafa, Harun ; Karasikov, Mikhail ; Mansouri Ghiasi, Nika ; Rätsch, Gunnar ; Kahles, André</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-c306t-fadf757e6f05b2090b68988af73158cd39fba3fa00bbe2937935e93569c22e623</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2024</creationdate><topic>Algorithms</topic><topic>Alignment</topic><topic>Annotations</topic><topic>Availability</topic><topic>Biological effects</topic><topic>Chains</topic><topic>Computational Biology - methods</topic><topic>Databases, Genetic</topic><topic>Fragmentation</topic><topic>Graph theory</topic><topic>Graphs</topic><topic>Labels</topic><topic>Scoring models</topic><topic>Sequence Alignment - methods</topic><topic>Sequence Analysis, DNA - 
methods</topic><topic>Software</topic><toplevel>peer_reviewed</toplevel><toplevel>online_resources</toplevel><creatorcontrib>Mustafa, Harun</creatorcontrib><creatorcontrib>Karasikov, Mikhail</creatorcontrib><creatorcontrib>Mansouri Ghiasi, Nika</creatorcontrib><creatorcontrib>Rätsch, Gunnar</creatorcontrib><creatorcontrib>Kahles, André</creatorcontrib><collection>Oxford Journals Open Access Collection</collection><collection>Medline</collection><collection>MEDLINE</collection><collection>MEDLINE (Ovid)</collection><collection>MEDLINE</collection><collection>MEDLINE</collection><collection>PubMed</collection><collection>CrossRef</collection><collection>Aluminium Industry Abstracts</collection><collection>Biotechnology Research Abstracts</collection><collection>Ceramic Abstracts</collection><collection>Computer and Information Systems Abstracts</collection><collection>Corrosion Abstracts</collection><collection>Electronics &amp; Communications Abstracts</collection><collection>Engineered Materials Abstracts</collection><collection>Materials Business File</collection><collection>Mechanical &amp; Transportation Engineering Abstracts</collection><collection>Nucleic Acids Abstracts</collection><collection>Oncogenes and Growth Factors Abstracts</collection><collection>Solid State and Superconductivity Abstracts</collection><collection>METADEX</collection><collection>Technology Research Database</collection><collection>ANTE: Abstracts in New Technology &amp; Engineering</collection><collection>Engineering Research Database</collection><collection>Aerospace Database</collection><collection>Copper Technical Reference Library</collection><collection>AIDS and Cancer Research Abstracts</collection><collection>Materials Research Database</collection><collection>ProQuest Computer Science Collection</collection><collection>ProQuest Health &amp; Medical Complete (Alumni)</collection><collection>Civil Engineering Abstracts</collection><collection>Advanced Technologies Database with 
Aerospace</collection><collection>Computer and Information Systems Abstracts – Academic</collection><collection>Computer and Information Systems Abstracts Professional</collection><collection>Biotechnology and BioEngineering Abstracts</collection><collection>MEDLINE - Academic</collection><jtitle>Bioinformatics (Oxford, England)</jtitle></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>Mustafa, Harun</au><au>Karasikov, Mikhail</au><au>Mansouri Ghiasi, Nika</au><au>Rätsch, Gunnar</au><au>Kahles, André</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>Label-guided seed-chain-extend alignment on annotated De Bruijn graphs</atitle><jtitle>Bioinformatics (Oxford, England)</jtitle><addtitle>Bioinformatics</addtitle><date>2024-06-28</date><risdate>2024</risdate><volume>40</volume><issue>Supplement_1</issue><spage>i337</spage><epage>i346</epage><pages>i337-i346</pages><issn>1367-4803</issn><issn>1367-4811</issn><eissn>1367-4811</eissn><abstract>Abstract Motivation Exponential growth in sequencing databases has motivated scalable De Bruijn graph-based (DBG) indexing for searching these data, using annotations to label nodes with sample IDs. Low-depth sequencing samples correspond to fragmented subgraphs, complicating finding the long contiguous walks required for alignment queries. Aligners that target single-labelled subgraphs reduce alignment lengths due to fragmentation, leading to low recall for long reads. While some (e.g. label-free) aligners partially overcome fragmentation by combining information from multiple samples, biologically irrelevant combinations in such approaches can inflate the search space or reduce accuracy. Results We introduce a new scoring model, ‘multi-label alignment’ (MLA), for annotated DBGs. 
MLA leverages two new operations: To promote biologically relevant sample combinations, ‘Label Change’ incorporates more informative global sample similarity into local scores. To improve connectivity, ‘Node Length Change’ dynamically adjusts the DBG node length during traversal. Our fast, approximate, yet accurate MLA implementation has two key steps: a single-label seed-chain-extend aligner (SCA) and a multi-label chainer (MLC). SCA uses a traditional scoring model adapting recent chaining improvements to assembly graphs and provides a curated pool of alignments. MLC extracts seed anchors from SCAs alignments, produces multi-label chains using MLA scoring, then finally forms multi-label alignments. We show via substantial improvements in taxonomic classification accuracy that MLA produces biologically relevant alignments, decreasing average weighted UniFrac errors by 63.1%–66.8% and covering 45.5%–47.4% (median) more long-read query characters than state-of-the-art aligners. MLAs runtimes are competitive with label-combining alignment and substantially faster than single-label alignment. Availability and implementation The data, scripts, and instructions for generating our results are available at https://github.com/ratschlab/mla.</abstract><cop>England</cop><pub>Oxford University Press</pub><pmid>38940164</pmid><doi>10.1093/bioinformatics/btae226</doi><orcidid>https://orcid.org/0000-0002-3411-0692</orcidid><orcidid>https://orcid.org/0000-0002-2125-6086</orcidid><orcidid>https://orcid.org/0000-0001-6200-5972</orcidid><orcidid>https://orcid.org/0000-0002-0833-0042</orcidid><oa>free_for_read</oa></addata></record>
fulltext fulltext
identifier ISSN: 1367-4803
ispartof Bioinformatics (Oxford, England), 2024-06, Vol.40 (Supplement_1), p.i337-i346
issn 1367-4803
1367-4811
1367-4811
language eng
recordid cdi_proquest_miscellaneous_3073234474
source MEDLINE; DOAJ Directory of Open Access Journals; Oxford Journals Open Access Collection; EZB-FREE-00999 freely available EZB journals; PubMed Central; Alma/SFX Local Collection
subjects Algorithms
Alignment
Annotations
Availability
Biological effects
Chains
Computational Biology - methods
Databases, Genetic
Fragmentation
Graph theory
Graphs
Labels
Scoring models
Sequence Alignment - methods
Sequence Analysis, DNA - methods
Software
title Label-guided seed-chain-extend alignment on annotated De Bruijn graphs
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-13T17%3A33%3A59IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_cross&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Label-guided%20seed-chain-extend%20alignment%20on%20annotated%20De%20Bruijn%20graphs&rft.jtitle=Bioinformatics%20(Oxford,%20England)&rft.au=Mustafa,%20Harun&rft.date=2024-06-28&rft.volume=40&rft.issue=Supplement_1&rft.spage=i337&rft.epage=i346&rft.pages=i337-i346&rft.issn=1367-4803&rft.eissn=1367-4811&rft_id=info:doi/10.1093/bioinformatics/btae226&rft_dat=%3Cproquest_cross%3E3124438802%3C/proquest_cross%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=3124438802&rft_id=info:pmid/38940164&rft_oup_id=10.1093/bioinformatics/btae226&rfr_iscdi=true