Label-guided seed-chain-extend alignment on annotated De Bruijn graphs
Abstract Motivation Exponential growth in sequencing databases has motivated scalable De Bruijn graph-based (DBG) indexing for searching these data, using annotations to label nodes with sample IDs. Low-depth sequencing samples correspond to fragmented subgraphs, complicating finding the long contig...
Published in: | Bioinformatics (Oxford, England), 2024-06, Vol.40 (Supplement_1), p.i337-i346 |
---|---|
Main authors: | Mustafa, Harun ; Karasikov, Mikhail ; Mansouri Ghiasi, Nika ; Rätsch, Gunnar ; Kahles, André |
Format: | Article |
Language: | eng |
Subjects: | |
Online access: | Full text |
container_end_page | i346 |
---|---|
container_issue | Supplement_1 |
container_start_page | i337 |
container_title | Bioinformatics (Oxford, England) |
container_volume | 40 |
creator | Mustafa, Harun ; Karasikov, Mikhail ; Mansouri Ghiasi, Nika ; Rätsch, Gunnar ; Kahles, André |
description | Abstract
Motivation
Exponential growth in sequencing databases has motivated scalable De Bruijn graph-based (DBG) indexing for searching these data, using annotations to label nodes with sample IDs. Low-depth sequencing samples correspond to fragmented subgraphs, complicating finding the long contiguous walks required for alignment queries. Aligners that target single-labelled subgraphs reduce alignment lengths due to fragmentation, leading to low recall for long reads. While some (e.g. label-free) aligners partially overcome fragmentation by combining information from multiple samples, biologically irrelevant combinations in such approaches can inflate the search space or reduce accuracy.
Results
We introduce a new scoring model, ‘multi-label alignment’ (MLA), for annotated DBGs. MLA leverages two new operations: To promote biologically relevant sample combinations, ‘Label Change’ incorporates more informative global sample similarity into local scores. To improve connectivity, ‘Node Length Change’ dynamically adjusts the DBG node length during traversal. Our fast, approximate, yet accurate MLA implementation has two key steps: a single-label seed-chain-extend aligner (SCA) and a multi-label chainer (MLC). SCA uses a traditional scoring model adapting recent chaining improvements to assembly graphs and provides a curated pool of alignments. MLC extracts seed anchors from SCA’s alignments, produces multi-label chains using MLA scoring, then finally forms multi-label alignments. We show via substantial improvements in taxonomic classification accuracy that MLA produces biologically relevant alignments, decreasing average weighted UniFrac errors by 63.1%–66.8% and covering 45.5%–47.4% (median) more long-read query characters than state-of-the-art aligners. MLA’s runtimes are competitive with label-combining alignment and substantially faster than single-label alignment.
Availability and implementation
The data, scripts, and instructions for generating our results are available at https://github.com/ratschlab/mla. |
doi_str_mv | 10.1093/bioinformatics/btae226 |
format | Article |
fullrecord | <record><control><sourceid>proquest_cross</sourceid><recordid>TN_cdi_proquest_miscellaneous_3073234474</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><oup_id>10.1093/bioinformatics/btae226</oup_id><sourcerecordid>3124438802</sourcerecordid><originalsourceid>FETCH-LOGICAL-c306t-fadf757e6f05b2090b68988af73158cd39fba3fa00bbe2937935e93569c22e623</originalsourceid><addsrcrecordid>eNqNkLFOwzAQhi0EoqXwClUkFpbQs504yQiFAlIlFpgjOzm3rhK72IkEb0-qlkowMZzuhu__dfoImVK4pVDwmTLOWO18KztThZnqJDImTsiYcpHFSU7p6fEGPiIXIWwAIIVUnJMRz4sEqEjGZLGUCpt41Zsa6ygg1nG1lsbG-NmhrSPZmJVt0XaRs5G01nWyG8AHjO59bzY2Wnm5XYdLcqZlE_DqsCfkffH4Nn-Ol69PL_O7ZVxxEF2sZa2zNEOhIVUMClAiL_Jc6ozTNK9qXmgluZYASiEreFbwFIcRRcUYCsYn5Gbfu_Xuo8fQla0JFTaNtOj6UHLIOONJkiUDev0H3bje2-G7klOWJDzPYVco9lTlXQgedbn1ppX-q6RQ7kyXv02XB9NDcHqo71WL9TH2o3YA6B5w_fa_pd9L3ZAC</addsrcrecordid><sourcetype>Aggregation Database</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>3124438802</pqid></control><display><type>article</type><title>Label-guided seed-chain-extend alignment on annotated De Bruijn graphs</title><source>MEDLINE</source><source>DOAJ Directory of Open Access Journals</source><source>Oxford Journals Open Access Collection</source><source>EZB-FREE-00999 freely available EZB journals</source><source>PubMed Central</source><source>Alma/SFX Local Collection</source><creator>Mustafa, Harun ; Karasikov, Mikhail ; Mansouri Ghiasi, Nika ; Rätsch, Gunnar ; Kahles, André</creator><creatorcontrib>Mustafa, Harun ; Karasikov, Mikhail ; Mansouri Ghiasi, Nika ; Rätsch, Gunnar ; Kahles, André</creatorcontrib><description>Abstract
Motivation
Exponential growth in sequencing databases has motivated scalable De Bruijn graph-based (DBG) indexing for searching these data, using annotations to label nodes with sample IDs. Low-depth sequencing samples correspond to fragmented subgraphs, complicating finding the long contiguous walks required for alignment queries. Aligners that target single-labelled subgraphs reduce alignment lengths due to fragmentation, leading to low recall for long reads. While some (e.g. label-free) aligners partially overcome fragmentation by combining information from multiple samples, biologically irrelevant combinations in such approaches can inflate the search space or reduce accuracy.
Results
We introduce a new scoring model, ‘multi-label alignment’ (MLA), for annotated DBGs. MLA leverages two new operations: To promote biologically relevant sample combinations, ‘Label Change’ incorporates more informative global sample similarity into local scores. To improve connectivity, ‘Node Length Change’ dynamically adjusts the DBG node length during traversal. Our fast, approximate, yet accurate MLA implementation has two key steps: a single-label seed-chain-extend aligner (SCA) and a multi-label chainer (MLC). SCA uses a traditional scoring model adapting recent chaining improvements to assembly graphs and provides a curated pool of alignments. MLC extracts seed anchors from SCA’s alignments, produces multi-label chains using MLA scoring, then finally forms multi-label alignments. We show via substantial improvements in taxonomic classification accuracy that MLA produces biologically relevant alignments, decreasing average weighted UniFrac errors by 63.1%–66.8% and covering 45.5%–47.4% (median) more long-read query characters than state-of-the-art aligners. MLA’s runtimes are competitive with label-combining alignment and substantially faster than single-label alignment.
Availability and implementation
The data, scripts, and instructions for generating our results are available at https://github.com/ratschlab/mla.</description><identifier>ISSN: 1367-4803</identifier><identifier>ISSN: 1367-4811</identifier><identifier>EISSN: 1367-4811</identifier><identifier>DOI: 10.1093/bioinformatics/btae226</identifier><identifier>PMID: 38940164</identifier><language>eng</language><publisher>England: Oxford University Press</publisher><subject>Algorithms ; Alignment ; Annotations ; Availability ; Biological effects ; Chains ; Computational Biology - methods ; Databases, Genetic ; Fragmentation ; Graph theory ; Graphs ; Labels ; Scoring models ; Sequence Alignment - methods ; Sequence Analysis, DNA - methods ; Software</subject><ispartof>Bioinformatics (Oxford, England), 2024-06, Vol.40 (Supplement_1), p.i337-i346</ispartof><rights>The Author(s) 2024. Published by Oxford University Press. 2024</rights><rights>The Author(s) 2024. Published by Oxford University Press.</rights><lds50>peer_reviewed</lds50><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed><cites>FETCH-LOGICAL-c306t-fadf757e6f05b2090b68988af73158cd39fba3fa00bbe2937935e93569c22e623</cites><orcidid>0000-0002-3411-0692 ; 0000-0002-2125-6086 ; 0000-0001-6200-5972 ; 0000-0002-0833-0042</orcidid></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><link.rule.ids>314,780,784,864,1603,27922,27923</link.rule.ids><backlink>$$Uhttps://www.ncbi.nlm.nih.gov/pubmed/38940164$$D View this record in MEDLINE/PubMed$$Hfree_for_read</backlink></links><search><creatorcontrib>Mustafa, Harun</creatorcontrib><creatorcontrib>Karasikov, Mikhail</creatorcontrib><creatorcontrib>Mansouri Ghiasi, Nika</creatorcontrib><creatorcontrib>Rätsch, Gunnar</creatorcontrib><creatorcontrib>Kahles, André</creatorcontrib><title>Label-guided seed-chain-extend alignment on annotated De Bruijn 
graphs</title><title>Bioinformatics (Oxford, England)</title><addtitle>Bioinformatics</addtitle><description>Abstract
Motivation
Exponential growth in sequencing databases has motivated scalable De Bruijn graph-based (DBG) indexing for searching these data, using annotations to label nodes with sample IDs. Low-depth sequencing samples correspond to fragmented subgraphs, complicating finding the long contiguous walks required for alignment queries. Aligners that target single-labelled subgraphs reduce alignment lengths due to fragmentation, leading to low recall for long reads. While some (e.g. label-free) aligners partially overcome fragmentation by combining information from multiple samples, biologically irrelevant combinations in such approaches can inflate the search space or reduce accuracy.
Results
We introduce a new scoring model, ‘multi-label alignment’ (MLA), for annotated DBGs. MLA leverages two new operations: To promote biologically relevant sample combinations, ‘Label Change’ incorporates more informative global sample similarity into local scores. To improve connectivity, ‘Node Length Change’ dynamically adjusts the DBG node length during traversal. Our fast, approximate, yet accurate MLA implementation has two key steps: a single-label seed-chain-extend aligner (SCA) and a multi-label chainer (MLC). SCA uses a traditional scoring model adapting recent chaining improvements to assembly graphs and provides a curated pool of alignments. MLC extracts seed anchors from SCA’s alignments, produces multi-label chains using MLA scoring, then finally forms multi-label alignments. We show via substantial improvements in taxonomic classification accuracy that MLA produces biologically relevant alignments, decreasing average weighted UniFrac errors by 63.1%–66.8% and covering 45.5%–47.4% (median) more long-read query characters than state-of-the-art aligners. MLA’s runtimes are competitive with label-combining alignment and substantially faster than single-label alignment.
Availability and implementation
The data, scripts, and instructions for generating our results are available at https://github.com/ratschlab/mla.</description><subject>Algorithms</subject><subject>Alignment</subject><subject>Annotations</subject><subject>Availability</subject><subject>Biological effects</subject><subject>Chains</subject><subject>Computational Biology - methods</subject><subject>Databases, Genetic</subject><subject>Fragmentation</subject><subject>Graph theory</subject><subject>Graphs</subject><subject>Labels</subject><subject>Scoring models</subject><subject>Sequence Alignment - methods</subject><subject>Sequence Analysis, DNA - methods</subject><subject>Software</subject><issn>1367-4803</issn><issn>1367-4811</issn><issn>1367-4811</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2024</creationdate><recordtype>article</recordtype><sourceid>TOX</sourceid><sourceid>EIF</sourceid><recordid>eNqNkLFOwzAQhi0EoqXwClUkFpbQs504yQiFAlIlFpgjOzm3rhK72IkEb0-qlkowMZzuhu__dfoImVK4pVDwmTLOWO18KztThZnqJDImTsiYcpHFSU7p6fEGPiIXIWwAIIVUnJMRz4sEqEjGZLGUCpt41Zsa6ygg1nG1lsbG-NmhrSPZmJVt0XaRs5G01nWyG8AHjO59bzY2Wnm5XYdLcqZlE_DqsCfkffH4Nn-Ol69PL_O7ZVxxEF2sZa2zNEOhIVUMClAiL_Jc6ozTNK9qXmgluZYASiEreFbwFIcRRcUYCsYn5Gbfu_Xuo8fQla0JFTaNtOj6UHLIOONJkiUDev0H3bje2-G7klOWJDzPYVco9lTlXQgedbn1ppX-q6RQ7kyXv02XB9NDcHqo71WL9TH2o3YA6B5w_fa_pd9L3ZAC</recordid><startdate>20240628</startdate><enddate>20240628</enddate><creator>Mustafa, Harun</creator><creator>Karasikov, Mikhail</creator><creator>Mansouri Ghiasi, Nika</creator><creator>Rätsch, Gunnar</creator><creator>Kahles, André</creator><general>Oxford University Press</general><general>Oxford Publishing Limited 
(England)</general><scope>TOX</scope><scope>CGR</scope><scope>CUY</scope><scope>CVF</scope><scope>ECM</scope><scope>EIF</scope><scope>NPM</scope><scope>AAYXX</scope><scope>CITATION</scope><scope>7QF</scope><scope>7QO</scope><scope>7QQ</scope><scope>7SC</scope><scope>7SE</scope><scope>7SP</scope><scope>7SR</scope><scope>7TA</scope><scope>7TB</scope><scope>7TM</scope><scope>7TO</scope><scope>7U5</scope><scope>8BQ</scope><scope>8FD</scope><scope>F28</scope><scope>FR3</scope><scope>H8D</scope><scope>H8G</scope><scope>H94</scope><scope>JG9</scope><scope>JQ2</scope><scope>K9.</scope><scope>KR7</scope><scope>L7M</scope><scope>L~C</scope><scope>L~D</scope><scope>P64</scope><scope>7X8</scope><orcidid>https://orcid.org/0000-0002-3411-0692</orcidid><orcidid>https://orcid.org/0000-0002-2125-6086</orcidid><orcidid>https://orcid.org/0000-0001-6200-5972</orcidid><orcidid>https://orcid.org/0000-0002-0833-0042</orcidid></search><sort><creationdate>20240628</creationdate><title>Label-guided seed-chain-extend alignment on annotated De Bruijn graphs</title><author>Mustafa, Harun ; Karasikov, Mikhail ; Mansouri Ghiasi, Nika ; Rätsch, Gunnar ; Kahles, André</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-c306t-fadf757e6f05b2090b68988af73158cd39fba3fa00bbe2937935e93569c22e623</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2024</creationdate><topic>Algorithms</topic><topic>Alignment</topic><topic>Annotations</topic><topic>Availability</topic><topic>Biological effects</topic><topic>Chains</topic><topic>Computational Biology - methods</topic><topic>Databases, Genetic</topic><topic>Fragmentation</topic><topic>Graph theory</topic><topic>Graphs</topic><topic>Labels</topic><topic>Scoring models</topic><topic>Sequence Alignment - methods</topic><topic>Sequence Analysis, DNA - 
methods</topic><topic>Software</topic><toplevel>peer_reviewed</toplevel><toplevel>online_resources</toplevel><creatorcontrib>Mustafa, Harun</creatorcontrib><creatorcontrib>Karasikov, Mikhail</creatorcontrib><creatorcontrib>Mansouri Ghiasi, Nika</creatorcontrib><creatorcontrib>Rätsch, Gunnar</creatorcontrib><creatorcontrib>Kahles, André</creatorcontrib><collection>Oxford Journals Open Access Collection</collection><collection>Medline</collection><collection>MEDLINE</collection><collection>MEDLINE (Ovid)</collection><collection>MEDLINE</collection><collection>MEDLINE</collection><collection>PubMed</collection><collection>CrossRef</collection><collection>Aluminium Industry Abstracts</collection><collection>Biotechnology Research Abstracts</collection><collection>Ceramic Abstracts</collection><collection>Computer and Information Systems Abstracts</collection><collection>Corrosion Abstracts</collection><collection>Electronics & Communications Abstracts</collection><collection>Engineered Materials Abstracts</collection><collection>Materials Business File</collection><collection>Mechanical & Transportation Engineering Abstracts</collection><collection>Nucleic Acids Abstracts</collection><collection>Oncogenes and Growth Factors Abstracts</collection><collection>Solid State and Superconductivity Abstracts</collection><collection>METADEX</collection><collection>Technology Research Database</collection><collection>ANTE: Abstracts in New Technology & Engineering</collection><collection>Engineering Research Database</collection><collection>Aerospace Database</collection><collection>Copper Technical Reference Library</collection><collection>AIDS and Cancer Research Abstracts</collection><collection>Materials Research Database</collection><collection>ProQuest Computer Science Collection</collection><collection>ProQuest Health & Medical Complete (Alumni)</collection><collection>Civil Engineering Abstracts</collection><collection>Advanced Technologies Database with 
Aerospace</collection><collection>Computer and Information Systems Abstracts Academic</collection><collection>Computer and Information Systems Abstracts Professional</collection><collection>Biotechnology and BioEngineering Abstracts</collection><collection>MEDLINE - Academic</collection><jtitle>Bioinformatics (Oxford, England)</jtitle></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>Mustafa, Harun</au><au>Karasikov, Mikhail</au><au>Mansouri Ghiasi, Nika</au><au>Rätsch, Gunnar</au><au>Kahles, André</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>Label-guided seed-chain-extend alignment on annotated De Bruijn graphs</atitle><jtitle>Bioinformatics (Oxford, England)</jtitle><addtitle>Bioinformatics</addtitle><date>2024-06-28</date><risdate>2024</risdate><volume>40</volume><issue>Supplement_1</issue><spage>i337</spage><epage>i346</epage><pages>i337-i346</pages><issn>1367-4803</issn><issn>1367-4811</issn><eissn>1367-4811</eissn><abstract>Abstract
Motivation
Exponential growth in sequencing databases has motivated scalable De Bruijn graph-based (DBG) indexing for searching these data, using annotations to label nodes with sample IDs. Low-depth sequencing samples correspond to fragmented subgraphs, complicating finding the long contiguous walks required for alignment queries. Aligners that target single-labelled subgraphs reduce alignment lengths due to fragmentation, leading to low recall for long reads. While some (e.g. label-free) aligners partially overcome fragmentation by combining information from multiple samples, biologically irrelevant combinations in such approaches can inflate the search space or reduce accuracy.
Results
We introduce a new scoring model, ‘multi-label alignment’ (MLA), for annotated DBGs. MLA leverages two new operations: To promote biologically relevant sample combinations, ‘Label Change’ incorporates more informative global sample similarity into local scores. To improve connectivity, ‘Node Length Change’ dynamically adjusts the DBG node length during traversal. Our fast, approximate, yet accurate MLA implementation has two key steps: a single-label seed-chain-extend aligner (SCA) and a multi-label chainer (MLC). SCA uses a traditional scoring model adapting recent chaining improvements to assembly graphs and provides a curated pool of alignments. MLC extracts seed anchors from SCA’s alignments, produces multi-label chains using MLA scoring, then finally forms multi-label alignments. We show via substantial improvements in taxonomic classification accuracy that MLA produces biologically relevant alignments, decreasing average weighted UniFrac errors by 63.1%–66.8% and covering 45.5%–47.4% (median) more long-read query characters than state-of-the-art aligners. MLA’s runtimes are competitive with label-combining alignment and substantially faster than single-label alignment.
Availability and implementation
The data, scripts, and instructions for generating our results are available at https://github.com/ratschlab/mla.</abstract><cop>England</cop><pub>Oxford University Press</pub><pmid>38940164</pmid><doi>10.1093/bioinformatics/btae226</doi><orcidid>https://orcid.org/0000-0002-3411-0692</orcidid><orcidid>https://orcid.org/0000-0002-2125-6086</orcidid><orcidid>https://orcid.org/0000-0001-6200-5972</orcidid><orcidid>https://orcid.org/0000-0002-0833-0042</orcidid><oa>free_for_read</oa></addata></record> |
fulltext | fulltext |
identifier | ISSN: 1367-4803 |
ispartof | Bioinformatics (Oxford, England), 2024-06, Vol.40 (Supplement_1), p.i337-i346 |
issn | 1367-4803 1367-4811 1367-4811 |
language | eng |
recordid | cdi_proquest_miscellaneous_3073234474 |
source | MEDLINE; DOAJ Directory of Open Access Journals; Oxford Journals Open Access Collection; EZB-FREE-00999 freely available EZB journals; PubMed Central; Alma/SFX Local Collection |
subjects | Algorithms ; Alignment ; Annotations ; Availability ; Biological effects ; Chains ; Computational Biology - methods ; Databases, Genetic ; Fragmentation ; Graph theory ; Graphs ; Labels ; Scoring models ; Sequence Alignment - methods ; Sequence Analysis, DNA - methods ; Software |
title | Label-guided seed-chain-extend alignment on annotated De Bruijn graphs |
url | https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-13T17%3A33%3A59IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_cross&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Label-guided%20seed-chain-extend%20alignment%20on%20annotated%20De%20Bruijn%20graphs&rft.jtitle=Bioinformatics%20(Oxford,%20England)&rft.au=Mustafa,%20Harun&rft.date=2024-06-28&rft.volume=40&rft.issue=Supplement_1&rft.spage=i337&rft.epage=i346&rft.pages=i337-i346&rft.issn=1367-4803&rft.eissn=1367-4811&rft_id=info:doi/10.1093/bioinformatics/btae226&rft_dat=%3Cproquest_cross%3E3124438802%3C/proquest_cross%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=3124438802&rft_id=info:pmid/38940164&rft_oup_id=10.1093/bioinformatics/btae226&rfr_iscdi=true |