Label-guided seed-chain-extend alignment on annotated De Bruijn graphs

Abstract. Motivation: Exponential growth in sequencing databases has motivated scalable De Bruijn graph-based (DBG) indexing for searching these data, using annotations to label nodes with sample IDs. Low-depth sequencing samples correspond to fragmented subgraphs, complicating finding the long contig...

Bibliographic details
Published in:	Bioinformatics (Oxford, England), 2024-06, Vol.40 (Supplement_1), p.i337-i346
Main authors:	Mustafa, Harun, Karasikov, Mikhail, Mansouri Ghiasi, Nika, Rätsch, Gunnar, Kahles, André
Format:	Article
Language:	eng
Subjects:	Algorithms Alignment Annotations Availability Biological effects Chains Computational Biology - methods Databases, Genetic Fragmentation Graph theory Graphs Labels Scoring models Sequence Alignment - methods Sequence Analysis, DNA - methods Software
Online access:	Full text

container_end_page	i346
container_issue	Supplement_1
container_start_page	i337
container_title	Bioinformatics (Oxford, England)
container_volume	40
creator	Mustafa, Harun Karasikov, Mikhail Mansouri Ghiasi, Nika Rätsch, Gunnar Kahles, André
description	Abstract. Motivation: Exponential growth in sequencing databases has motivated scalable De Bruijn graph-based (DBG) indexing for searching these data, using annotations to label nodes with sample IDs. Low-depth sequencing samples correspond to fragmented subgraphs, complicating finding the long contiguous walks required for alignment queries. Aligners that target single-labelled subgraphs reduce alignment lengths due to fragmentation, leading to low recall for long reads. While some (e.g. label-free) aligners partially overcome fragmentation by combining information from multiple samples, biologically irrelevant combinations in such approaches can inflate the search space or reduce accuracy. Results: We introduce a new scoring model, ‘multi-label alignment’ (MLA), for annotated DBGs. MLA leverages two new operations: To promote biologically relevant sample combinations, ‘Label Change’ incorporates more informative global sample similarity into local scores. To improve connectivity, ‘Node Length Change’ dynamically adjusts the DBG node length during traversal. Our fast, approximate, yet accurate MLA implementation has two key steps: a single-label seed-chain-extend aligner (SCA) and a multi-label chainer (MLC). SCA uses a traditional scoring model adapting recent chaining improvements to assembly graphs and provides a curated pool of alignments. MLC extracts seed anchors from SCA's alignments, produces multi-label chains using MLA scoring, then finally forms multi-label alignments. We show via substantial improvements in taxonomic classification accuracy that MLA produces biologically relevant alignments, decreasing average weighted UniFrac errors by 63.1%–66.8% and covering 45.5%–47.4% (median) more long-read query characters than state-of-the-art aligners. MLA's runtimes are competitive with label-combining alignment and substantially faster than single-label alignment.
Availability and implementation: The data, scripts, and instructions for generating our results are available at https://github.com/ratschlab/mla.
doi_str_mv	10.1093/bioinformatics/btae226
format	Article
fullrecord	<record><control><sourceid>proquest_cross</sourceid><recordid>TN_cdi_proquest_miscellaneous_3073234474</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><oup_id>10.1093/bioinformatics/btae226</oup_id><sourcerecordid>3124438802</sourcerecordid><originalsourceid>FETCH-LOGICAL-c306t-fadf757e6f05b2090b68988af73158cd39fba3fa00bbe2937935e93569c22e623</originalsourceid><addsrcrecordid>eNqNkLFOwzAQhi0EoqXwClUkFpbQs504yQiFAlIlFpgjOzm3rhK72IkEb0-qlkowMZzuhu__dfoImVK4pVDwmTLOWO18KztThZnqJDImTsiYcpHFSU7p6fEGPiIXIWwAIIVUnJMRz4sEqEjGZLGUCpt41Zsa6ygg1nG1lsbG-NmhrSPZmJVt0XaRs5G01nWyG8AHjO59bzY2Wnm5XYdLcqZlE_DqsCfkffH4Nn-Ol69PL_O7ZVxxEF2sZa2zNEOhIVUMClAiL_Jc6ozTNK9qXmgluZYASiEreFbwFIcRRcUYCsYn5Gbfu_Xuo8fQla0JFTaNtOj6UHLIOONJkiUDev0H3bje2-G7klOWJDzPYVco9lTlXQgedbn1ppX-q6RQ7kyXv02XB9NDcHqo71WL9TH2o3YA6B5w_fa_pd9L3ZAC</addsrcrecordid><sourcetype>Aggregation Database</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>3124438802</pqid></control><display><type>article</type><title>Label-guided seed-chain-extend alignment on annotated De Bruijn graphs</title><source>MEDLINE</source><source>DOAJ Directory of Open Access Journals</source><source>Oxford Journals Open Access Collection</source><source>EZB-FREE-00999 freely available EZB journals</source><source>PubMed Central</source><source>Alma/SFX Local Collection</source><creator>Mustafa, Harun ; Karasikov, Mikhail ; Mansouri Ghiasi, Nika ; Rätsch, Gunnar ; Kahles, André</creator><creatorcontrib>Mustafa, Harun ; Karasikov, Mikhail ; Mansouri Ghiasi, Nika ; Rätsch, Gunnar ; Kahles, André</creatorcontrib><description>Abstract Motivation Exponential growth in sequencing databases has motivated scalable De Bruijn graph-based (DBG) indexing for searching these data, using annotations to label nodes with sample IDs. Low-depth sequencing samples correspond to fragmented subgraphs, complicating finding the long contiguous walks required for alignment queries. 
Aligners that target single-labelled subgraphs reduce alignment lengths due to fragmentation, leading to low recall for long reads. While some (e.g. label-free) aligners partially overcome fragmentation by combining information from multiple samples, biologically irrelevant combinations in such approaches can inflate the search space or reduce accuracy. Results We introduce a new scoring model, ‘multi-label alignment’ (MLA), for annotated DBGs. MLA leverages two new operations: To promote biologically relevant sample combinations, ‘Label Change’ incorporates more informative global sample similarity into local scores. To improve connectivity, ‘Node Length Change’ dynamically adjusts the DBG node length during traversal. Our fast, approximate, yet accurate MLA implementation has two key steps: a single-label seed-chain-extend aligner (SCA) and a multi-label chainer (MLC). SCA uses a traditional scoring model adapting recent chaining improvements to assembly graphs and provides a curated pool of alignments. MLC extracts seed anchors from SCA's alignments, produces multi-label chains using MLA scoring, then finally forms multi-label alignments. We show via substantial improvements in taxonomic classification accuracy that MLA produces biologically relevant alignments, decreasing average weighted UniFrac errors by 63.1%–66.8% and covering 45.5%–47.4% (median) more long-read query characters than state-of-the-art aligners. MLA's runtimes are competitive with label-combining alignment and substantially faster than single-label alignment. 
Availability and implementation The data, scripts, and instructions for generating our results are available at https://github.com/ratschlab/mla.</description><identifier>ISSN: 1367-4803</identifier><identifier>ISSN: 1367-4811</identifier><identifier>EISSN: 1367-4811</identifier><identifier>DOI: 10.1093/bioinformatics/btae226</identifier><identifier>PMID: 38940164</identifier><language>eng</language><publisher>England: Oxford University Press</publisher><subject>Algorithms ; Alignment ; Annotations ; Availability ; Biological effects ; Chains ; Computational Biology - methods ; Databases, Genetic ; Fragmentation ; Graph theory ; Graphs ; Labels ; Scoring models ; Sequence Alignment - methods ; Sequence Analysis, DNA - methods ; Software</subject><ispartof>Bioinformatics (Oxford, England), 2024-06, Vol.40 (Supplement_1), p.i337-i346</ispartof><rights>The Author(s) 2024. Published by Oxford University Press. 2024</rights><rights>The Author(s) 2024. Published by Oxford University Press.</rights><lds50>peer_reviewed</lds50><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed><cites>FETCH-LOGICAL-c306t-fadf757e6f05b2090b68988af73158cd39fba3fa00bbe2937935e93569c22e623</cites><orcidid>0000-0002-3411-0692 ; 0000-0002-2125-6086 ; 0000-0001-6200-5972 ; 0000-0002-0833-0042</orcidid></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><link.rule.ids>314,780,784,864,1603,27922,27923</link.rule.ids><backlink>$$Uhttps://www.ncbi.nlm.nih.gov/pubmed/38940164$$D View this record in MEDLINE/PubMed$$Hfree_for_read</backlink></links><search><creatorcontrib>Mustafa, Harun</creatorcontrib><creatorcontrib>Karasikov, Mikhail</creatorcontrib><creatorcontrib>Mansouri Ghiasi, Nika</creatorcontrib><creatorcontrib>Rätsch, Gunnar</creatorcontrib><creatorcontrib>Kahles, André</creatorcontrib><title>Label-guided seed-chain-extend alignment on annotated De Bruijn 
graphs</title><title>Bioinformatics (Oxford, England)</title><addtitle>Bioinformatics</addtitle><description>Abstract Motivation Exponential growth in sequencing databases has motivated scalable De Bruijn graph-based (DBG) indexing for searching these data, using annotations to label nodes with sample IDs. Low-depth sequencing samples correspond to fragmented subgraphs, complicating finding the long contiguous walks required for alignment queries. Aligners that target single-labelled subgraphs reduce alignment lengths due to fragmentation, leading to low recall for long reads. While some (e.g. label-free) aligners partially overcome fragmentation by combining information from multiple samples, biologically irrelevant combinations in such approaches can inflate the search space or reduce accuracy. Results We introduce a new scoring model, ‘multi-label alignment’ (MLA), for annotated DBGs. MLA leverages two new operations: To promote biologically relevant sample combinations, ‘Label Change’ incorporates more informative global sample similarity into local scores. To improve connectivity, ‘Node Length Change’ dynamically adjusts the DBG node length during traversal. Our fast, approximate, yet accurate MLA implementation has two key steps: a single-label seed-chain-extend aligner (SCA) and a multi-label chainer (MLC). SCA uses a traditional scoring model adapting recent chaining improvements to assembly graphs and provides a curated pool of alignments. MLC extracts seed anchors from SCA's alignments, produces multi-label chains using MLA scoring, then finally forms multi-label alignments. We show via substantial improvements in taxonomic classification accuracy that MLA produces biologically relevant alignments, decreasing average weighted UniFrac errors by 63.1%–66.8% and covering 45.5%–47.4% (median) more long-read query characters than state-of-the-art aligners. 
MLA's runtimes are competitive with label-combining alignment and substantially faster than single-label alignment. Availability and implementation The data, scripts, and instructions for generating our results are available at https://github.com/ratschlab/mla.</description><subject>Algorithms</subject><subject>Alignment</subject><subject>Annotations</subject><subject>Availability</subject><subject>Biological effects</subject><subject>Chains</subject><subject>Computational Biology - methods</subject><subject>Databases, Genetic</subject><subject>Fragmentation</subject><subject>Graph theory</subject><subject>Graphs</subject><subject>Labels</subject><subject>Scoring models</subject><subject>Sequence Alignment - methods</subject><subject>Sequence Analysis, DNA - methods</subject><subject>Software</subject><issn>1367-4803</issn><issn>1367-4811</issn><issn>1367-4811</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2024</creationdate><recordtype>article</recordtype><sourceid>TOX</sourceid><sourceid>EIF</sourceid><recordid>eNqNkLFOwzAQhi0EoqXwClUkFpbQs504yQiFAlIlFpgjOzm3rhK72IkEb0-qlkowMZzuhu__dfoImVK4pVDwmTLOWO18KztThZnqJDImTsiYcpHFSU7p6fEGPiIXIWwAIIVUnJMRz4sEqEjGZLGUCpt41Zsa6ygg1nG1lsbG-NmhrSPZmJVt0XaRs5G01nWyG8AHjO59bzY2Wnm5XYdLcqZlE_DqsCfkffH4Nn-Ol69PL_O7ZVxxEF2sZa2zNEOhIVUMClAiL_Jc6ozTNK9qXmgluZYASiEreFbwFIcRRcUYCsYn5Gbfu_Xuo8fQla0JFTaNtOj6UHLIOONJkiUDev0H3bje2-G7klOWJDzPYVco9lTlXQgedbn1ppX-q6RQ7kyXv02XB9NDcHqo71WL9TH2o3YA6B5w_fa_pd9L3ZAC</recordid><startdate>20240628</startdate><enddate>20240628</enddate><creator>Mustafa, Harun</creator><creator>Karasikov, Mikhail</creator><creator>Mansouri Ghiasi, Nika</creator><creator>Rätsch, Gunnar</creator><creator>Kahles, André</creator><general>Oxford University Press</general><general>Oxford Publishing Limited 
(England)</general><scope>TOX</scope><scope>CGR</scope><scope>CUY</scope><scope>CVF</scope><scope>ECM</scope><scope>EIF</scope><scope>NPM</scope><scope>AAYXX</scope><scope>CITATION</scope><scope>7QF</scope><scope>7QO</scope><scope>7QQ</scope><scope>7SC</scope><scope>7SE</scope><scope>7SP</scope><scope>7SR</scope><scope>7TA</scope><scope>7TB</scope><scope>7TM</scope><scope>7TO</scope><scope>7U5</scope><scope>8BQ</scope><scope>8FD</scope><scope>F28</scope><scope>FR3</scope><scope>H8D</scope><scope>H8G</scope><scope>H94</scope><scope>JG9</scope><scope>JQ2</scope><scope>K9.</scope><scope>KR7</scope><scope>L7M</scope><scope>L~C</scope><scope>L~D</scope><scope>P64</scope><scope>7X8</scope><orcidid>https://orcid.org/0000-0002-3411-0692</orcidid><orcidid>https://orcid.org/0000-0002-2125-6086</orcidid><orcidid>https://orcid.org/0000-0001-6200-5972</orcidid><orcidid>https://orcid.org/0000-0002-0833-0042</orcidid></search><sort><creationdate>20240628</creationdate><title>Label-guided seed-chain-extend alignment on annotated De Bruijn graphs</title><author>Mustafa, Harun ; Karasikov, Mikhail ; Mansouri Ghiasi, Nika ; Rätsch, Gunnar ; Kahles, André</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-c306t-fadf757e6f05b2090b68988af73158cd39fba3fa00bbe2937935e93569c22e623</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2024</creationdate><topic>Algorithms</topic><topic>Alignment</topic><topic>Annotations</topic><topic>Availability</topic><topic>Biological effects</topic><topic>Chains</topic><topic>Computational Biology - methods</topic><topic>Databases, Genetic</topic><topic>Fragmentation</topic><topic>Graph theory</topic><topic>Graphs</topic><topic>Labels</topic><topic>Scoring models</topic><topic>Sequence Alignment - methods</topic><topic>Sequence Analysis, DNA - 
methods</topic><topic>Software</topic><toplevel>peer_reviewed</toplevel><toplevel>online_resources</toplevel><creatorcontrib>Mustafa, Harun</creatorcontrib><creatorcontrib>Karasikov, Mikhail</creatorcontrib><creatorcontrib>Mansouri Ghiasi, Nika</creatorcontrib><creatorcontrib>Rätsch, Gunnar</creatorcontrib><creatorcontrib>Kahles, André</creatorcontrib><collection>Oxford Journals Open Access Collection</collection><collection>Medline</collection><collection>MEDLINE</collection><collection>MEDLINE (Ovid)</collection><collection>MEDLINE</collection><collection>MEDLINE</collection><collection>PubMed</collection><collection>CrossRef</collection><collection>Aluminium Industry Abstracts</collection><collection>Biotechnology Research Abstracts</collection><collection>Ceramic Abstracts</collection><collection>Computer and Information Systems Abstracts</collection><collection>Corrosion Abstracts</collection><collection>Electronics & Communications Abstracts</collection><collection>Engineered Materials Abstracts</collection><collection>Materials Business File</collection><collection>Mechanical & Transportation Engineering Abstracts</collection><collection>Nucleic Acids Abstracts</collection><collection>Oncogenes and Growth Factors Abstracts</collection><collection>Solid State and Superconductivity Abstracts</collection><collection>METADEX</collection><collection>Technology Research Database</collection><collection>ANTE: Abstracts in New Technology & Engineering</collection><collection>Engineering Research Database</collection><collection>Aerospace Database</collection><collection>Copper Technical Reference Library</collection><collection>AIDS and Cancer Research Abstracts</collection><collection>Materials Research Database</collection><collection>ProQuest Computer Science Collection</collection><collection>ProQuest Health & Medical Complete (Alumni)</collection><collection>Civil Engineering Abstracts</collection><collection>Advanced Technologies Database with 
Aerospace</collection><collection>Computer and Information Systems Abstracts Academic</collection><collection>Computer and Information Systems Abstracts Professional</collection><collection>Biotechnology and BioEngineering Abstracts</collection><collection>MEDLINE - Academic</collection><jtitle>Bioinformatics (Oxford, England)</jtitle></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>Mustafa, Harun</au><au>Karasikov, Mikhail</au><au>Mansouri Ghiasi, Nika</au><au>Rätsch, Gunnar</au><au>Kahles, André</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>Label-guided seed-chain-extend alignment on annotated De Bruijn graphs</atitle><jtitle>Bioinformatics (Oxford, England)</jtitle><addtitle>Bioinformatics</addtitle><date>2024-06-28</date><risdate>2024</risdate><volume>40</volume><issue>Supplement_1</issue><spage>i337</spage><epage>i346</epage><pages>i337-i346</pages><issn>1367-4803</issn><issn>1367-4811</issn><eissn>1367-4811</eissn><abstract>Abstract Motivation Exponential growth in sequencing databases has motivated scalable De Bruijn graph-based (DBG) indexing for searching these data, using annotations to label nodes with sample IDs. Low-depth sequencing samples correspond to fragmented subgraphs, complicating finding the long contiguous walks required for alignment queries. Aligners that target single-labelled subgraphs reduce alignment lengths due to fragmentation, leading to low recall for long reads. While some (e.g. label-free) aligners partially overcome fragmentation by combining information from multiple samples, biologically irrelevant combinations in such approaches can inflate the search space or reduce accuracy. Results We introduce a new scoring model, ‘multi-label alignment’ (MLA), for annotated DBGs. 
MLA leverages two new operations: To promote biologically relevant sample combinations, ‘Label Change’ incorporates more informative global sample similarity into local scores. To improve connectivity, ‘Node Length Change’ dynamically adjusts the DBG node length during traversal. Our fast, approximate, yet accurate MLA implementation has two key steps: a single-label seed-chain-extend aligner (SCA) and a multi-label chainer (MLC). SCA uses a traditional scoring model adapting recent chaining improvements to assembly graphs and provides a curated pool of alignments. MLC extracts seed anchors from SCA's alignments, produces multi-label chains using MLA scoring, then finally forms multi-label alignments. We show via substantial improvements in taxonomic classification accuracy that MLA produces biologically relevant alignments, decreasing average weighted UniFrac errors by 63.1%–66.8% and covering 45.5%–47.4% (median) more long-read query characters than state-of-the-art aligners. MLA's runtimes are competitive with label-combining alignment and substantially faster than single-label alignment. Availability and implementation The data, scripts, and instructions for generating our results are available at https://github.com/ratschlab/mla.</abstract><cop>England</cop><pub>Oxford University Press</pub><pmid>38940164</pmid><doi>10.1093/bioinformatics/btae226</doi><orcidid>https://orcid.org/0000-0002-3411-0692</orcidid><orcidid>https://orcid.org/0000-0002-2125-6086</orcidid><orcidid>https://orcid.org/0000-0001-6200-5972</orcidid><orcidid>https://orcid.org/0000-0002-0833-0042</orcidid><oa>free_for_read</oa></addata></record>
fulltext	fulltext
identifier	ISSN: 1367-4803
ispartof	Bioinformatics (Oxford, England), 2024-06, Vol.40 (Supplement_1), p.i337-i346
issn	1367-4803 1367-4811 1367-4811
language	eng
recordid	cdi_proquest_miscellaneous_3073234474
source	MEDLINE; DOAJ Directory of Open Access Journals; Oxford Journals Open Access Collection; EZB-FREE-00999 freely available EZB journals; PubMed Central; Alma/SFX Local Collection
subjects	Algorithms Alignment Annotations Availability Biological effects Chains Computational Biology - methods Databases, Genetic Fragmentation Graph theory Graphs Labels Scoring models Sequence Alignment - methods Sequence Analysis, DNA - methods Software
title	Label-guided seed-chain-extend alignment on annotated De Bruijn graphs
url	https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-13T17%3A33%3A59IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_cross&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Label-guided%20seed-chain-extend%20alignment%20on%20annotated%20De%20Bruijn%20graphs&rft.jtitle=Bioinformatics%20(Oxford,%20England)&rft.au=Mustafa,%20Harun&rft.date=2024-06-28&rft.volume=40&rft.issue=Supplement_1&rft.spage=i337&rft.epage=i346&rft.pages=i337-i346&rft.issn=1367-4803&rft.eissn=1367-4811&rft_id=info:doi/10.1093/bioinformatics/btae226&rft_dat=%3Cproquest_cross%3E3124438802%3C/proquest_cross%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=3124438802&rft_id=info:pmid/38940164&rft_oup_id=10.1093/bioinformatics/btae226&rfr_iscdi=true
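
Illustrative sketch (not from the article): the abstract states that the multi-label chainer builds chains from seed anchors and that the ‘Label Change’ operation folds global sample similarity into local scores. The toy Python below shows one way such a penalty could enter a chaining dynamic program. The function name `chain_with_label_change`, the anchor tuple layout, the `similarity` table, and the `-log(similarity)` penalty are all assumptions made for illustration; the authors' actual scoring model and code are in the repository linked in the abstract.

```python
import math


def chain_with_label_change(anchors, similarity, min_similarity=1e-6):
    """Toy chaining DP with a label-change penalty (hypothetical model).

    anchors: list of (query_start, query_end, label, score) tuples.
    similarity: dict mapping frozenset({label_a, label_b}) to a value in (0, 1].
    Returns (best_score, chain) where chain lists anchor indices in query order.
    """
    if not anchors:
        return 0.0, []

    # Process anchors from left to right along the query.
    order = sorted(range(len(anchors)), key=lambda i: anchors[i][0])
    best = {}  # best chain score ending at each anchor
    prev = {}  # predecessor of each anchor in its best chain

    for pos, i in enumerate(order):
        q_start, _, label, score = anchors[i]
        best[i] = score
        prev[i] = None
        for j in order[:pos]:
            _, prev_end, prev_label, _ = anchors[j]
            if prev_end > q_start:
                continue  # overlapping anchors are not chained in this toy model
            if prev_label == label:
                penalty = 0.0
            else:
                # Assumed penalty: dissimilar samples are expensive to combine.
                sim = similarity.get(frozenset((prev_label, label)), min_similarity)
                penalty = -math.log(sim)
            candidate = best[j] + score - penalty
            if candidate > best[i]:
                best[i] = candidate
                prev[i] = j

    # Backtrack from the highest-scoring chain end.
    end = max(best, key=best.get)
    best_score = best[end]
    chain = []
    while end is not None:
        chain.append(end)
        end = prev[end]
    chain.reverse()
    return best_score, chain


# Three anchors on labels A, B, C; only A and B are marked as similar.
example_anchors = [(0, 10, "A", 10.0), (10, 20, "B", 10.0), (20, 30, "C", 10.0)]
example_similarity = {frozenset(("A", "B")): 0.5}
example_score, example_chain = chain_with_label_change(example_anchors, example_similarity)
```

In this example the chain joins the anchors labelled A and B, whose similarity is high enough to make the label change worthwhile, and leaves out the anchor labelled C, which has no recorded similarity to either and therefore receives the large default penalty.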