Label-guided seed-chain-extend alignment on annotated De Bruijn graphs
Abstract Motivation Exponential growth in sequencing databases has motivated scalable De Bruijn graph-based (DBG) indexing for searching these data, using annotations to label nodes with sample IDs. Low-depth sequencing samples correspond to fragmented subgraphs, complicating finding the long contig...
Gespeichert in:
Veröffentlicht in: | Bioinformatics (Oxford, England) England), 2024-06, Vol.40 (Supplement_1), p.i337-i346 |
---|---|
Hauptverfasser: | , , , , |
Format: | Artikel |
Sprache: | eng |
Schlagworte: | |
Online-Zugang: | Volltext |
Tags: |
Tag hinzufügen
Keine Tags, Fügen Sie den ersten Tag hinzu!
|
Zusammenfassung: | Abstract
Motivation
Exponential growth in sequencing databases has motivated scalable De Bruijn graph-based (DBG) indexing for searching these data, using annotations to label nodes with sample IDs. Low-depth sequencing samples correspond to fragmented subgraphs, complicating finding the long contiguous walks required for alignment queries. Aligners that target single-labelled subgraphs reduce alignment lengths due to fragmentation, leading to low recall for long reads. While some (e.g. label-free) aligners partially overcome fragmentation by combining information from multiple samples, biologically irrelevant combinations in such approaches can inflate the search space or reduce accuracy.
Results
We introduce a new scoring model, ‘multi-label alignment’ (MLA), for annotated DBGs. MLA leverages two new operations: To promote biologically relevant sample combinations, ‘Label Change’ incorporates more informative global sample similarity into local scores. To improve connectivity, ‘Node Length Change’ dynamically adjusts the DBG node length during traversal. Our fast, approximate, yet accurate MLA implementation has two key steps: a single-label seed-chain-extend aligner (SCA) and a multi-label chainer (MLC). SCA uses a traditional scoring model adapting recent chaining improvements to assembly graphs and provides a curated pool of alignments. MLC extracts seed anchors from SCAs alignments, produces multi-label chains using MLA scoring, then finally forms multi-label alignments. We show via substantial improvements in taxonomic classification accuracy that MLA produces biologically relevant alignments, decreasing average weighted UniFrac errors by 63.1%–66.8% and covering 45.5%–47.4% (median) more long-read query characters than state-of-the-art aligners. MLAs runtimes are competitive with label-combining alignment and substantially faster than single-label alignment.
Availability and implementation
The data, scripts, and instructions for generating our results are available at https://github.com/ratschlab/mla. |
---|---|
ISSN: | 1367-4803 1367-4811 1367-4811 |
DOI: | 10.1093/bioinformatics/btae226 |