Disentangling cobionts and contamination in long-read genomic data using sequence composition
Published in: | G3 : genes - genomes - genetics 2024-11, Vol.14 (11) |
---|---|
Main author: | |
Format: | Article |
Language: | eng |
Subjects: | |
Online access: | Full text |
Summary: | Abstract
The recent acceleration in genome sequencing targeting previously unexplored parts of the tree of life presents computational challenges. Samples collected from the wild often contain sequences from several organisms, including the target, its cobionts, and contaminants. Effective methods are therefore needed to separate sequences. Though advances in sequencing technology make this task easier, it remains difficult to taxonomically assign sequences from eukaryotic taxa that are not well represented in databases. Therefore, reference-based methods alone are insufficient. Here, I examine how we can take advantage of differences in sequence composition between organisms to identify symbionts, parasites, and contaminants in samples, with minimal reliance on reference data. To this end, I explore data from the Darwin Tree of Life project, including hundreds of high-quality HiFi read sets from insects. Visualizing two-dimensional representations of read tetranucleotide composition learned by a variational autoencoder can reveal distinct components of a sample. Annotating the embeddings with additional information, such as coding density, estimated coverage, or taxonomic labels, allows rapid assessment of the contents of a dataset. The approach scales to millions of sequences, making it possible to explore unassembled read sets, even for large genomes. Combined with interactive visualization tools, it allows a large fraction of cobionts reported by reference-based screening to be identified. Crucially, it also facilitates retrieving genomes for which suitable reference data are absent.
Samples collected for genomic sequencing often contain genetic material from several organisms. Determining if a given sequence represents contamination can be difficult without suitable reference data. However, sequence composition can differ vastly between genomes. This work explores how 2D representations of composition can help separate genomic long reads from different sources. A Variational Autoencoder (a generative model based on a neural network) provides an effective framework for embedding millions of reads and identifying sequences from parasites, symbionts, and contaminants. |
---|---|
ISSN: | 2160-1836 |
DOI: | 10.1093/g3journal/jkae187 |
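
The summary above refers to the tetranucleotide composition of reads as the input to the embedding. As a rough illustration only, the sketch below computes a normalized tetranucleotide frequency vector for a single read. Collapsing each 4-mer with its reverse complement (giving 136 canonical 4-mers) is an assumption made here for strand independence, not something the record states, and the function names are hypothetical. The variational autoencoder itself is not shown.

```python
from itertools import product

K = 4  # tetranucleotides
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer):
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    rc = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, rc)


# All canonical 4-mers in a fixed order (assumption: strands are collapsed).
CANONICAL_KMERS = sorted({canonical("".join(p)) for p in product("ACGT", repeat=K)})
INDEX = {kmer: i for i, kmer in enumerate(CANONICAL_KMERS)}


def tetranucleotide_profile(read):
    """Return the relative frequency of each canonical 4-mer in a read.

    Windows containing characters other than A, C, G, T (for example N)
    are skipped. A read with no valid window yields an all-zero vector.
    """
    counts = [0] * len(CANONICAL_KMERS)
    seq = read.upper()
    for i in range(len(seq) - K + 1):
        idx = INDEX.get(canonical(seq[i:i + K]))
        if idx is not None:
            counts[idx] += 1
    total = sum(counts)
    if total == 0:
        return [0.0] * len(counts)
    return [c / total for c in counts]


if __name__ == "__main__":
    profile = tetranucleotide_profile("ACGTACGTTTGACCA")
    print(len(profile), round(sum(profile), 6))
```

Vectors of this kind, one per read, are the sort of fixed-length input that a dimensionality-reduction model could map to two dimensions for plotting.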