Fine-scale differentiation between Bacillus anthracis and Bacillus cereus group signatures in metagenome shotgun data

Bibliographic details
Published in: PeerJ (San Francisco, CA), 2018-08, Vol. 6, p. e5515, Article e5515
Main authors: Petit III, Robert A; Hogan, James M; Ezewudo, Matthew N; Joseph, Sandeep J; Read, Timothy D
Format: Article
Language: English
Online access: Full text
Description
Summary: It is possible to detect bacterial species in shotgun metagenome datasets through the presence of only a few sequence reads. However, false positive results can arise, as was the case in the initial findings of a recent New York City subway metagenome project. False positives are especially likely when two closely related species are present in the same sample. Bacillus anthracis, the etiologic agent of anthrax, is a high-consequence pathogen that shares >99% average nucleotide identity with Bacillus cereus group (BCerG) genomes. Our goal was to create an analysis tool that used k-mers to detect B. anthracis, incorporating information about the coverage of BCerG in the metagenome sample.

Using public complete genome sequence datasets, we identified a set of 31-mer signatures that differentiated B. anthracis from other members of the B. cereus group (BCerG), and another set that differentiated BCerG genomes (including B. anthracis) from other Bacillus strains. We also created a set of 31-mers for detecting the lethal factor gene, the key genetic diagnostic of the presence of anthrax-causing bacteria. We created synthetic sequence datasets based on existing genomes to test the accuracy of a k-mer based detection model.

We found 239,503 B. anthracis-specific 31-mers, 10,183 BCerG 31-mers, and 2,617 lethal factor k-mers. We showed that false positive B. anthracis k-mers, which arise from random sequencing errors, are observable at high genome coverages of B. cereus. We also showed that there is a "gray zone" below 0.184× coverage of the B. anthracis genome sequence, in which we cannot expect with high probability to identify lethal factor k-mers. We created a linear regression model to differentiate the presence of B. anthracis-like chromosomes from sequencing errors given the BCerG background coverage. We showed that while shotgun datasets from the New York City subway metagenome project had no matches to lethal factor k-mers and hence were negative for B. anthracis, some samples showed evidence of strains very closely related to the pathogen.

This work shows how extensive libraries of complete genomes can be used to create organism-specific signatures to help interpret metagenomes. We contrast "specialist" approaches to metagenome analysis, such as this work, with "generalist" software that seeks to classify all organisms present in the sample, and note the more general utility of a k-mer filter approach when taxonomic boundaries lack clarity or high levels of precision are required.
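
The k-mer filter idea described in the summary can be illustrated with a minimal Python sketch. It is not the authors' tool: the function names (`kmers`, `signature_kmers`, `count_hits`) are hypothetical, and the sketch assumes that a signature k-mer must occur in every target genome and in no background genome, and that k-mers are compared in canonical (strand-independent) form. The summary states only that 31-mers were used.

```python
from typing import Iterable, Set

K = 31  # k-mer length named in the summary

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer: str) -> str:
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    reverse_complement = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


def kmers(seq: str, k: int = K) -> Set[str]:
    """Collect canonical k-mers, skipping windows that contain non-ACGT bases."""
    seq = seq.upper()
    found = set()
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        if set(window) <= set("ACGT"):
            found.add(canonical(window))
    return found


def signature_kmers(targets: Iterable[str], background: Iterable[str],
                    k: int = K) -> Set[str]:
    """Assumed rule: k-mers shared by all target genomes and absent from the background."""
    target_sets = [kmers(genome, k) for genome in targets]
    if not target_sets:
        return set()
    shared = set.intersection(*target_sets)
    for genome in background:
        shared -= kmers(genome, k)
    return shared


def count_hits(reads: Iterable[str], signatures: Set[str], k: int = K) -> int:
    """Sum, over all reads, the number of distinct signature k-mers each read contains."""
    return sum(len(kmers(read, k) & signatures) for read in reads)
```

In practice the same `count_hits` call would be run once per signature set, so that hits to the target-specific set can be weighed against hits to the broader group set, which is the kind of background-aware comparison the summary describes. Real genome collections would also need a dedicated k-mer counter instead of in-memory Python sets.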
ISSN:2167-8359
DOI:10.7717/peerj.5515