Inference of Markovian Properties of Molecular Sequences from NGS Data and Applications to Comparative Genomics
Next Generation Sequencing (NGS) technologies generate large amounts of short read data for many different organisms. The fact that NGS reads are generally short makes it challenging to assemble the reads and reconstruct the original genome sequence. For clustering genomes using such NGS data, word-...
Gespeichert in:
Hauptverfasser: | , , , , , |
---|---|
Format: | Artikel |
Sprache: | eng |
Schlagworte: | |
Online-Zugang: | Volltext bestellen |
Tags: |
Tag hinzufügen
Keine Tags, Fügen Sie den ersten Tag hinzu!
|
Zusammenfassung: | Next Generation Sequencing (NGS) technologies generate large amounts of short
read data for many different organisms. The fact that NGS reads are generally
short makes it challenging to assemble the reads and reconstruct the original
genome sequence. For clustering genomes using such NGS data, word-count based
alignment-free sequence comparison is a promising approach, but for this
approach, the underlying expected word counts are essential.
A plausible model for this underlying distribution of word counts is given
through modelling the DNA sequence as a Markov chain (MC). For single long
sequences, efficient statistics are available to estimate the order of MCs and
the transition probability matrix for the sequences. As NGS data do not provide
a single long sequence, inference methods on Markovian properties of sequences
based on single long sequences cannot be directly used for NGS short read data.
Here we derive a normal approximation for such word counts. We also show that
the traditional Chi-square statistic has an approximate gamma distribution,
using the Lander-Waterman model for physical mapping. We propose several
methods to estimate the order of the MC based on NGS reads and evaluate them
using simulations. We illustrate the applications of our results by clustering
genomic sequences of several vertebrate and tree species based on NGS reads
using alignment-free sequence dissimilarity measures. We find that the
estimated order of the MC has a considerable effect on the clustering results,
and that the clustering results that use a MC of the estimated order give a
plausible clustering of the species. |
---|---|
DOI: | 10.48550/arxiv.1504.00959 |