Analysis of NGS data from heterogeneous HCV1b populations
Main authors: | , , , , , , |
---|---|
Format: | Other |
Language: | eng |
Summary: | Background: The analysis of next-generation sequencing (NGS) data from heterogeneous viral populations such as hepatitis C virus (HCV) remains a non-trivial exercise. To accommodate the specific characteristics of short-read fragment data derived from such diverse populations, we studied the impact of tailored analysis methodologies and of reference sequence divergence on the recovery of minority variants.
Materials and methods: Four samples from treatment-naïve, HCV1b-infected patients were amplified by nested PCR and sequenced on Illumina's Genome Analyzer IIx. Five software packages (MAQ, Bowtie, BWA, Velvet and Segminator II) were tested for aligning the quality-trimmed Illumina reads against a reference sequence. To assess the impact of using a related reference sequence, all reads of each sample were mapped to a published HCV1b reference sequence (GenBank: AB049087), to a contig of the same sample obtained by Sanger sequencing, and to an in silico reconstructed, data-specific reference sequence (VICUNA and V-FAT). Concordance was assessed between the viral population obtained by Sanger sequencing and that obtained by read mapping with Segminator II against each of the three reference sequences. Differences in minority variants between the analysis with a sample-specific sequence and the analysis with a more distantly related sequence were then examined. Positions with a large difference (>5% threshold) were compared with the location of highly divergent regions, defined as areas with Shannon entropy above the 75th percentile.
Results: The number of mapped reads was consistently higher when a sample-specific sequence was used instead of the standard but more divergent HCV1b reference sequence. Across the software packages, read recovery varied between 20% and 82%, with Segminator II performing best: 80.6% (± 3.0) against the sample-specific Sanger sequence, 77.3% (± 2.4) against the GenBank reference sequence (using the automatic cyclic remapping), and 82.1% (± 4.0) against the in silico reconstructed reference sequence. Simple scoring of aligned positions as matched/unmatched revealed a high degree of concordance between the NGS consensus sequences obtained by Segminator II (all three reference sequences) and the sample-specific sequence obtained with Sanger sequencing. After neglecting the minor discordances (predominantly attributable to ambiguity characters in the NGS consensus sequences), overall concordance increased to a nearly perf |
---|
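The comparison described in the methods (positions whose minority-variant frequencies differ by more than 5% between two mappings, set against regions with Shannon entropy above the 75th percentile) can be illustrated with a minimal Python sketch. Everything here is an assumption for illustration rather than the authors' pipeline: the per-position base frequencies are invented toy values, entropy is computed per position instead of per area, it is computed on the sample-specific mapping, and the function names `shannon_entropy`, `divergent_positions` and `discordant_positions` are hypothetical.

```python
import math
from statistics import quantiles

BASES = "ACGT"


def shannon_entropy(freqs):
    """Shannon entropy in bits of one position, given its base frequencies."""
    return -sum(p * math.log2(p) for p in freqs.values() if p > 0)


def divergent_positions(columns):
    """Indices whose entropy lies above the 75th percentile of all positions."""
    entropies = [shannon_entropy(c) for c in columns]
    # Third quartile; the 'inclusive' interpolation method is an assumption.
    threshold = quantiles(entropies, n=4, method="inclusive")[2]
    return {i for i, h in enumerate(entropies) if h > threshold}


def discordant_positions(cols_a, cols_b, cutoff=0.05):
    """Indices where any base frequency differs by more than `cutoff`."""
    return {
        i
        for i, (a, b) in enumerate(zip(cols_a, cols_b))
        if max(abs(a.get(x, 0.0) - b.get(x, 0.0)) for x in BASES) > cutoff
    }


# Toy base frequencies per position (invented, not taken from the study).
sample_specific = [
    {"A": 1.0},
    {"A": 0.90, "G": 0.10},
    {"C": 0.50, "T": 0.50},
    {"G": 1.0},
    {"T": 0.97, "C": 0.03},
    {"A": 0.70, "C": 0.30},
]
distant_reference = [
    {"A": 1.0},
    {"A": 0.98, "G": 0.02},
    {"C": 0.42, "T": 0.58},
    {"G": 1.0},
    {"T": 0.96, "C": 0.04},
    {"A": 0.72, "C": 0.28},
]

divergent = divergent_positions(sample_specific)
discordant = discordant_positions(sample_specific, distant_reference)

# Discordant positions that also fall in high-entropy positions.
overlap = divergent & discordant
print(sorted(divergent), sorted(discordant), sorted(overlap))
```

In real data the frequency tables would come from the pileup of each mapping, and the entropy would more plausibly be summarised over sliding windows, which this sketch does not attempt.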