RFGR: Repeat Finder for Complete and Assembled Whole Genomes and NGS Reads


Bibliographic details
Published in: Biochemical genetics 2024-10, Vol.62 (5), p.4157-4173
Main authors: Sukumaran, Rashmi, Shahina, K., Nair, Achuthsankar S.
Format: Article
Language: English
Description
Summary: Repetitive DNA sequences cause genomic instability and are important genetic markers. Identification of repeats is a critical step in genome annotation and analysis. On the other hand, repeats also pose a technical challenge for genome assembly and alignment programs using NGS data. RFGR is a comprehensive tool that can find exact repetitive sequences in complete genomes and assembled genomes, as well as in NGS reads, of prokaryotes. For complete genomes, RFGR uses suffix trees to find seed repeats of repetitive sequences of fixed length with indels. For assembled genomes, RFGR uses a modified Bowtie aligner to find seed repeats of exact repetitive sequences in the contigs/scaffolds, which are then extended to maximal repeats. The repeats are classified, and for repeats near a gene, RFGR also reports the gene. For the control datasets of E. coli UTI89 and E. coli K12, RFGR reports 35,141 and 49,352 repeats, respectively. For NGS reads, RFGR uses the frequency of repetitive k-mers to identify FASTQ reads containing repetitive sequences and removes them from the dataset. An E. coli K12 NGS dataset pre-processed using RFGR gives an improved assembly compared with the original dataset: the N50 value improves by 22.86%, and the size of the assembly graph decreases by nearly 50%. Thus, with RFGR, a better assembly is achieved with reduced computation. RFGR can be improved in terms of the minimum repeat length it finds, and extended to find approximate repeats and to handle eukaryotes as well.
ISSN:0006-2928
1573-4927
DOI:10.1007/s10528-023-10628-x
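
The summary describes removing NGS reads that contain repetitive sequences, based on the frequency of repetitive k-mers. A minimal Python sketch of that general idea follows. It is not RFGR's implementation: the function names, the thresholds `min_count` and `max_fraction`, and the rule of counting k-mers across the read set itself are all assumptions made for illustration.

```python
from collections import Counter


def kmers(seq, k):
    """Yield every overlapping k-mer of seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def count_kmers(reads, k):
    """Count how often each k-mer occurs across all reads."""
    counts = Counter()
    for read in reads:
        counts.update(kmers(read, k))
    return counts


def filter_repetitive_reads(reads, k=21, min_count=3, max_fraction=0.5):
    """Return the reads judged non-repetitive.

    Hypothetical rule (an assumption, not taken from the article):
    a k-mer is "repetitive" if it occurs at least min_count times in
    the whole read set, and a read is dropped when more than
    max_fraction of its k-mers are repetitive.
    """
    counts = count_kmers(reads, k)
    kept = []
    for read in reads:
        read_kmers = list(kmers(read, k))
        if not read_kmers:
            # Read shorter than k: nothing to judge, keep it.
            kept.append(read)
            continue
        repetitive = sum(1 for km in read_kmers if counts[km] >= min_count)
        if repetitive / len(read_kmers) <= max_fraction:
            kept.append(read)
    return kept
```

In practice the reads would come from a FASTQ file and the thresholds would depend on sequencing depth, since at high coverage even unique genomic k-mers occur many times; the sketch takes plain sequence strings to stay self-contained.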