Refseq Test Subsets for Frame Classification with and without Errors

Bibliographic details
Main authors: Voigt, Benjamini; Fischer, Oliver; Krumnow, Christian; Herta, Christian; Dabrowski, Piotr Wojtek
Format: Dataset
Language: English
Description
Summary: These test files extend the 'Refseq datasets for training frame classification' dataset. They provide the original test file and three variations containing erroneous sequences to simulate realistic data. The data is based on randomly selected viral and bacterial genomes and the human193(GRCh38.p13) reference genome, which was downloaded from GenBank. From each original nucleic acid sequence, we created multiple patches of length 300 in all possible reading frames using a sliding window on the initial sequence and its reverse complement.

The data is stored in FASTA format according to the following convention:

    >{ID}_subsequence{patch index}_frame{frame index}|{class marker}|{frame index}
    sequence

with

ID - the RefSeq accession of the original sequence in the Refseq dataset
sequence - nucleic acid sequence patch of length 300 or 250
patch index - the starting triplet of the given patch within the original sequence or the reverse-complemented sequence (i.e. 3 * patch index is the starting index of frame 0 in the original sequence)
class marker - the taxonomic domain
    0 - virus
    1 - bacteria
    2 - human / mammal
frame index - the reading frame
    0 - on-frame
    1 - shifted by one
    2 - shifted by two
    3 - reverse complemented
    4 - shifted by one and reverse complemented
    5 - shifted by two and reverse complemented

Each file contains 212,618 patches per frame.
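The header convention described above can be read with a few lines of Python. The following is a minimal sketch, not code shipped with the dataset: the names `Patch`, `parse_header` and `read_patches` are hypothetical, and the regular expression assumes that the header fields appear exactly as described (the accession may itself contain underscores, so it is matched greedily up to `_subsequence`).

```python
import re
from typing import Iterable, Iterator, NamedTuple

# Header layout taken from the record description:
# >{ID}_subsequence{patch index}_frame{frame index}|{class marker}|{frame index}
HEADER_RE = re.compile(
    r"^>(?P<acc>.+)_subsequence(?P<patch>\d+)_frame(?P<frame>[0-5])"
    r"\|(?P<cls>[0-2])\|(?P<frame2>[0-5])$"
)

CLASS_NAMES = {0: "virus", 1: "bacteria", 2: "human / mammal"}


class Patch(NamedTuple):
    """One FASTA record (hypothetical container, not part of the dataset)."""
    accession: str
    patch_index: int
    frame: int
    class_marker: int
    sequence: str

    @property
    def is_reverse_complement(self) -> bool:
        # Frames 3-5 are the reverse-complemented counterparts of frames 0-2.
        return self.frame >= 3

    @property
    def shift(self) -> int:
        # Offset of 0, 1 or 2 nucleotides relative to the on-frame position.
        return self.frame % 3


def parse_header(line: str):
    """Split one header line into (accession, patch index, frame, class)."""
    m = HEADER_RE.match(line.strip())
    if m is None:
        raise ValueError(f"unexpected header: {line!r}")
    if m["frame"] != m["frame2"]:
        raise ValueError(f"inconsistent frame indices in header: {line!r}")
    return m["acc"], int(m["patch"]), int(m["frame"]), int(m["cls"])


def read_patches(lines: Iterable[str]) -> Iterator[Patch]:
    """Yield Patch records from an iterable of FASTA lines."""
    header = None
    chunks = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield Patch(*header, "".join(chunks))
            header = parse_header(line)
            chunks = []
        else:
            if header is None:
                raise ValueError("sequence data before first header")
            chunks.append(line)
    if header is not None:
        yield Patch(*header, "".join(chunks))


# Made-up record for illustration; the accession and sequence are not
# taken from the dataset.
example = [">NC_000001.11_subsequence12_frame4|2|4", "ACGT", "ACGT"]
for patch in read_patches(example):
    print(patch.accession, CLASS_NAMES[patch.class_marker],
          patch.is_reverse_complement, patch.shift, 3 * patch.patch_index)
```

To read a real file, pass an open file handle to `read_patches`, since iterating over a handle yields its lines. The value `3 * patch_index` printed at the end is the starting index of frame 0 in the original (or reverse-complemented) sequence, as stated in the description.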
DOI:10.5281/zenodo.5549619