NASCUP: Nucleic Acid Sequence Classification by Universal Probability

Bibliographic Details
Published in:	IEEE Access, 2021, Vol. 9, pp. 162779-162791
Main authors:	Kwon, Sunyoung; Kim, Gyuwan; Lee, Byunghan; Chun, Jongsik; Yoon, Sungroh; Kim, Young-Han
Format:	Article
Language:	English
Subjects:	Bioinformatics; Classification; Context modeling; context-tree models; Data analysis; Data models; Hidden Markov models; Information theory; Markov processes; Maximum likelihood estimation; Microorganisms; Nucleic acids; Nucleotides; Outliers (statistics); Probability; Ribonucleic acid; RNA; sequence classification; Statistical analysis; Taxonomy; universal probability
Online access:	Full text

Description
Summary:	Nucleic acid sequence classification is a fundamental task in bioinformatics. As the volume of unlabeled nucleotide sequences grows, fast and accurate large-scale classification has become crucial. In this work, we developed NASCUP, a new classification method that captures the statistical structure of nucleotide sequences using compact context-tree models and universal probability from information theory. A comprehensive experimental study involving nine public databases for functional non-coding RNA, microbial taxonomy, and coding/non-coding RNA classification demonstrates the advantages of NASCUP over widely used alternatives in efficiency, accuracy, and scalability across all datasets considered. NASCUP consistently achieved BLAST-like classification accuracy on several large-scale databases with runtime reduced by orders of magnitude, and was also applied to other bioinformatics tasks such as outlier detection and synthetic sequence generation.
ISSN:	2169-3536
DOI:	10.1109/ACCESS.2021.3127957
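
Illustration (not part of the catalog record): the summary describes classification by universal probability over context models. The sketch below shows the general idea only and is not the NASCUP implementation. It assumes fixed-depth contexts in place of the paper's context-tree models and uses the Krichevsky-Trofimov (KT) estimator as one standard universal probability assignment. All names (`train_counts`, `log2_prob`, `classify`, the `depth` parameter, and the toy class labels) are hypothetical.

```python
import math

ALPHABET = "ACGT"
ZERO = (0, 0, 0, 0)  # counts for a context never seen in training


def train_counts(sequences, depth=3):
    """Count next-symbol occurrences for every length-`depth` context."""
    counts = {}
    for seq in sequences:
        for i in range(depth, len(seq)):
            ctx, sym = seq[i - depth:i], seq[i]
            # Skip windows containing ambiguity codes such as N.
            if sym not in ALPHABET or any(c not in ALPHABET for c in ctx):
                continue
            counts.setdefault(ctx, [0] * len(ALPHABET))[ALPHABET.index(sym)] += 1
    return counts


def log2_prob(seq, counts, depth=3):
    """Sequential KT log2-probability of `seq` given one class's counts."""
    seen = {}  # counts contributed by `seq` itself; the class model is not mutated
    total = 0.0
    for i in range(depth, len(seq)):
        ctx, sym = seq[i - depth:i], seq[i]
        if sym not in ALPHABET or any(c not in ALPHABET for c in ctx):
            continue
        base = counts.get(ctx, ZERO)
        own = seen.setdefault(ctx, [0] * len(ALPHABET))
        j = ALPHABET.index(sym)
        n_sym = base[j] + own[j]
        n_all = sum(base) + sum(own)
        # KT estimator: add 1/2 to each of the four symbol counts.
        total += math.log2((n_sym + 0.5) / (n_all + 0.5 * len(ALPHABET)))
        own[j] += 1
    return total


def classify(seq, models, depth=3):
    """Return the label whose model assigns `seq` the highest probability."""
    return max(models, key=lambda label: log2_prob(seq, models[label], depth))


# Toy training data, invented for illustration only.
training = {
    "at_rich": ["ATATATATATTTAATATATA", "TTATATAATATATTATAT"],
    "gc_rich": ["GCGCGGCCGCGCGGCGCC", "CCGCGCGGCGCGCCGGCG"],
}
models = {label: train_counts(seqs) for label, seqs in training.items()}
print(classify("ATATTATATAAT", models))
```

Scoring needs only one pass over the query per class, with no alignment step. A full context-tree approach would additionally mix over context depths rather than fixing `depth` in advance.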