Molecular sequence classification using efficient kernel based embedding

Detailed description

Bibliographic details
Published in:	Information sciences 2024-09, Vol.679, p.121100, Article 121100
Main authors:	Ali, Sarwan; Ali, Tamkanat E.; Murad, Taslim; Mansoor, Haris; Patterson, Murray
Format:	Article
Language:	English
Subjects:	k-mers; Kernel matrix; Molecular sequence classification; Protein subcellular localization
Online access:	Full text

Description
Abstract:	The alarming spread of diseases across the globe has become a major concern for global healthcare agencies. The research community is actively involved in inventing better and more efficient ways of detecting and treating diseases to solve this global challenge. The abundance of molecular sequence data has eased the path for researchers to develop Machine Learning (ML) based solutions. The performance of the ML models used to classify molecular sequences depends heavily on the type of embedding used to obtain an appropriate numerical representation of the molecular sequences. In recent years, many embedding approaches have been introduced for molecular sequence analysis. However, there is still a need for improvement as far as the efficiency of the methods is concerned (i.e., the ability to capture pairwise relationships and patterns effectively, which could affect the classification performance). To provide a solution to this problem, we propose an efficient kernel-based technique for embedding generation from molecular sequences, which involves computing a kernel matrix using the Sinkhorn-Knopp algorithm and the normalized pairwise distances between k-mers in a manner that satisfies the constraints of a probability distribution. Further, kernel principal component analysis (PCA) is applied to get the top PCs, which are then used as the final embedding. As a result of the experiments, we obtained an ROC-AUC score of 0.657 for our method, which is higher than the scores obtained using baselines. This clearly shows that the low-dimensional embedding obtained through the proposed approach provides an efficient and effective solution for molecular sequence analysis.
ISSN:	0020-0255 1872-6291
DOI:	10.1016/j.ins.2024.121100
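
The pipeline named in the abstract (k-mers, normalized pairwise distances, a Sinkhorn-Knopp-scaled kernel matrix, then kernel PCA) can be illustrated with a minimal sketch. The abstract does not give the exact construction, so everything specific here is an assumption made for illustration and not the authors' implementation: each sequence is represented by its k-mer count vector, distances are Euclidean and divided by their maximum, the kernel is `exp(-d^2)`, and the function names (`kmer_spectrum`, `sinkhorn_knopp`, `kernel_pca`, `embed`) and default parameters are hypothetical.

```python
from itertools import product

import numpy as np


def kmer_spectrum(seq, k, alphabet):
    """Count vector over all |alphabet|^k possible k-mers (assumed representation)."""
    index = {"".join(p): i for i, p in enumerate(product(alphabet, repeat=k))}
    vec = np.zeros(len(index))
    for i in range(len(seq) - k + 1):
        j = index.get(seq[i:i + k])
        if j is not None:  # skip k-mers containing characters outside the alphabet
            vec[j] += 1
    return vec


def sinkhorn_knopp(K, n_iter=1000, tol=1e-9):
    """Alternately rescale rows and columns of a positive matrix until every
    row and every column sums to 1 (a doubly stochastic matrix)."""
    P = K.astype(float).copy()
    for _ in range(n_iter):
        P /= P.sum(axis=1, keepdims=True)  # row normalization
        P /= P.sum(axis=0, keepdims=True)  # column normalization
        if np.abs(P.sum(axis=1) - 1.0).max() < tol:
            break
    return P


def kernel_pca(K, n_components):
    """Kernel PCA on a precomputed kernel: double-center, eigendecompose,
    and return the projections onto the top components."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    Kc = H @ K @ H
    Kc = (Kc + Kc.T) / 2.0  # remove small asymmetries before eigh
    vals, vecs = np.linalg.eigh(Kc)
    order = np.argsort(vals)[::-1][:n_components]
    top_vals = np.clip(vals[order], 0.0, None)
    return vecs[:, order] * np.sqrt(top_vals)


def embed(seqs, k=3, alphabet="ACGT", n_components=2):
    """Sequences -> k-mer spectra -> normalized distances -> kernel
    -> Sinkhorn-Knopp scaling -> kernel PCA embedding."""
    X = np.array([kmer_spectrum(s, k, alphabet) for s in seqs])
    sq = (X ** 2).sum(axis=1)
    D = np.sqrt(np.clip(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0, None))
    if D.max() > 0:
        D = D / D.max()  # normalize distances to [0, 1]
    K = np.exp(-D ** 2)  # assumed Gaussian-style kernel on the distances
    P = sinkhorn_knopp(K)
    return P, kernel_pca(P, n_components)
```

Calling `embed(["ACGTACGTAC", "TTTTGGGGTT", ...], k=2)` returns the scaled kernel matrix and one low-dimensional vector per sequence, which could then be passed to any standard classifier. Scaling the kernel so that rows and columns each sum to 1 is one reading of "satisfies the constraints of a probability distribution"; the paper may define this step differently.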