Molecular sequence classification using efficient kernel based embedding
Published in: | Information sciences 2024-09, Vol.679, p.121100, Article 121100 |
---|---|
Main authors: | , , , , |
Format: | Article |
Language: | eng |
Subjects: | |
Online access: | Full text |
Summary: | The alarming spread of diseases across the globe has become a major concern for global healthcare agencies. The research community is actively working on better and more efficient ways of detecting and treating diseases to address this global challenge. The abundance of molecular sequence data has eased the path for researchers to develop Machine Learning (ML) based solutions. The performance of the ML models used to classify molecular sequences depends heavily on the type of embedding used to obtain an appropriate numerical representation of the molecular sequences. In recent years, many embedding approaches have been introduced for molecular sequence analysis. However, there is still room for improvement in the efficiency of these methods (i.e., their ability to capture pairwise relationships and patterns effectively, which can affect classification performance). To address this problem, we propose an efficient kernel-based technique for embedding generation from molecular sequences, which involves computing a kernel matrix using the Sinkhorn-Knopp algorithm and the normalized pairwise distances between k-mers in a manner that satisfies the constraints of a probability distribution. Further, kernel principal component analysis (PCA) is applied to obtain the top principal components (PCs), which are then used as the final embedding. In the experiments, our method obtained an ROC-AUC score of 0.657, which is higher than the scores obtained using the baselines. This shows that the low-dimensional embedding obtained through the proposed approach provides an efficient and effective solution for molecular sequence analysis. |
---|---|
ISSN: | 0020-0255 1872-6291 |
DOI: | 10.1016/j.ins.2024.121100 |
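
The pipeline outlined in the summary (k-mers, normalized pairwise distances, Sinkhorn-Knopp normalization, kernel PCA) might be sketched roughly as follows. This is an illustrative reading of the abstract only, not the authors' implementation: the function names `kmer_spectrum`, `sinkhorn_knopp`, and `embed`, the DNA alphabet, the Gaussian kernel with bandwidth `sigma`, and the choice to measure distances between per-sequence k-mer count vectors are all assumptions introduced here, and the paper may define each of these steps differently.

```python
from itertools import product

import numpy as np


def kmer_spectrum(seq, k=3, alphabet="ACGT"):
    """Count vector over all len(alphabet)**k possible k-mers (assumed representation)."""
    index = {"".join(p): i for i, p in enumerate(product(alphabet, repeat=k))}
    counts = np.zeros(len(index))
    for i in range(len(seq) - k + 1):
        j = index.get(seq[i:i + k])
        if j is not None:  # skip k-mers containing symbols outside the alphabet
            counts[j] += 1
    return counts


def sinkhorn_knopp(K, n_iter=200, tol=1e-9):
    """Alternately rescale rows and columns of a positive matrix
    until every row and column sums to (approximately) 1."""
    P = np.array(K, dtype=float)
    for _ in range(n_iter):
        P /= P.sum(axis=1, keepdims=True)  # rows sum to 1
        P /= P.sum(axis=0, keepdims=True)  # columns sum to 1
        if np.abs(P.sum(axis=1) - 1).max() < tol:
            break
    return P


def embed(seqs, k=3, n_components=2, sigma=1.0):
    """Hypothetical end-to-end sketch: sequences -> low-dimensional embedding."""
    X = np.array([kmer_spectrum(s, k) for s in seqs])
    # Pairwise Euclidean distances, scaled into [0, 1].
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    if D.max() > 0:
        D = D / D.max()
    K = np.exp(-D ** 2 / (2 * sigma ** 2))  # assumed Gaussian kernel
    P = sinkhorn_knopp(K)
    P = (P + P.T) / 2  # remove residual asymmetry from the iteration
    # Kernel PCA: double-center the kernel, then keep the top eigenvectors.
    n = len(seqs)
    J = np.eye(n) - np.ones((n, n)) / n
    vals, vecs = np.linalg.eigh(J @ P @ J)
    order = np.argsort(vals)[::-1][:n_components]
    return vecs[:, order] * np.sqrt(np.clip(vals[order], 0, None))


seqs = ["ACGTACGT", "ACGTACGA", "TTTTGGGG", "TTTTGGGC"]  # toy data, not from the paper
emb = embed(seqs)
print(emb.shape)
```

The rows of `emb` could then be passed to any standard classifier. The Sinkhorn-Knopp step is what makes each row and column of the kernel matrix sum to 1, which is one way to read the abstract's "constraints of a probability distribution".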